GNU bug report logs - #23763
Bug report: Grep stops, if a text file contains a null character after 32768 bytes

Package: grep;

Reported by: Bjoern Voigt <bjoernv <at> arcor.de>

Date: Mon, 13 Jun 2016 19:55:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 23763 in the body.
You can then email your comments to 23763 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-grep <at> gnu.org:
bug#23763; Package grep. (Mon, 13 Jun 2016 19:55:02 GMT)

Acknowledgement sent to Bjoern Voigt <bjoernv <at> arcor.de>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 13 Jun 2016 19:55:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Bjoern Voigt <bjoernv <at> arcor.de>
To: bug-grep <at> gnu.org
Subject: Bug report: Grep stops, if a text file contains a null character
 after 32768 bytes
Date: Mon, 13 Jun 2016 21:45:30 +0200
Grep shows a bug, if it processes a text file with at least one embedded
0 (ASCII zero) character after byte 32768. Grep stops with the error
message "Binary file testfile.txt matches" and exit code 0. The error
message is written to standard output. Any line after the 0 character is
silently ignored in output.

Environment:
- grep-2.25
- no patches, no "configure" options
- openSUSE Tumbleweed 20160611 x86_64; glibc 2.23; libpcre 8.38

I first saw this bug when I tried to filter out a line from the output of
the MySQL backup utility "mysqldump". Because grep stopped at the 0
character, the backups were incomplete.

# mysqldump --all-databases | grep -v '^-- Dump completed on'
[... around 240 lines of SQL output ...]
LOCK TABLES `PartTable` WRITE;
/*!40000 ALTER TABLE `PartTable` DISABLE KEYS */;
Binary file (standard input) matches
mysqldump: Got errno 32 on write

I found that the mysqldump output contains 0 characters in table PartTable.

I wrote the following test script, which shows the bug without a
dependency on MySQL:
--------------------------------------------------------
#!/bin/bash

testfile="testfile.txt"

# write a text file large enough (16384 lines is
# the minimum number for this test case)
for((i=1;i<=16384;i++)) do echo "A"; done > $testfile

# write a zero byte
echo -e '\0' >> $testfile

# write an end line
echo -e 'A ... the end' >> $testfile

# verify the file contents
ls -l $testfile
tail -n 10 $testfile

# use 'grep' to find all lines with the string "A"
grep "A" $testfile

# the last line is missing, the output ends with
# "Binary file testfile.txt matches"

# check the exit code
echo "Exit code of grep:" $?
--------------------------------------------------------

The last line "A ... the end" is missing in output of grep. The exit
code is 0:

# ./null-bug-testcase.txt
[...]
A
A
A
Binary file testfile.txt matches
Exit code of grep: 0

I also found this bug in older grep versions (e.g. Ubuntu 14.04; grep 2.16).

FreeBSD's version of grep (tested with 2.5.1-FreeBSD under FreeBSD
10.3-RELEASE-p4) does not show the bug:

#./null-bug-testcase.txt
[...]
A
A
A
A ... the end
Exit code of grep: 0

Regards,
Björn




Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Mon, 13 Jun 2016 20:02:01 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Mon, 13 Jun 2016 20:02:02 GMT)

Notification sent to Bjoern Voigt <bjoernv <at> arcor.de>:
bug acknowledged by developer. (Mon, 13 Jun 2016 20:02:02 GMT)

Message #12 received at 23763-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Bjoern Voigt <bjoernv <at> arcor.de>, 23763-done <at> debbugs.gnu.org
Subject: Re: bug#23763: Bug report: Grep stops, if a text file contains a null
 character after 32768 bytes
Date: Mon, 13 Jun 2016 14:01:28 -0600
tag 23763 notabug
thanks

On 06/13/2016 01:45 PM, Bjoern Voigt wrote:
> Grep shows a bug, if it processes a text file with at least one embedded
> 0 (ASCII zero) character after byte 32768.

Thanks for the report.  However, this is not a bug in grep, but
documented behavior.  By definition, a text file CANNOT contain NUL
bytes; any file with NUL characters is a binary file.  You can still
make grep process it as a text file, but only with the '-a' flag.

> Grep stops with the error
> message "Binary file testfile.txt matches" and exit code 0. The error
> message is written to standard output. Any line after the 0 character is
> silently ignored in output.

POSIX allows this behavior, in that it says that grep's behavior is
undefined on non-text files (which you have by virtue of your NUL byte).

Since this is documented behavior of GNU grep when -a is not used, I'm
closing this as not a bug. But feel free to add further comments to this
thread.
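
As a rough sketch of the -a workaround described in this reply (the file layout follows the reporter's test script; the variable names are made up for illustration, and GNU grep is assumed):

```shell
#!/bin/sh
# Build a file like the one in the report: more than 32768 bytes of text,
# then a line holding a single NUL byte, then a final text line.
testfile=$(mktemp)
yes A | head -n 16384 > "$testfile"
printf '\0\n' >> "$testfile"
printf 'A ... the end\n' >> "$testfile"

# -a (--text) makes grep treat the input as text, so matching continues
# past the NUL byte instead of ending with a "Binary file ... matches" notice.
last_match=$(grep -a 'A' "$testfile" | tail -n 1)
match_count=$(grep -a -c 'A' "$testfile")

echo "last match: $last_match"
echo "matching lines: $match_count"
rm -f "$testfile"
```

With -a, the final line of the file should appear among the matches, which is the behaviour the reporter expected from the default mode.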

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-grep <at> gnu.org:
bug#23763; Package grep. (Mon, 13 Jun 2016 20:53:02 GMT)

Message #15 received at 23763 <at> debbugs.gnu.org:

From: Bjoern Voigt <bjoernv <at> arcor.de>
To: 23763 <at> debbugs.gnu.org
Cc: Eric Blake <eblake <at> redhat.com>
Subject: Re: bug#23763: Bug report: Grep stops, if a text file contains a null
 character after 32768 bytes
Date: Mon, 13 Jun 2016 22:52:38 +0200
Eric Blake wrote:
> POSIX allows this behavior, in that it says that grep's behavior is
> undefined on non-text files (which you have by virtue of your NUL
> byte). Since this is documented behavior of GNU grep when -a is not
> used, I'm closing this as not a bug. But feel free to add further
> comments to this thread. 
If I start grep with the "-a" option or "--binary-files=text", the bug does
not show up.

"grep --binary-files=binary" which is the default shows the bug.

I am relatively sure, that the auto guessing code is incorrect or
limited, if a null character is found after 32KB. The manual page says
about the auto guessing code:

       -U, --binary
              Treat the file(s) as binary.  By default, under MS-DOS and
              MS-Windows, grep guesses the file type by looking at the
              contents of the first 32KB read from the file.  If grep
              decides the file is a text file, it strips the CR characters
              from the original file contents (to make regular expressions
              with ^ and $ work correctly).  Specifying -U overrules this
              guesswork, causing all files to be read and passed to the
              matching mechanism verbatim; if the file is a text file with
              CR/LF pairs at the end of each line, this will cause some
              regular expressions to fail.  This option has no effect on
              platforms other than MS-DOS and MS-Windows.

I see these problems:

 1. The binary mode is implemented inconsistently. It would be acceptable
    if grep produced no output (no match, exit code >0) or exactly one
    output line ("Binary file testfile.txt matches", exit code 0). It is
    not acceptable that grep writes some matching text lines, then writes
    "Binary file testfile.txt matches", and exits with code 0.
 2. Linux users, or more precisely non-MS-DOS and non-MS-Windows users,
    will overlook the auto-guessing section in the manual page because of
    the notes "By default, under MS-DOS and MS-Windows, grep guesses the
    file type by looking at the contents of the first 32KB read from
    the file." and "This option has no effect on platforms other than
    MS-DOS and MS-Windows."
 3. The auto-guessing mechanism is not documented anywhere else in the
    documentation.
 4. The auto-guessing limitations are somehow documented in the manual
    page, but not in the BUGS section.
 5. The exit code should not be 0 if grep finds an error in the input
    from which it cannot recover.
 6. The error message "Binary file testfile.txt matches" must not be
    written on standard output, if matching text lines are written before.
 7. POSIX defines minimal assurances for grep. Of course GNU grep can or
    should be better.
 8. Other implementations (like the tested FreeBSD version) do not show
    the bug. Also busybox works correctly.
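
The "all or nothing" output asked for in point 1 can be approximated outside grep. The following is only a sketch under stated assumptions: has_nul and grep_all_or_nothing are hypothetical helper names, not part of grep, and the check looks only for NUL bytes, whereas GNU grep may also classify input as binary for other reasons (for example, encoding errors).

```shell
#!/bin/sh
# has_nul FILE: succeed (exit status 0) if FILE contains at least one NUL byte.
# It compares the file's byte count with the count after tr deletes all NULs.
has_nul() {
    total=$(wc -c < "$1")
    without_nul=$(tr -d '\000' < "$1" | wc -c)
    [ "$total" -ne "$without_nul" ]
}

# grep_all_or_nothing PATTERN FILE (hypothetical wrapper):
# decide text versus binary once, before any output is produced.
grep_all_or_nothing() {
    pattern=$1
    file=$2
    if has_nul "$file"; then
        # Binary input: print at most one notice line, never a mix of
        # matching lines followed by a notice.
        if grep -a -q -- "$pattern" "$file"; then
            echo "Binary file $file matches"
        else
            return 1
        fi
    else
        grep -- "$pattern" "$file"
    fi
}
```

Called as `grep_all_or_nothing A testfile.txt`, the wrapper reads the file twice, which is the price of deciding before the first match is printed.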





Information forwarded to bug-grep <at> gnu.org:
bug#23763; Package grep. (Mon, 13 Jun 2016 22:20:02 GMT)

Message #18 received at 23763 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Bjoern Voigt <bjoernv <at> arcor.de>, 23763 <at> debbugs.gnu.org
Subject: Re: bug#23763: Bug report: Grep stops, if a text file contains a null
 character after 32768 bytes
Date: Mon, 13 Jun 2016 15:19:10 -0700
On 06/13/2016 01:52 PM, Bjoern Voigt wrote:
> The manual page says
> about the auto guessing code:

That's a typo in the man page, and I installed the attached patch to fix 
it. This should address the first four points you mentioned. As for the 
remaining points, grep does not consider binary data to be an error. 
Although there is a judgment call as to whether a matching-lines 
notification should be sent to stdout or stderr when input contains 
binary data, grep has been behaving this way for some time (GNU diff 
even longer) and it would be a hassle to change it at this point.

For GNU grep, you should be able to work around the issue by using the 
-a option. Other grep implementations may or may not work; in my 
experience, sending NUL bytes to them can sometimes make them dump core 
or artificially truncate their output.

[0001-doc-remove-obsolete-MS-DOS-mention.patch (application/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#23763; Package grep. (Tue, 14 Jun 2016 08:27:02 GMT)

Message #21 received at 23763 <at> debbugs.gnu.org:

From: Bjoern Voigt <bjoernv <at> arcor.de>
To: 23763 <at> debbugs.gnu.org
Cc: Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#23763: Bug report: Grep stops, if a text file contains a null
 character after 32768 bytes
Date: Tue, 14 Jun 2016 10:26:09 +0200
What about the inconsistent output?

Grep should not print a mixture of text matches and then exit with a
binary match and exit code 0:

# ./null-bug-testcase.txt
[...]
A
A
A
Binary file testfile.txt matches
Exit code of grep: 0

This is clearly a bug in my eyes.

Would a patch that fixes this inconsistency be welcome? I am currently
analyzing grep in debugging sessions.




Information forwarded to bug-grep <at> gnu.org:
bug#23763; Package grep. (Tue, 14 Jun 2016 16:28:01 GMT)

Message #24 received at 23763 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Bjoern Voigt <bjoernv <at> arcor.de>, 23763 <at> debbugs.gnu.org
Subject: Re: bug#23763: Bug report: Grep stops, if a text file contains a null
 character after 32768 bytes
Date: Tue, 14 Jun 2016 09:27:41 -0700
Bjoern Voigt wrote:
> This is clearly a bug in my eyes.

The behavior conforms to grep's spec, so it's not a bug in that sense. I don't 
offhand see a behavior change that wouldn't cause worse problems elsewhere. 
Unless you were thinking of adding an option?




Information forwarded to bug-grep <at> gnu.org:
bug#23763; Package grep. (Tue, 14 Jun 2016 20:11:01 GMT)

Message #27 received at 23763 <at> debbugs.gnu.org:

From: Bjoern Voigt <bjoernv <at> arcor.de>
To: 23763 <at> debbugs.gnu.org
Cc: Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#23763: Bug report: Grep stops, if a text file contains a null
 character after 32768 bytes
Date: Tue, 14 Jun 2016 22:10:27 +0200
Paul Eggert wrote:
> Bjoern Voigt wrote:
>> This is clearly a bug in my eyes.
>
> The behavior conforms to grep's spec, so it's not a bug in that sense.
> I don't offhand see a behavior change that wouldn't cause worse
> problems elsewhere. Unless you were thinking of adding an option?
The current manual page patched with
"0001-doc-remove-obsolete-MS-DOS-mention-2.patch" says:

--binary-files=TYPE
  If the first few bytes of a file indicate that the file
  contains binary data, assume that the file is of type TYPE.  By
  default, TYPE is binary, and grep normally outputs either a
  one-line message saying that a binary file matches, or no
  message if there is no match.  If TYPE is without-match, grep
  assumes that a binary file does not match; this is equivalent
  to the -I option.  If TYPE is text, grep processes a binary
  file as if it were text; this is equivalent to the -a option.
  When processing binary data, grep may treat non-text bytes as
  line terminators; for example, the pattern '.'
  (period) might not match a null byte, as the null byte might be
  treated as a line terminator.  Warning: grep
  --binary-files=text might output binary garbage, which can have
  nasty side effects if the output is a terminal and if the
  terminal driver interprets some of it as commands.

My test case, where a file starts with more than 32KB of text data and
continues with text data containing at least one embedded 0 character
(which makes this binary data), is undocumented.

Consequently, I am probably looking for a new option
"--binary-files=auto", which should also become the default at some later
point.

For files it should work as follows:

--binary-files=auto
If the first few bytes of a file indicate that the file
contains binary data, assume that the file is of type binary.
Otherwise assume that the file is of type text.

Since the behavior of --binary-files=binary for my test case is
undocumented, and since the output is more or less useless except for the
fact that some non-printable characters are suppressed on the terminal, it
would also be an option to change the --binary-files=binary mode in the
code and in the manual page.

For files as input data this is easy to implement. But I haven't
checked, how --binary-files should work with standard input. The
decision binary or text should be made there before the first match is
printed.

My MySQL mysqldump problem can be solved with --text or
--binary-files=text. So I do not search a quick solution anymore.
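
A minimal sketch of that workaround on standard input. The fake_dump function is a made-up stand-in for mysqldump, its sample rows and timestamp are invented for illustration, and GNU grep is assumed:

```shell
#!/bin/sh
# fake_dump stands in for mysqldump here: text lines, one line with an
# embedded NUL byte, and the trailer line that should be filtered out.
fake_dump() {
    printf 'LOCK TABLES `PartTable` WRITE;\n'
    printf 'INSERT INTO `PartTable` VALUES (1,"a\0b");\n'
    printf 'UNLOCK TABLES;\n'
    printf '%s\n' '-- Dump completed on 2016-06-13 21:45:30'
}

# --binary-files=text (same as -a or --text) keeps grep in text mode, so
# every non-trailer line is passed through, including the one with the NUL.
# The final tr only strips the NUL so the shell variable can hold the result.
filtered=$(fake_dump | grep --binary-files=text -v '^-- Dump completed on' | tr -d '\000')
printf '%s\n' "$filtered"
```

In a real backup pipeline the trailing tr would be dropped and grep's output redirected straight to the dump file, so the NUL bytes in the data are preserved.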

Regards,
Björn






Information forwarded to bug-grep <at> gnu.org:
bug#23763; Package grep. (Wed, 15 Jun 2016 05:31:02 GMT)

Message #30 received at 23763 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Bjoern Voigt <bjoernv <at> arcor.de>, 23763 <at> debbugs.gnu.org
Subject: Re: bug#23763: Bug report: Grep stops, if a text file contains a null
 character after 32768 bytes
Date: Tue, 14 Jun 2016 22:30:13 -0700
Bjoern Voigt wrote:
> --binary-files=TYPE
>    If the first few bytes of a file indicate that the file
>    contains binary data, assume that the file is of type TYPE.

That's another place where the man page is obsolete and wrong. (In GNU projects, 
man pages are often poorly maintained as they are not the primary form of 
documentation; one is supposed to read the manual instead.) I installed the 
attached patch to fix that.

> My MySQL mysqldump problem can be solved with --text or
> --binary-files=text. So I do not search a quick solution anymore.

Works for me.

[0001-doc-propagate-more-changes-from-grep.texi.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#23763; Package grep. (Wed, 15 Jun 2016 10:49:02 GMT)

Message #33 received at 23763 <at> debbugs.gnu.org:

From: sur-behoffski <sur_behoffski <at> grouse.com.au>
To: 23763 <at> debbugs.gnu.org
Subject: Re: bug#23763: Bug report: Grep stops, if a text file contains a null
 character after 32768 bytes
Date: Wed, 15 Jun 2016 20:18:41 +0930
On 06/15/16 15:00, Paul Eggert wrote:
> Bjoern Voigt wrote:
>> --binary-files=TYPE
>>    If the first few bytes of a file indicate that the file
>>    contains binary data, assume that the file is of type TYPE.
>
> That's another place where the man page is obsolete and wrong. (In GNU projects, man pages are often poorly maintained as they are not the primary form of documentation; one is supposed to read the manual instead.) I installed the attached patch to fix that.
>
>> My MySQL mysqldump problem can be solved with --text or
>> --binary-files=text. So I do not search a quick solution anymore.
>
> Works for me.
>

G'day,

Fairly pedantic comment:  I try to reserve the term "null" for use in
a pointer context (as NULL), and to use ASCII NUL for the zero character
('\0').  [Just checked, EBCDIC also uses NUL as its name for character
value 0.]

My experience (admittedly somewhat dated, and/or for smaller architectures)
is that preserving the distinction is valuable.

Are there documentation standards, especially GNU ones, that cover this
distinction?  If not, is it worth striving to gradually introduce this
in a systematic manner, e.g. "NUL (the zero character)"?

cheers,

sur-behoffski (Brenton Hoff)
Programmer, Grouse Software




Information forwarded to bug-grep <at> gnu.org:
bug#23763; Package grep. (Wed, 15 Jun 2016 18:30:02 GMT)

Message #36 received at 23763 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: sur-behoffski <sur_behoffski <at> grouse.com.au>, 23763 <at> debbugs.gnu.org
Subject: Re: bug#23763: Bug report: Grep stops, if a text file contains a null
 character after 32768 bytes
Date: Wed, 15 Jun 2016 11:29:51 -0700
sur-behoffski wrote:
> Are there documentation standards, especially GNU ones, that cover this
> distinction?  If not, is it worth striving to gradually introduce this
> in a systematic manner, e.g. "NUL (the zero character)"?

I don't know of any terminology standards in this area. (Wikipedia does not 
count. :-)




Information forwarded to bug-grep <at> gnu.org:
bug#23763; Package grep. (Wed, 15 Jun 2016 18:38:02 GMT)

Message #39 received at 23763 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, 23763 <at> debbugs.gnu.org
Subject: Re: bug#23763: Bug report: Grep stops, if a text file contains a null
 character after 32768 bytes
Date: Wed, 15 Jun 2016 12:37:34 -0600
On 06/15/2016 12:29 PM, Paul Eggert wrote:
> sur-behoffski wrote:
>> Are there documentation standards, especially GNU ones, that cover this
>> distinction?  If not, is it worth striving to gradually introduce this
>> in a systematic manner, e.g. "NUL (the zero character)"?
> 
> I don't know of any terminology standards in this area. (Wikipedia does
> not count. :-)

The POSIX standard tries to consistently use NUL for the character name
(whether you are using a unibyte encoding and it fits in char, or a wide
encoding where it fits in wchar_t), a 'null byte' for a byte that is all
zeroes (which happens to be the NUL character in both unibyte and in
multibyte encodings, since no other multibyte character is allowed to
have an embedded null byte), and a 'null pointer' when referring to a
pointer to nowhere (the constant NULL is a null pointer, as is the C
expression '((void*)0)', although the null pointer need not have an
all-zero bit representation in hardware).

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html
in particular 3.243-3.245

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 14 Jul 2016 11:24:03 GMT)

This bug report was last modified 7 years and 301 days ago.
