GNU bug report logs - #20526
BUG: text file is detected as binary

Package: grep

Reported by: Sebastian Poehn <sebastian.poehn <at> gmail.com>

Date: Thu, 7 May 2015 15:41:03 UTC

Severity: normal

Merged with 19230, 19985, 21558

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 20526 in the body.
You can then email your comments to 20526 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Thu, 07 May 2015 15:41:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Sebastian Poehn <sebastian.poehn <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Thu, 07 May 2015 15:41:04 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Sebastian Poehn <sebastian.poehn <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: BUG: text file is detected as binary
Date: Thu, 07 May 2015 13:08:08 +0200

[Message part 1 (text/plain, inline)]

Fedora 21
grep (GNU grep) 2.21

grep detects a text file as binary. The file is attached.

file Makefile 
Makefile: ISO-8859 text

ack PKG_NAME
Makefile
10:PKG_NAME:=clearsilver
14:PKG_SOURCE:=$(PKG_NAME)-$(PKG_VERSION).tar.gz

grep --version ; grep "PKG_NAME" Makefile
grep (GNU grep) 2.7
...
PKG_NAME:=clearsilver
PKG_SOURCE:=$(PKG_NAME)-$(PKG_VERSION).tar.gz

grep --version ; grep "PKG_NAME" Makefile
grep (GNU grep) 2.21
...
Binary file Makefile matches

[Makefile (text/x-makefile, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Thu, 07 May 2015 16:24:02 GMT) Full text and rfc822 format available.

Message #8 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Sebastian Poehn <sebastian.poehn <at> gmail.com>, 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Thu, 07 May 2015 09:23:27 -0700

That file uses ISO 8859 encoding (presumably Latin-1 or Latin-9), so you 
need to grep it in a locale compatible with that encoding.  It appears 
that you ran grep in a UTF-8 or other incompatible locale, which meant 
the ISO 8859 encoding wasn't valid and was treated as binary gibberish.  
You could try working around it with this:

grep -a PKG_NAME Makefile

or this:

LC_ALL=de_DE.iso885915 grep PKG_NAME Makefile

but in either case 'grep' might output the binary gibberish, which could 
cause other problems.  So it might be better to change that non-ASCII 
character in the file's string "Raphaël" to use an encoding compatible 
with the encoding of your locale.
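
A minimal sketch of the two workarounds above, using a small stand-in file rather than the attached Makefile. The file contents and the temporary directory are illustrative assumptions, and LC_ALL=C is swapped in for de_DE.iso885915 because that locale may not be installed everywhere. In both cases grep should print the matching line instead of the "Binary file ... matches" notice.

```shell
# Stand-in for the attached Makefile (an assumption, not its real contents):
# octal 353 is byte 0xEB, "ë" in ISO 8859-1 but not valid on its own in UTF-8.
tmp=$(mktemp -d)
printf '# Rapha\353l\nPKG_NAME:=clearsilver\n' > "$tmp/Makefile"

# Workaround 1: -a makes grep treat the file as text whatever bytes it holds.
with_a=$(grep -a PKG_NAME "$tmp/Makefile")

# Workaround 2: a unibyte locale. LC_ALL=C stands in for de_DE.iso885915 here.
with_c=$(LC_ALL=C grep PKG_NAME "$tmp/Makefile")

printf '%s\n%s\n' "$with_a" "$with_c"
rm -rf "$tmp"
```

Here the matching line is pure ASCII, so nothing undecodable reaches the output; a match on the "Raphaël" line would pass the raw 0xEB byte through.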

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Thu, 07 May 2015 17:48:01 GMT) Full text and rfc822 format available.

Message #11 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Sebastian Pöhn <sebastian.poehn <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Pöhn, Sebastian <sebastian.poehn <at> gmail.com>,
 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Thu, 7 May 2015 19:47:25 +0200

[Message part 1 (text/plain, inline)]

Thanks for the fast feedback. Your explanation sounds very reasonable. As
you may have noticed, this is a Makefile from OpenWrt, which is mainlined there.

1) I downgraded to grep 2.20. The issue is gone in the same environment. So
this is, in my eyes, a regression.

2) I will also open a report at Fedora; maybe they use some strange setting
in building the new package.

3) I will send a short notice to OpenWrt asking if they think it is fine to
use ë or ö. I personally have a strong opinion on that ;)

On 07.05.2015 at 6:23 PM, "Paul Eggert" <eggert <at> cs.ucla.edu> wrote:

> That file uses ISO 8859 encoding (presumably Latin-1 or Latin-9), so you
> need to grep it in a locale compatible with that encoding.  It appears that
> you ran grep in a UTF-8 or other incompatible locale, which meant the ISO
> 8859 encoding wasn't valid and was treated as binary gibberish.  You could
> try working around it with this:
>
> grep -a PKG_NAME Makefile
>
> or this:
>
> LC_ALL=de_DE.iso885915 grep PKG_NAME Makefile
>
> but in either case 'grep' might output the binary gibberish, which could
> cause other problems.  So it might be better to change that non-ASCII
> character in the file's string "Raphaël" to use an encoding compatible with
> the encoding of your locale.
>

[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Thu, 07 May 2015 19:12:01 GMT) Full text and rfc822 format available.

Message #14 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Sebastian Pöhn <sebastian.poehn <at> gmail.com>,
 Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Thu, 07 May 2015 13:11:49 -0600

[Message part 1 (text/plain, inline)]

On 05/07/2015 11:47 AM, Sebastian Pöhn wrote:
> Thanks for this fast feedback. Your explanation sounds very reasonable. As
> you may have noticed this a makefile out of openwrt with is mainlined there.
> 
> 1) I downgraded to grep 2.20. Issue is gone with the same environment. So
> this is in my eyes a regression.

No, it is a bug fix, and documented in the NEWS:

  If a file contains data improperly encoded for the current locale,
  and this is discovered before any of the file's contents are output,
  grep now treats the file as binary.

> 
> 2) I will also open a report at fedora, maybe the use some strange setting
> in building the new packet.

But as the change is intentional, there is probably nothing that Fedora
would do about it.

> 
> 3) I will send a short notice to openwrt asking if they think it is fine to
> use ë or ö. I personally have a strong opinion on that ;)

It would be fine if they would recode their file to use UTF-8, as that
is pretty much a standard encoding these days.  Latin-1 files are
getting harder and harder to process, as more people move to multibyte
UTF-8 locales.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]
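
The recoding suggested above could look roughly like the following sketch. It assumes the file really is ISO 8859-1 (the `file` output in the report only says "ISO-8859 text") and uses a stand-in file with illustrative contents; `iconv` does the conversion.

```shell
# Stand-in for the Latin-1 Makefile (illustrative contents, not the real file):
# octal 353 is byte 0xEB, which is "ë" in ISO 8859-1.
tmp=$(mktemp -d)
printf '# Rapha\353l\nPKG_NAME:=clearsilver\n' > "$tmp/Makefile"

# Recode into a new file first, then replace the original.
iconv -f ISO-8859-1 -t UTF-8 "$tmp/Makefile" > "$tmp/Makefile.utf8"
mv "$tmp/Makefile.utf8" "$tmp/Makefile"

# Dump the bytes as hex: "ë" should now be the two-byte sequence c3 ab.
hex=$(od -An -tx1 "$tmp/Makefile" | tr -d ' \n')
printf '%s\n' "$hex"
rm -rf "$tmp"
```

Writing to a separate file before the `mv` avoids truncating the input while `iconv` is still reading it.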

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Thu, 07 May 2015 20:08:02 GMT) Full text and rfc822 format available.

Message #17 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Sebastian Pöhn <sebastian.poehn <at> gmail.com>
Cc: 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Thu, 07 May 2015 13:07:20 -0700

[Message part 1 (text/plain, inline)]

On 05/07/2015 10:47 AM, Sebastian Pöhn wrote:
>
> Thanks for this fast feedback. Your explanation sounds very 
> reasonable. As you may have noticed this a makefile out of openwrt 
> with is mainlined there.
>
> 1) I downgraded to grep 2.20. Issue is gone with the same environment. 
> So this is in my eyes a regression.
>

Not really, as Openwrt is relying on undefined behavior.  The spec for 
grep has never defined what grep does when you feed it binary data that 
is not properly encoded for the current locale.  Different versions of 
grep (and we're not just talking GNU grep here, but other 
implementations) do different things.  Some grep implementations dump 
core.  These behaviors all conform to the spec.  (Well, GNU grep isn't 
supposed to dump core, but older versions of GNU grep are buggy and will 
dump core sometimes anyway, so you'll need good luck with them.)

> 2) I will also open a report at fedora, maybe the use some strange 
> setting in building the new packet.
>

Nowadays most people are using UTF-8, so I suggest encoding the 
Makefiles in UTF-8 and specifying a UTF-8 locale when you build. Another 
possibility is the attached hack (I haven't tried it).  The most 
conservative course would be to insist that Makefiles be ASCII, although 
....

> 3) I will send a short notice to openwrt asking if they think it is 
> fine to use ë or ö. I personally have a strong opinion on that ;)
>

Don't blame you a bit.

[openwrt.diff (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Fri, 08 May 2015 07:30:06 GMT) Full text and rfc822 format available.

Message #20 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Johannes Meixner <jsmeix <at> suse.de>
To: Sebastian Poehn <sebastian.poehn <at> gmail.com>
Cc: bug-grep <at> gnu.org, 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Fri, 8 May 2015 09:29:11 +0200 (CEST)

Hello,

only an addendum FYI:

On May 7 09:23 Paul Eggert wrote (excerpt):
> That file uses ISO 8859 encoding (presumably Latin-1 or Latin-9),
> so you need to grep it in a locale compatible with that encoding.

For some general information about that kind of issue have a look at
https://en.opensuse.org/SDB:Plain_Text_versus_Locale


Kind Regards
Johannes Meixner
-- 
SUSE LINUX GmbH - GF: Felix Imendoerffer, Jane Smithard, Jennifer Guild,
Dilip Upmanyu, Graham Norton - HRB 21284 (AG Nuernberg)

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Fri, 08 May 2015 07:30:10 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Fri, 08 May 2015 07:41:03 GMT) Full text and rfc822 format available.

Message #26 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Sebastian Poehn <sebastian.poehn <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 20526 <at> debbugs.gnu.org,
 Sebastian Pöhn <sebastian.poehn <at> gmail.com>,
 Eric Blake <eblake <at> redhat.com>
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Fri, 08 May 2015 09:40:46 +0200

On Thu, 2015-05-07 at 13:07 -0700, Paul Eggert wrote:
> On 05/07/2015 10:47 AM, Sebastian Pöhn wrote:
> >
> > Thanks for this fast feedback. Your explanation sounds very 
> > reasonable. As you may have noticed this a makefile out of openwrt 
> > with is mainlined there.
> >
> > 1) I downgraded to grep 2.20. Issue is gone with the same environment. 
> > So this is in my eyes a regression.
> >
> 
> Not really, as Openwrt is relying on undefined behavior.  The spec for 
> grep has never defined what grep does when you feed it binary data that 
> is not properly encoded for the current locale.  Different versions of 
> grep (and we're not just talking GNU grep here, but other 
> implementations) do different things.  Some grep implementations dump 
> core.  These behaviors all conform to the spec.  (Well, GNU grep isn't 
> supposed to dump core, but older versions of GNU grep are buggy and will 
> dump core sometimes anyway, so you'll need good luck with them.)

Ok, agree. It's not a regression. It's just that we got a little
stricter.
> 
> > 2) I will also open a report at fedora, maybe the use some strange 
> > setting in building the new packet.
> >
> 
> Nowadays most people are using UTF-8, so I suggest encoding the 
> Makefiles in UTF-8 and specifying a UTF-8 locale when you build. Another 
> possibility is the attached hack (I haven't tried it).  The most 
> conservative course would be to insist that Makefiles be ASCII, although 
> ....
There is already a report for this. Let's see what they do.
> 
> > 3) I will send a short notice to openwrt asking if they think it is 
> > fine to use ë or ö. I personally have a strong opinion on that ;)
> >
> 
> Don't blame you a bit.

I checked OpenWrt upstream. They changed all non-ASCII Makefiles
to UTF-8 three months ago, as they ran into exactly this.

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Fri, 08 May 2015 16:28:02 GMT) Full text and rfc822 format available.

Message #29 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Sebastian Poehn <sebastian.poehn <at> gmail.com>
Cc: Eric Blake <eblake <at> redhat.com>, 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Fri, 08 May 2015 09:27:45 -0700

Sebastian Poehn wrote:
> They changed all Makefiles not being ASCII
> to UTF-8 three months ago as they run into exactly this.

Hah!  Great minds think alike.

But they missed a few files (not Makefiles).  The following shell command finds 
every openwrt file that's not UTF-8 (and isn't obviously binary).  It works 
because '.' matches only properly-encoded characters.  You may need a new GNU 
grep for this command to be reliable.

LC_ALL=en_US.utf8 grep -lv '^.*$' \
  $(git ls-files | grep -Ev '\.(patch|bin|squashfs)$')
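
The command above depends on a recent GNU grep and an installed en_US.utf8 locale. Where neither can be relied on, a rough alternative is to let `iconv` act as the validator. This is a swapped-in technique, not the method from this message, and the file names below are made up for illustration.

```shell
# Three sample files: plain ASCII, valid UTF-8, and Latin-1 (invalid as UTF-8).
tmp=$(mktemp -d)
printf 'plain ASCII\n'    > "$tmp/ascii.txt"
printf 'Rapha\303\253l\n' > "$tmp/utf8.txt"
printf 'Rapha\353l\n'     > "$tmp/latin1.txt"

# iconv exits non-zero when its input is not valid in the source encoding,
# so a UTF-8 to UTF-8 "conversion" doubles as a validity check.
not_utf8=""
for f in "$tmp"/*.txt; do
  if ! iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
    not_utf8="$not_utf8${f##*/} "
  fi
done

printf 'not valid UTF-8: %s\n' "$not_utf8"
rm -rf "$tmp"
```

Unlike the grep version, this does not skip obviously binary files, so the same kind of exclusion list (patch, bin, squashfs) would still be needed on a real tree.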

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Mon, 11 May 2015 11:06:02 GMT) Full text and rfc822 format available.

Message #32 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Kamil Dudka <kdudka <at> redhat.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: 20526 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>,
 Sebastian Pöhn <sebastian.poehn <at> gmail.com>,
 debbugs-submit <at> debbugs.gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Mon, 11 May 2015 13:05:23 +0200

On Thursday 07 May 2015 13:11:49 Eric Blake wrote:
> On 05/07/2015 11:47 AM, Sebastian Pöhn wrote:
> > Thanks for this fast feedback. Your explanation sounds very reasonable. As
> > you may have noticed this a makefile out of openwrt with is mainlined
> > there.
> > 
> > 1) I downgraded to grep 2.20. Issue is gone with the same environment. So
> > this is in my eyes a regression.
> 
> No, it is a bug fix, and documented in the NEWS:
> 
>   If a file contains data improperly encoded for the current locale,
>   and this is discovered before any of the file's contents are output,
>   grep now treats the file as binary.

Which bug does it fix?

The upstream commit in question (cd36abd4) does not refer to any bug report.
Also the fact that the commit had to change existing regression tests to 
prevent them from failing suggests that it can be seen as a regression.

> > 2) I will also open a report at fedora, maybe the use some strange setting
> > in building the new packet.
> 
> But as the change is intentional, there is probably nothing that Fedora
> would do about it.

I already created a bug for Fedora:

https://bugzilla.redhat.com/1219141

Kamil

> > 3) I will send a short notice to openwrt asking if they think it is fine
> > to
> > use ë or ö. I personally have a strong opinion on that ;)
> 
> It would be fine if they would recode their file to use UTF-8, as that
> is pretty much a standard encoding these days.  Latin-1 files are
> getting harder and harder to process, as more people move to multibyte
> UTF-8 locales.

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Tue, 12 May 2015 04:28:01 GMT) Full text and rfc822 format available.

Message #35 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Kamil Dudka <kdudka <at> redhat.com>, Eric Blake <eblake <at> redhat.com>
Cc: 20526 <at> debbugs.gnu.org,
 Sebastian Pöhn <sebastian.poehn <at> gmail.com>,
 debbugs-submit <at> debbugs.gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Mon, 11 May 2015 21:27:35 -0700

Kamil Dudka wrote:
> Which bug does it fix?

I don't recall a bug report being filed for it, but the old grep behavior had 
real problems: as I remember at times it dumped core, and at other times it spit 
out improperly encoded data to the terminal.  We've fixed the core dumps I know 
about, though I think grep still outputs improperly encoded data at times (and 
this should get fixed too -- see below for a suggestion).

At any rate, applications could never assume a particular behavior for 
improperly encoded files, so the current behavior is clearly not a bug.  Users 
may be able to scrape along by setting LC_ALL=C before running 'grep' -- the 
problems LC_ALL=C runs into are about the same as the problems with using old 
'grep' (except that the new grep doesn't dump core :-).


Perhaps we can improve the behavior of grep by changing its heuristic slightly. 
 Currently grep reports "Binary file FOO matches" if it finds binary data in 
FOO before it finds the first match.  Instead, perhaps we could change grep to 
report "Binary file FOO matches" when it sees that it's about to generate binary 
*output* copied from FOO, regardless of whether this output represents the first 
match.  That is, when grep sees that it's about to output binary data, grep 
instead outputs "Binary file FOO matches" and then stops output for FOO (even if 
it already output some lines for ordinary matches in FOO).

This approach would fix the problem of grep trashing the output stream, and it 
should be less drastic than grep's current approach, in that it would make grep 
more likely to do what Kamil Dudka is asking for (assuming grep is given mostly 
valid input interspersed with small amounts of binary data).

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Tue, 12 May 2015 08:43:02 GMT) Full text and rfc822 format available.

Message #38 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Kamil Dudka <kdudka <at> redhat.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Eric Blake <eblake <at> redhat.com>
Cc: 20526 <at> debbugs.gnu.org,
 Sebastian Pöhn <sebastian.poehn <at> gmail.com>,
 debbugs-submit <at> debbugs.gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Tue, 12 May 2015 10:41:53 +0200

On Monday 11 May 2015 21:27:35 Paul Eggert wrote:
> Perhaps we can improve the behavior of grep by changing its heuristic
> slightly. Currently grep reports "Binary file FOO matches" if it finds
> binary data in FOO before it finds the first match.  Instead, perhaps we
> could change grep to report "Binary file FOO matches" when it sees that
> it's about to generate binary *output* copied from FOO, regardless of
> whether this output represents the first match.  That is, when grep sees
> that it's about to output binary data, grep instead outputs "Binary file
> FOO matches" and then stops output for FOO (even if it already output some
> lines for ordinary matches in FOO).
> 
> This approach would fix the problem of grep trashing the output stream, and
> it should be less drastic than grep's current approach, in that it would
> make grep more likely to do what Kamil Dudka is asking for (assuming grep
> is given mostly valid input interspersed with small amounts of binary
> data).

Thanks for the suggestion!  I believe that such an approach would work for me.  
Do you want me to write a patch implementing it?

Eric, what do you think about the change proposed above?

Kamil

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Tue, 12 May 2015 12:07:02 GMT) Full text and rfc822 format available.

Message #41 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Kamil Dudka <kdudka <at> redhat.com>, Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 20526 <at> debbugs.gnu.org,
 Sebastian Pöhn <sebastian.poehn <at> gmail.com>,
 debbugs-submit <at> debbugs.gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Tue, 12 May 2015 06:06:13 -0600

[Message part 1 (text/plain, inline)]

On 05/12/2015 02:41 AM, Kamil Dudka wrote:
> On Monday 11 May 2015 21:27:35 Paul Eggert wrote:
>> Perhaps we can improve the behavior of grep by changing its heuristic
>> slightly. Currently grep reports "Binary file FOO matches" if it finds
>> binary data in FOO before it finds the first match.  Instead, perhaps we
>> could change grep to report "Binary file FOO matches" when it sees that
>> it's about to generate binary *output* copied from FOO, regardless of
>> whether this output represents the first match.  That is, when grep sees
>> that it's about to output binary data, grep instead outputs "Binary file
>> FOO matches" and then stops output for FOO (even if it already output some
>> lines for ordinary matches in FOO).
>>
>> This approach would fix the problem of grep trashing the output stream, and
>> it should be less drastic than grep's current approach, in that it would
>> make grep more likely to do what Kamil Dudka is asking for (assuming grep
>> is given mostly valid input interspersed with small amounts of binary
>> data).
> 
> Thanks for the suggestion!  I believe that such approach would work for me.  
> Do you want me to write a patch implementing it?
> 
> Eric, what do you think about the change proposed above?

I'm still a bit worried that encoding errors encountered on input, even
though they don't match for output, may still cause issues for some
patterns (we've had cases of encoding errors causing 'grep -P' to go
into an infinite loop, for example); but yes, as the behavior is
undefined, we are still justified in adopting those heuristics, if
someone is willing to contribute a patch along those lines.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Wed, 13 May 2015 00:09:01 GMT) Full text and rfc822 format available.

Message #44 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eric Blake <eblake <at> redhat.com>, Kamil Dudka <kdudka <at> redhat.com>
Cc: 20526 <at> debbugs.gnu.org,
 Sebastian Pöhn <sebastian.poehn <at> gmail.com>,
 debbugs-submit <at> debbugs.gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Tue, 12 May 2015 17:08:42 -0700

Eric Blake wrote:
> I'm still a bit worried that encoding errors encountered on input, even
> though they don't match for output, may still cause issues for some
> patterns (we've had cases of encoding errors causing 'grep -P' to go
> into an infinite loop, for example);

Yes, that's right.  We can't go back to the old way of doing things.  Encoding 
errors in the data must not be matched by any regular expression (not even "."). 
 'grep -P' won't loop if we never pass encoding errors to the PCRE matcher, so 
that's what we gotta do.

> but yes, as the behavior is
> undefined, we are still justified in adopting those heuristics, if
> someone is willing to contribute a patch along those lines.

The hard part about it (and the reason I haven't written up a patch yet) is 
making sure the above property holds, while continuing to have good performance 
in the typical case where the input is validly encoded.  I suppose it's OK, 
though, if the change hurts performance only for the -P case, since -P is so 
slow anyway.

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Thu, 21 May 2015 00:50:03 GMT) Full text and rfc822 format available.

Message #47 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ángel González <angel <at> re.16bits.net>
To: bug-grep <at> gnu.org
Subject: Re: bug#20526: BUG: text file is detected as binary
Date: Thu, 21 May 2015 02:27:43 +0200

Paul Eggert wrote:
> Perhaps we can improve the behavior of grep by changing its heuristic 
> slightly. 
>   Currently grep reports "Binary file FOO matches" if it finds binary 
> data in FOO before it finds the first match.  Instead, perhaps we 
> could change grep to report "Binary file FOO matches" when it sees 
> that it's about to generate binary *output* copied from FOO, 
> regardless of whether this output represents the first match.  That 
> is, when grep sees that it's about to output binary 
> data, grep instead outputs "Binary file FOO matches" and then stops 
> output for FOO (even if it already output some lines for ordinary 
> matches in FOO).

Another option would be to escape the problematic binary data (but how
to escape the escape char?) or maybe even replace it with U+FFFD if our
output is utf-8 (this has its own sort of problems when trying to
determine what was really matched, though).

> This approach would fix the problem of grep trashing the output 
> stream, and it should be less drastic than grep's current approach, 
> in that it would make grep more likely to do what Kamil Dudka is 
> asking for (assuming grep is given mostly valid input interspersed 
> with small amounts of binary data).

+1

When grep is the last component of a pipeline, it isn't too bad. The
danger comes when grep feeds the later stages of a pipeline instead.
Sebastian's Makefile is one such case. Another silly example: we
might have a list of people and be interested in knowing how many of
their names begin with J (but excluding pseudonyms):

 printf 'John Smith\nJohannes Meixner\nPaul Eggert\nJohn Doe\n' > defendants-2015-05-15
 grep ^J defendants-2015-05-* | sort -u | grep -vc "John Doe"

works perfectly, until the day someone provides an incorrectly encoded entry:
 printf 'Pedro P\xe9rez\n' >> defendants-2015-05-15
and havoc ensues.

It's something that should never happen, but someone else prepared the
file for you, or it comes from a third party (and sometimes it only
makes sense for them to be ANSI, yet one day there are unencoded high
bytes)
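
One way to keep such a pipeline stable is the -a option mentioned earlier in this thread, sketched below under the assumption that letting raw bytes pass through the later stages is acceptable. The file name and temporary directory are illustrative.

```shell
# Rebuild the example list, including the entry with an unencoded high byte
# (octal 351 is 0xE9, "é" in Latin-1).
tmp=$(mktemp -d)
printf 'John Smith\nJohannes Meixner\nPaul Eggert\nJohn Doe\n' > "$tmp/defendants"
printf 'Pedro P\351rez\n' >> "$tmp/defendants"

# -a keeps grep emitting the matching lines themselves, so the later stages
# never see a one-line "Binary file ... matches" notice in place of the data.
count=$(grep -a '^J' "$tmp/defendants" | LC_ALL=C sort -u | grep -vc 'John Doe')

printf '%s\n' "$count"
rm -rf "$tmp"
```

The trade-off is the one raised elsewhere in the thread: with -a, a match on the badly encoded line would put that raw byte on the output stream.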

Merged 19230 19985 20526. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sat, 30 May 2015 20:05:06 GMT) Full text and rfc822 format available.

Merged 19230 19985 20526 21558. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Fri, 25 Sep 2015 18:05:03 GMT) Full text and rfc822 format available.

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Thu, 31 Dec 2015 03:26:01 GMT) Full text and rfc822 format available.

Notification sent to Sebastian Poehn <sebastian.poehn <at> gmail.com>:
bug acknowledged by developer. (Thu, 31 Dec 2015 03:26:02 GMT) Full text and rfc822 format available.

Message #56 received at 20526-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 20526-done <at> debbugs.gnu.org
Cc: Kamil Dudka <kdudka <at> redhat.com>, Benno Schulenberg <bensberg <at> justemail.net>,
 Mike Frysinger <vapier <at> gentoo.org>, Johannes Meixner <jsmeix <at> suse.de>,
 Hans Pelleboer <hanspelleboer <at> online.nl>,
 Sebastian Poehn <sebastian.poehn <at> gmail.com>,
 Ángel González <angel <at> re.16bits.net>,
 Eric Blake <eblake <at> redhat.com>
Subject: Re: grep BUG: text file is detected as binary
Date: Wed, 30 Dec 2015 19:25:04 -0800

[Message part 1 (text/plain, inline)]

I installed into Savannah a patch (attached) that should fix this problem in 
typical cases, and am boldly marking the bug as done. Please give the fix a try 
if you have the time. Thanks.

[0001-grep-be-less-picky-about-encoding-errors.patch (text/x-diff, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Thu, 31 Dec 2015 03:26:02 GMT) Full text and rfc822 format available.

Notification sent to Hans Pelleboer <hanspelleboer <at> online.nl>:
bug acknowledged by developer. (Thu, 31 Dec 2015 03:26:02 GMT) Full text and rfc822 format available.

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Thu, 31 Dec 2015 03:26:02 GMT) Full text and rfc822 format available.

Notification sent to Mike Frysinger <vapier <at> gentoo.org>:
bug acknowledged by developer. (Thu, 31 Dec 2015 03:26:02 GMT) Full text and rfc822 format available.

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Thu, 31 Dec 2015 03:26:02 GMT) Full text and rfc822 format available.

Notification sent to Benno Schulenberg <bensberg <at> justemail.net>:
bug acknowledged by developer. (Thu, 31 Dec 2015 03:26:03 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Thu, 31 Dec 2015 05:00:02 GMT) Full text and rfc822 format available.

Message #74 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: 20526 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>,
 sebastian.poehn <at> gmail.com
Cc: Johannes Meixner <jsmeix <at> suse.de>, Kamil Dudka <kdudka <at> redhat.com>,
 Benno Schulenberg <bensberg <at> justemail.net>, 20526-done <at> debbugs.gnu.org
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Wed, 30 Dec 2015 20:59:30 -0800

On Wed, Dec 30, 2015 at 7:25 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> I installed into Savannah a patch (attached) that should fix this problem in
> typical cases, and am boldly marking the bug as done. Please give the fix a
> try if you have the time. Thanks.

Thank you!
The combination of this and the grep -oP infloop fix make this look
like a good time for a bug-fix release. If there are any other pending
bug fixes or small+safe changes people would like to see included,
please let us know.

I would like to publish a pre-release snapshot soon.

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Thu, 31 Dec 2015 09:30:02 GMT) Full text and rfc822 format available.

Message #80 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, 20526 <at> debbugs.gnu.org,
 sebastian.poehn <at> gmail.com
Cc: Johannes Meixner <jsmeix <at> suse.de>, Kamil Dudka <kdudka <at> redhat.com>,
 Benno Schulenberg <bensberg <at> justemail.net>
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Thu, 31 Dec 2015 01:29:35 -0800

Jim Meyering wrote:
> The combination of this and the grep -oP infloop fix make this look
> like a good time for a bug-fix release. If there are any other pending
> bug fixes or small+safe changes people would like to see included,
> please let us know.

I have one major qualm about this: since 'grep' no longer checks whether the 
input is correctly encoded, I expect this may hurt -P performance significantly 
(though it may help non -P performance). This is because PCRE is slow at 
checking whether input data are valid UTF-8. I just now did a brief check and 
found one major performance issue:

grep -rP 'fed.*cba' .

On my machine the above command is 125x slower with the new grep than the old 
one, which suggests some tuning is in order before releasing. (It's bogged down 
inside libpcre somewhere.)

Since you wrote your email I triaged the outstanding bugs, except for those 
where patches are available. Those are mostly performance-related, and I expect 
some of them to be relevant to the -P slowdown.

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Thu, 31 Dec 2015 15:24:02 GMT) Full text and rfc822 format available.

Message #83 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: eggert <at> cs.ucla.edu
Cc: sebastian.poehn <at> gmail.com, 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Fri, 01 Jan 2016 00:23:11 +0900

On Wed, 30 Dec 2015 19:25:04 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> I installed into Savannah a patch (attached) that should fix this
> problem in typical cases, and am boldly marking the bug as done.
> Please give the fix a try if you have the time. Thanks.

I get the following output after applying the patch.  Is it expected?

$ printf 'a\na\377\na\n' | LANG=en_US.utf8 src/grep a
a
Binary file (standard input) matches
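
The output above is consistent with a simple line-by-line rule: matching
lines are printed as text until a matching line contains an encoding
error, at which point the binary-file notice is printed and output stops.
The following is a minimal Python sketch of that rule, assuming UTF-8
input and a fixed-string pattern. The function name grep_like is
hypothetical, and this only illustrates the observed output; it is not
grep's implementation.

```python
def grep_like(data: bytes, pattern: bytes, name: str = "(standard input)") -> list:
    """Return the lines a grep-like tool would print for `data` (illustrative only)."""
    out = []
    for line in data.splitlines():
        if pattern not in line:
            continue  # non-matching lines are never printed
        try:
            text = line.decode("utf-8")
        except UnicodeDecodeError:
            # First matching line with an encoding error: report and stop.
            out.append(f"Binary file {name} matches")
            break
        out.append(text)
    return out


# Same bytes as the printf example: 'a', 'a' + 0xFF, 'a'.
print("\n".join(grep_like(b"a\na\377\na\n", b"a")))
```

The first "a" is valid text and is printed; the second line contains the
byte \377, which can never appear in UTF-8, so the notice replaces it and
the third line is not reached.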

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Sat, 02 Jan 2016 00:08:02 GMT) Full text and rfc822 format available.

Message #86 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: sebastian.poehn <at> gmail.com, 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Sat, 02 Jan 2016 06:39:03 +0900

On Thu, 31 Dec 2015 10:04:06 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Yes, it's expected.  Thanks, this should be stated more clearly, so I installed the attached documentation patch.

Thanks.

By the way, why is this check applied only in multibyte locales?  E.g.,
if the input contains \200 in en_US.iso88591, which is not the POSIX locale,
I think grep may need to report `Binary file ... matches', as mbrlen(3)
returns -1 for \200.

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Sat, 02 Jan 2016 00:31:01 GMT) Full text and rfc822 format available.

Message #89 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: sebastian.poehn <at> gmail.com, 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Thu, 31 Dec 2015 10:04:06 -0800

[Message part 1 (text/plain, inline)]

Norihiro Tanaka wrote:
> I get the following output after applying the patch.  Is it expected?
>
> $ printf 'a\na\377\na\n' | LANG=en_US.utf8 src/grep a
> a
> Binary file (standard input) matches

Yes, it's expected.  Thanks, this should be stated more clearly, so I installed 
the attached documentation patch.

[0001-doc-clarify-text-vs-binary-match-output.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Sat, 02 Jan 2016 05:24:01 GMT) Full text and rfc822 format available.

Message #92 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: sebastian.poehn <at> gmail.com, 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Fri, 1 Jan 2016 21:22:54 -0800

[Message part 1 (text/plain, inline)]

Norihiro Tanaka wrote:
> why is this check applied only in multibyte locales?

Ouch, good point. I missed the possibility of a unibyte encoding where not all 
bytes are valid unibyte characters. I installed the attached additional patch to 
fix this, and to test for the bug I recently introduced here.

[0001-grep-fix-bug-with-with-invalid-unibyte-sequence.txt (text/plain, attachment)]
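
To make the point about unibyte encodings concrete, here is a small
Python sketch that lists the byte values a single-byte codec cannot
decode. It assumes Python's codec tables as a stand-in for a C library
locale's character map; the two can differ, so the results say nothing
definite about any particular system locale. The helper name
invalid_bytes is hypothetical.

```python
def invalid_bytes(encoding: str) -> list:
    """Return the byte values (0-255) that are not valid characters in `encoding`."""
    bad = []
    for b in range(256):
        try:
            bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(b)
    return bad


for enc in ("latin-1", "cp1252", "ascii"):
    bad = invalid_bytes(enc)
    print(enc, len(bad), [hex(b) for b in bad[:5]])
```

In Python's tables, latin-1 maps every byte, cp1252 leaves a handful of
bytes undefined, and ascii rejects everything from 0x80 up. A unibyte
encoding of the second or third kind is the case where input can contain
encoding errors even though no character spans more than one byte.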

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Sun, 03 Jan 2016 01:33:02 GMT) Full text and rfc822 format available.

Message #95 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: sebastian.poehn <at> gmail.com, 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Sun, 03 Jan 2016 10:32:06 +0900

[Message part 1 (text/plain, inline)]

On Fri, 1 Jan 2016 21:22:54 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Ouch, good point. I missed the possibility of a unibyte encoding where
> not all bytes are valid unibyte characters. I installed the attached
> additional patch to fix this, and to test for the bug I recently
> introduced here.

Thanks, I see that it is a good idea, but I propose a minor change to your
fix.  Perhaps it will be what you want.

[0001-grep-minor-improvements-to-previous-change.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Tue, 05 Jan 2016 11:28:02 GMT) Full text and rfc822 format available.

Message #98 received at 20526-done <at> debbugs.gnu.org (full text, mbox):

From: Kamil Dudka <kdudka <at> redhat.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Benno Schulenberg <bensberg <at> justemail.net>, 20526-done <at> debbugs.gnu.org,
 Ángel González <angel <at> 16bits.net>,
 Johannes Meixner <jsmeix <at> suse.de>, Hans Pelleboer <hanspelleboer <at> online.nl>,
 Sebastian Poehn <sebastian.poehn <at> gmail.com>,
 Mike Frysinger <vapier <at> gentoo.org>, Eric Blake <eblake <at> redhat.com>
Subject: Re: grep BUG: text file is detected as binary
Date: Tue, 05 Jan 2016 12:26:52 +0100

On Wednesday 30 December 2015 19:25:04 Paul Eggert wrote:
> I installed into Savannah a patch (attached) that should fix this problem in
> typical cases, and am boldly marking the bug as done. Please give the fix a
> try if you have the time. Thanks.

Thanks for the fixup!  I can confirm that it resolves the issue described at:

https://bugzilla.redhat.com/1219141

Kamil

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Wed, 06 Jan 2016 07:34:02 GMT) Full text and rfc822 format available.

Message #101 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: sebastian.poehn <at> gmail.com, 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Tue, 5 Jan 2016 23:33:39 -0800

[Message part 1 (text/plain, inline)]

Norihiro Tanaka wrote:
> I see that it is a good idea, but I propose a minor change to your
> fix.  Perhaps it will be what you want.

I think the problem here is that the code was not computing unibyte_mask 
correctly; that is, the comment for unibyte_mask is correct, and usage of 
unibyte_mask is correct, but unibyte_mask was sometimes initialized incorrectly 
in unusual locales. I installed the attached patch to try to fix that. Computing 
an optimal unibyte_mask (for a reasonable definition of "optimal") is likely 
more trouble than it is worth.

[0001-Fix-calculation-of-unibyte_mask.patch (text/x-diff, attachment)]
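
As an illustration of what a conservative (non-optimal) mask of this kind
can look like, the Python sketch below derives a one-byte mask with the
property that a byte c is certainly a valid single-byte character
whenever c & mask is zero. This is an assumption-laden sketch, not the
code from the attached patch: the function name build_unibyte_mask is
hypothetical, Python's codec tables stand in for the locale, and only
unibyte encodings and UTF-8 are considered.

```python
def build_unibyte_mask(encoding: str) -> int:
    """Return a mask such that (c & mask) == 0 implies byte c decodes on its own."""
    mask = 0
    # Byte 0 (NUL) is skipped: no mask can flag it, and it decodes in these codecs.
    for b in range(1, 256):
        try:
            bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            if b & mask == 0:
                # Not yet caught by the mask: add this byte's most significant 1 bit.
                mask |= 1 << (b.bit_length() - 1)
    return mask


for enc in ("utf-8", "latin-1", "cp1252"):
    print(enc, hex(build_unibyte_mask(enc)))
```

The mask only ever gains bits, so every invalid byte stays flagged once
caught. It is deliberately not optimal: under cp1252, for example, the
result flags every byte with the high bit set, although most of those
bytes are valid characters and would simply take a slower checking path.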

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Wed, 06 Jan 2016 08:33:01 GMT) Full text and rfc822 format available.

Message #104 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, 20526 <at> debbugs.gnu.org,
 sebastian.poehn <at> gmail.com
Cc: Johannes Meixner <jsmeix <at> suse.de>, Kamil Dudka <kdudka <at> redhat.com>,
 Benno Schulenberg <bensberg <at> justemail.net>
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Wed, 6 Jan 2016 00:32:17 -0800

[Message part 1 (text/plain, inline)]

Paul Eggert wrote:

> grep -rP 'fed.*cba' .
>
> On my machine the above command is 125x slower with the new grep than the old
> one, which suggests some tuning is in order before releasing. (It's bogged down
> inside libpcre somewhere.)

I installed the attached patch, which fixed this performance bug for me.

[0001-grep-restore-P-PCRE_NO_UTF8_CHECK-optimization.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Wed, 06 Jan 2016 17:58:01 GMT) Full text and rfc822 format available.

Message #107 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, 20526 <at> debbugs.gnu.org,
 sebastian.poehn <at> gmail.com
Cc: Johannes Meixner <jsmeix <at> suse.de>, Kamil Dudka <kdudka <at> redhat.com>,
 Benno Schulenberg <bensberg <at> justemail.net>
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Wed, 6 Jan 2016 09:57:46 -0800

[Message part 1 (text/plain, inline)]

On 01/06/2016 12:32 AM, Paul Eggert wrote:
> I installed the attached patch, which fixed this performance bug for me. 

Whoops! I forgot to 'git add src/search.h' before committing. We also 
need the attached followup patch, which I installed.

[0001-grep-restore-P-optimization-followup-fix.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Wed, 06 Jan 2016 18:13:02 GMT) Full text and rfc822 format available.

Message #110 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Sebastian Pöhn <sebastian.poehn <at> gmail.com>,
 Kamil Dudka <kdudka <at> redhat.com>, Benno Schulenberg <bensberg <at> justemail.net>,
 20526 <at> debbugs.gnu.org, Johannes Meixner <jsmeix <at> suse.de>
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Wed, 6 Jan 2016 10:11:36 -0800

On Wed, Jan 6, 2016 at 9:57 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 01/06/2016 12:32 AM, Paul Eggert wrote:
>>
>> I installed the attached patch, which fixed this performance bug for me.
>
> Whoops! I forgot to 'git add src/search.h' before committing. We also need
> the attached followup patch, which I installed.

Oh, perfect!  Thank you once again.  Happy new year.

Interestingly, while running tests of the just-updated code, I've just
noticed an unrelated false-positive failure on fast systems: I will
adjust the mb-non-UTF8-performance test to be more adaptive: rather
than using a fixed-size input, I'll choose one that is large enough to
make the unibyte grep invocation take a certain amount of time.

Once that's resolved, I'll make a pre-release snapshot, planning to
let that soak for a couple weeks before releasing grep-2.23.
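
The adaptive approach described above (grow the input until the baseline
run takes a measurable amount of time, rather than fixing the size in
advance) can be sketched in Python as follows. The names calibrate_size
and time_scan are hypothetical, and the byte-counting workload is only a
placeholder for a real grep invocation; the actual test lives in grep's
test suite and is not reproduced here.

```python
import time


def calibrate_size(measure, min_seconds: float, start: int = 1024,
                   limit: int = 1 << 26) -> int:
    """Double the input size until measure(size) reaches min_seconds or size hits limit."""
    size = start
    while size < limit and measure(size) < min_seconds:
        size *= 2
    return size


def time_scan(size: int) -> float:
    """Placeholder workload: time one scan over `size` bytes."""
    data = b"a" * size
    t0 = time.perf_counter()
    data.count(b"b")
    return time.perf_counter() - t0


if __name__ == "__main__":
    print("calibrated size:", calibrate_size(time_scan, 0.01))
```

Because the size is derived from a measured baseline, a fast machine gets
a larger input, which keeps the timing comparison from being dominated by
noise.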

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Fri, 08 Jan 2016 13:45:01 GMT) Full text and rfc822 format available.

Message #113 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Kamil Dudka <kdudka <at> redhat.com>, Benno Schulenberg <bensberg <at> justemail.net>,
 Jim Meyering <jim <at> meyering.net>, Johannes Meixner <jsmeix <at> suse.de>,
 sebastian.poehn <at> gmail.com, 22103-done <at> gnu.org, 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Fri, 08 Jan 2016 22:44:28 +0900

On Wed, 6 Jan 2016 09:57:46 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 01/06/2016 12:32 AM, Paul Eggert wrote:
> > I installed the attached patch, which fixed this performance bug for me. 
> Whoops! I forgot to 'git add src/search.h' before committing. We also need the attached followup patch, which I installed.

Great!  Thanks, many issues, including the output of invalid sequences,
are fixed by your patches.  bug#22103 is also fixed by them, so I am
closing it.

Information forwarded to bug-grep <at> gnu.org:
bug#20526; Package grep. (Fri, 08 Jan 2016 15:29:02 GMT) Full text and rfc822 format available.

Message #116 received at 20526 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: sebastian.poehn <at> gmail.com, 20526 <at> debbugs.gnu.org
Subject: Re: bug#20526: grep BUG: text file is detected as binary
Date: Fri, 8 Jan 2016 07:27:56 -0800

[Message part 1 (text/plain, inline)]

Paul Eggert wrote:
> I missed the possibility of a unibyte encoding where not all bytes are valid
> unibyte characters.

I found a significant performance problem related to that bug and bug fix, and 
installed the attached further patch 0001. Come to think of it, this issue 
should be in NEWS too, so I added the attached patch 0002.

[0001-grep-improve-unibyte-P-performance.patch (text/x-diff, attachment)]

[0002-doc-mention-unibyte-encoding-fix.patch (text/x-diff, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 06 Feb 2016 12:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 85 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
