GNU bug report logs - #29668
grep: Fatal problem with (big) file


Package: grep;

Reported by: pg <pasi.vitsa <at> yahoo.com>

Date: Mon, 11 Dec 2017 22:03:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 29668 in the body.
You can then email your comments to 29668 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#29668; Package grep. (Mon, 11 Dec 2017 22:03:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to pg <pasi.vitsa <at> yahoo.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 11 Dec 2017 22:03:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: pg <pasi.vitsa <at> yahoo.com>
To: bug-grep <at> gnu.org
Cc: toimitus <at> masinistit.com, webmaster <at> ubuntu.com
Subject: grep: Fatal problem with (big) file
Date: Mon, 11 Dec 2017 23:45:25 +0200
Hello!

$ awk '/Volvo/' Tieliikenne5.0.csv | wc -l
266175
$ grep Volvo Tieliikenne5.0.csv | wc -l
1638
$ echo $?   # also 0 after running "grep Volvo Tieliikenne5.0.csv" on its own
0
$ ack Volvo Tieliikenne5.0.csv | wc -l
266175

The file contains 5 million lines. It is a dump of Finland's vehicle database:
http://trafiopendata.97.fi/opendata/171009_Tieliikenne_5_0.zip

$ uname -a
Linux pg-desktop 4.10.0-40-generic #44~16.04.1-Ubuntu SMP Thu Nov 9
15:37:44 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Fatal error with a "small" file too:
$ awk '/Volvo/' Tieliikenne5.0.csv > volvot.csv
$ awk '/N3/'  volvot.csv | wc -l
17822
$ grep N3 volvot.csv | wc -l
1701
$ wc -l volvot.csv 
266175 volvot.csv

BR
pg

PS: Ubuntu webmaster - please put a bug-report address into your system,
and could you forward this message?
PPS: toimitus - I certainly used to know how to grep ;-)
PPPS: Pointer error again? Use perl or die!




Information forwarded to bug-grep <at> gnu.org:
bug#29668; Package grep. (Mon, 11 Dec 2017 23:37:02 GMT) Full text and rfc822 format available.

Message #8 received at 29668 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: pg <pasi.vitsa <at> yahoo.com>
Cc: 29668 <at> debbugs.gnu.org, toimitus <at> masinistit.com, webmaster <at> ubuntu.com
Subject: Re: bug#29668: grep: Fatal problem with (big) file
Date: Tue, 12 Dec 2017 08:36:36 +0900
On Mon, 11 Dec 2017 23:45:25 +0200
pg <pasi.vitsa <at> yahoo.com> wrote:

> $ awk '/Volvo/' Tieliikenne5.0.csv | wc -l
> 266175
> $ grep Volvo Tieliikenne5.0.csv | wc -l
> 1638

> $ awk '/N3/' volvot.csv | wc -l
> 17822
> $ grep N3 volvot.csv | wc -l
> 1701

Perhaps Tieliikenne 5.0.csv and volvot.csv contain characters that cannot
be recognized in your locale.  Try the following.

--
$ env LC_ALL=C grep 'Volvo' Tieliikenne\ 5.0.csv | wc -l
266175

or

$ grep -a 'Volvo' Tieliikenne\ 5.0.csv | wc -l
266175

--
$ env LC_ALL=C grep N3 volvot.csv | wc -l
17822

or

$ grep -a N3 volvot.csv | wc -l
17822
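
A minimal sketch of how one might confirm this diagnosis before reaching for LC_ALL=C or -a, assuming a UTF-8 locale is in use and that an iconv which exits non-zero on undecodable input is installed. The helper name utf8_status and the sample bytes are hypothetical; they are not taken from the report or from Tieliikenne5.0.csv.

```shell
# utf8_status FILE: print "valid" if FILE decodes cleanly as UTF-8, else "invalid".
# Assumption: iconv returns a non-zero status when it meets an undecodable byte.
utf8_status() {
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo valid
    else
        echo invalid
    fi
}

# Hypothetical sample files standing in for the real data.
bad=$(mktemp)
good=$(mktemp)
printf 'Volvo;\344\n' > "$bad"       # a lone 0xE4 byte, not valid UTF-8
printf 'Volvo;\303\244\n' > "$good"  # the two-byte UTF-8 encoding of U+00E4

bad_status=$(utf8_status "$bad")
good_status=$(utf8_status "$good")
echo "bad sample:  $bad_status"
echo "good sample: $good_status"

rm -f "$bad" "$good"
```

A file reported as invalid here is a candidate for the "Binary file ... matches" behaviour in a UTF-8 locale.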





Information forwarded to bug-grep <at> gnu.org:
bug#29668; Package grep. (Wed, 13 Dec 2017 00:29:02 GMT) Full text and rfc822 format available.

Message #11 received at 29668 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, pg <pasi.vitsa <at> yahoo.com>
Cc: 29668 <at> debbugs.gnu.org, toimitus <at> masinistit.com, webmaster <at> ubuntu.com
Subject: Re: bug#29668: grep: Fatal problem with (big) file
Date: Tue, 12 Dec 2017 16:28:09 -0800
On 12/11/2017 03:36 PM, Norihiro Tanaka wrote:
> Perhaps Tieliikenne 5.0.csv and volvot.csv contain characters that cannot
> be recognized in your locale.

Yes, that's the problem. The original 'grep' output ended in "Binary 
file Tieliikenne5.0.csv matches" but the user didn't see that. Perhaps 
we should send that diagnostic to stderr as well.





Information forwarded to bug-grep <at> gnu.org:
bug#29668; Package grep. (Wed, 13 Dec 2017 23:26:01 GMT) Full text and rfc822 format available.

Message #14 received at 29668 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 29668 <at> debbugs.gnu.org, toimitus <at> masinistit.com, webmaster <at> ubuntu.com,
 pg <pasi.vitsa <at> yahoo.com>
Subject: Re: bug#29668: grep: Fatal problem with (big) file
Date: Thu, 14 Dec 2017 08:25:26 +0900
On Tue, 12 Dec 2017 16:28:09 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 12/11/2017 03:36 PM, Norihiro Tanaka wrote:
> > Perhaps Tieliikenne 5.0.csv and volvot.csv contain characters that cannot
> > be recognized in your locale.
> 
> Yes, that's the problem. The original 'grep' output ended in "Binary file Tieliikenne5.0.csv matches" but the user didn't see that. Perhaps we should send that diagnostic to stderr as well.

I don't think that's the problem.  The user passes the output of grep to wc -l,
so the `Binary file ... matches' line is also counted by `wc' as one line.

$ env LC_ALL=C grep 'Volvo' Tieliikenne\ 5.0.csv | wc -l
266175
$ env LC_ALL=en_US.utf8 grep 'Volvo' Tieliikenne\ 5.0.csv | wc -l
241264
$ env LC_ALL=en_US.utf8 grep 'Volvo' Tieliikenne\ 5.0.csv | tail -1
Binary file Tieliikenne 5.0.csv matches

$ env LC_ALL=C grep N3 volvot.csv | wc -l
17822
$ env LC_ALL=en_US.utf8 grep N3 volvot.csv | wc -l
11741
$ env LC_ALL=en_US.utf8 grep N3 volvot.csv | tail -1
Binary file volvot.csv matches





Information forwarded to bug-grep <at> gnu.org:
bug#29668; Package grep. (Thu, 14 Dec 2017 00:05:02 GMT) Full text and rfc822 format available.

Message #17 received at 29668 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 29668 <at> debbugs.gnu.org, toimitus <at> masinistit.com, webmaster <at> ubuntu.com,
 pg <pasi.vitsa <at> yahoo.com>
Subject: Re: bug#29668: grep: Fatal problem with (big) file
Date: Wed, 13 Dec 2017 16:03:57 -0800
On 12/13/2017 03:25 PM, Norihiro Tanaka wrote:
> I don't think that's the problem.  The user passes the output of grep to wc -l,
> so the `Binary file ... matches' line is also counted by `wc' as one line.

The intent of 'grep PATTERN | wc -l' is to count the number of matches, 
like 'grep -c PATTERN' would. But it doesn't work that way here. E.g., 
on Fedora 27 with LANG=en_US.UTF-8:

$ grep -c Volvo Tieliikenne5.0.csv
266175
$ grep Volvo Tieliikenne5.0.csv | wc -l
241264
$ grep Volvo Tieliikenne5.0.csv | tail -n 1
Binary file Tieliikenne5.0.csv matches

If the "Binary file ... matches" line were sent to stderr instead of to 
stdout, the problem would be more obvious to the user:

$ grep -c Volvo Tieliikenne5.0.csv
266175
$ grep Volvo Tieliikenne5.0.csv | wc -l
Binary file Tieliikenne5.0.csv matches
241264
$ grep Volvo Tieliikenne5.0.csv | tail -n 1
Binary file Tieliikenne5.0.csv matches
T;2017-09-29;75;01;;;19550000;;;;;1;1570;;3000;2595;1670;;01;2200;20.6;4;false;false;Volvo;;;;;01;;01;977;;;841;;5092946

I believe that in the past I've thought that the "Binary file" message 
should be sent to stdout, but these examples are a reasonably compelling 
reason to send them to stderr instead.
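
The difference between the two ways of counting can be reproduced on a small scale. The following is a sketch only, assuming GNU grep; the sample file is hypothetical and uses a NUL byte (rather than an encoding error) so that grep classifies it as binary in any locale.

```shell
# Hypothetical three-line sample; the NUL byte on line 2 makes it "binary".
sample=$(mktemp)
printf 'Volvo 1\nVolvo\000 2\nVolvo 3\n' > "$sample"

# -c counts matching lines inside grep, so binary detection does not
# truncate the result.
count_c=$(grep -c Volvo "$sample")

# -a forces text mode, so every matching line reaches wc.
count_a=$(grep -a Volvo "$sample" | wc -l)

# Without -a, matching lines are suppressed once binary data is seen, so
# this count is smaller; its exact value depends on whether this grep
# writes the "Binary file" notice to stdout or to stderr.
count_pipe=$(grep Volvo "$sample" 2>/dev/null | wc -l)

printf 'grep -c:         %s\n' "$count_c"
printf 'grep -a | wc -l: %s\n' "$count_a"
printf 'grep | wc -l:    %s\n' "$count_pipe"

rm -f "$sample"
```

When the aim is only to count matching lines, grep -c avoids the pipeline entirely.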




Information forwarded to bug-grep <at> gnu.org:
bug#29668; Package grep. (Sat, 16 Dec 2017 00:27:02 GMT) Full text and rfc822 format available.

Message #20 received at 29668 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 29668 <at> debbugs.gnu.org, toimitus <at> masinistit.com, webmaster <at> ubuntu.com,
 pg <pasi.vitsa <at> yahoo.com>
Subject: Re: bug#29668: grep: Fatal problem with (big) file
Date: Sat, 16 Dec 2017 09:25:59 +0900
On Wed, 13 Dec 2017 16:03:57 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 12/13/2017 03:25 PM, Norihiro Tanaka wrote:
> > I don't think that's the problem.  The user passes the output of grep to wc -l,
> > so the `Binary file ... matches' line is also counted by `wc' as one line.
> 
> The intent of 'grep PATTERN | wc -l' is to count the number of matches, like 'grep -c PATTERN' would. But it doesn't work that way here. E.g., on Fedora 27 with LANG=en_US.UTF-8:
> 
> $ grep -c Volvo Tieliikenne5.0.csv
> 266175
> $ grep Volvo Tieliikenne5.0.csv | wc -l
> 241264
> $ grep Volvo Tieliikenne5.0.csv | tail -n 1
> Binary file Tieliikenne5.0.csv matches
> 
> If the "Binary file ... matches" line were sent to stderr instead of to stdout, the problem would be more obvious to the user:
> 
> $ grep -c Volvo Tieliikenne5.0.csv
> 266175
> $ grep Volvo Tieliikenne5.0.csv | wc -l
> Binary file Tieliikenne5.0.csv matches
> 241264
> $ grep Volvo Tieliikenne5.0.csv | tail -n 1
> Binary file Tieliikenne5.0.csv matches
> T;2017-09-29;75;01;;;19550000;;;;;1;1570;;3000;2595;1670;;01;2200;20.6;4;false;false;Volvo;;;;;01;;01;977;;;841;;5092946
> 
> I believe that in the past I've thought that the "Binary file" message should be sent to stdout, but these examples are a reasonably compelling reason to send them to stderr instead.

In addition, the following problem can also occur.

$ printf 'Binary file a.txt matches\n' >a.txt
$ env LC_ALL=en_US.utf8 grep B a.txt
Binary file a.txt matches

$ printf '\xFFB\n' >a.txt
$ env LC_ALL=en_US.utf8 grep B a.txt
Binary file a.txt matches

Both produce the same output.  However, the former displays the contents of
the matched line, whereas the latter does not.  If "Binary file" is sent to
stdout, a user cannot tell whether a.txt is a text file or a binary file
without opening the file.
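
One way to see which stream a given grep uses for the notice is to capture stdout and stderr separately. This is a sketch under assumptions: the sample file is hypothetical and uses a NUL byte so the result does not depend on the locale, and the stream that ends up holding the notice depends on whether the installed grep includes the change discussed in this thread.

```shell
# Hypothetical binary sample: one line containing "B" followed by a NUL byte.
sample=$(mktemp)
err=$(mktemp)
printf 'B\000\n' > "$sample"

# Capture the two streams separately; LC_ALL=C keeps the notice untranslated.
out=$(LC_ALL=C grep B "$sample" 2>"$err")
err_text=$(cat "$err")

# Exactly one of the two is expected to carry the "binary file" notice.
printf 'stdout: [%s]\n' "$out"
printf 'stderr: [%s]\n' "$err_text"

rm -f "$sample" "$err"
```

If the notice shows up under stderr, anything left on stdout is genuine file content, which removes the ambiguity described above.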





Information forwarded to bug-grep <at> gnu.org:
bug#29668; Package grep. (Thu, 02 Jan 2020 08:55:02 GMT) Full text and rfc822 format available.

Message #23 received at 29668 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jason Franklin <jrf <at> elitemail.org>
Cc: 33552 <at> debbugs.gnu.org, 29668 <at> debbugs.gnu.org
Subject: Re: Possible bug with handling -I option
Date: Thu, 2 Jan 2020 00:54:42 -0800
Jason, thanks for reporting this grep bug <https://bugs.gnu.org/33552>. It
strikes me that this is related to another grep bug <https://bugs.gnu.org/29668>
concerning the "Binary files ..." message. Although they're not the same bug,
it's likely that fixing one will also entail fixing the other. So I'll add a
message to both bug reports to this effect.




Information forwarded to bug-grep <at> gnu.org:
bug#29668; Package grep. (Thu, 17 Sep 2020 18:47:01 GMT) Full text and rfc822 format available.

Message #26 received at 29668 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jason Franklin <jrf <at> elitemail.org>
Cc: 33552 <at> debbugs.gnu.org, 29668 <at> debbugs.gnu.org, pg <pasi.vitsa <at> yahoo.com>,
 Jim Meyering <jim <at> meyering.net>, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: grep patches for "Binary file FOO matches" glitches
Date: Thu, 17 Sep 2020 11:46:03 -0700
[Message part 1 (text/plain, inline)]
Attached are two related 'grep' patches, one prompted by Bug#33552 "Possible bug 
with handling -I option" and the other by Bug#29668 "grep: Fatal problem with 
(big) file". Although I'd normally install these on grep master, Jim has started 
the ball rolling on the next grep release so I'll cc this to him to see whether 
these patches can be squeezed in before the next release.
[0001-Suppress-Binary-file-FOO-matches-if-I.patch (text/x-patch, attachment)]
[0002-Send-Binary-file-FOO-matches-to-stderr.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#29668; Package grep. (Thu, 17 Sep 2020 19:06:02 GMT) Full text and rfc822 format available.

Message #29 received at 29668 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 33552 <at> debbugs.gnu.org, 29668 <at> debbugs.gnu.org, pg <pasi.vitsa <at> yahoo.com>,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Jason Franklin <jrf <at> elitemail.org>
Subject: Re: grep patches for "Binary file FOO matches" glitches
Date: Thu, 17 Sep 2020 12:04:55 -0700
On Thu, Sep 17, 2020 at 11:46 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Attached are two related 'grep' patches, one prompted by Bug#33552 "Possible bug
> with handling -I option" and the other by Bug#29668 "grep: Fatal problem with
> (big) file". Although I'd normally install these on grep master, Jim has started
> the ball rolling on the next grep release so I'll cc this to him to see whether
> these patches can be squeezed in before the next release.

Nice! Thank you for resolving those.
The first one did indeed simplify numerous tests.
Both look fine and seem uncontroversial, so please go ahead and push them.
I'll probably update to latest gnulib this evening and then make a new snapshot.




Information forwarded to bug-grep <at> gnu.org:
bug#29668; Package grep. (Fri, 18 Sep 2020 03:00:02 GMT) Full text and rfc822 format available.

Message #32 received at 29668 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: 33552 <at> debbugs.gnu.org, 29668 <at> debbugs.gnu.org, pg <pasi.vitsa <at> yahoo.com>,
 Jason Franklin <jrf <at> elitemail.org>
Subject: Re: bug#29668: grep patches for "Binary file FOO matches" glitches
Date: Thu, 17 Sep 2020 19:58:58 -0700
[Message part 1 (text/plain, inline)]
On 9/17/20 3:03 PM, Jim Meyering wrote:
> The alternative is to change that "B" to a "b", which should be fine,
> now that it's only emitted to stderr.

Makes sense.

NEWS should be updated accordingly - but when I looked into doing that I came up 
with the attached more-elaborate patch, which changes this new diagnostic and 
two other unusual-format diagnostics, so that they use the same "grep: FILENAME: 
MESSAGE" form that grep uses everywhere else. Whaddya think?
[0001-grep-be-more-consistent-about-diagnostic-format.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#29668; Package grep. (Fri, 18 Sep 2020 14:07:01 GMT) Full text and rfc822 format available.

Message #35 received at 29668 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 33552 <at> debbugs.gnu.org, 29668 <at> debbugs.gnu.org, pg <pasi.vitsa <at> yahoo.com>,
 Jason Franklin <jrf <at> elitemail.org>
Subject: Re: bug#29668: grep patches for "Binary file FOO matches" glitches
Date: Fri, 18 Sep 2020 07:05:48 -0700
On Thu, Sep 17, 2020 at 7:59 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 9/17/20 3:03 PM, Jim Meyering wrote:
> > The alternative is to change that "B" to a "b", which should be fine,
> > now that it's only emitted to stderr.
>
> Makes sense.
>
> NEWS should be updated accordingly - but when I looked into doing that I came up
> with the attached more-elaborate patch, which changes this new diagnostic and
> two other unusual-format diagnostics, so that they use the same "grep: FILENAME:
> MESSAGE" form that grep uses everywhere else. Whaddya think?

Nice. Dropping the quote module (even if negligible size delta) is a
fine side effect. You're welcome to push that.
Thanks!




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 21 Sep 2020 17:56:02 GMT) Full text and rfc822 format available.

Notification sent to pg <pasi.vitsa <at> yahoo.com>:
bug acknowledged by developer. (Mon, 21 Sep 2020 17:56:03 GMT) Full text and rfc822 format available.

Message #40 received at 29668-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: 29668-done <at> debbugs.gnu.org, 33552-done <at> debbugs.gnu.org,
 pg <pasi.vitsa <at> yahoo.com>, Jason Franklin <jrf <at> elitemail.org>
Subject: Re: bug#33552: grep patches for "Binary file FOO matches" glitches
Date: Mon, 21 Sep 2020 10:54:58 -0700
On 9/17/20 12:04 PM, Jim Meyering wrote:
> please go ahead and push them.

As that's been done and the bug fixes are now installed, I'm closing both bug 
reports.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 20 Oct 2020 11:24:12 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 188 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.