GNU bug report logs - #22028
grep -Pc / grep -P | wc -l inconsistent results

Package: grep;

Reported by: Jaroslav Skarvada <jskarvad <at> redhat.com>

Date: Fri, 27 Nov 2015 11:30:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 22028 in the body.
You can then email your comments to 22028 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#22028; Package grep. (Fri, 27 Nov 2015 11:30:02 GMT)

Acknowledgement sent to Jaroslav Skarvada <jskarvad <at> redhat.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 27 Nov 2015 11:30:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jaroslav Skarvada <jskarvad <at> redhat.com>
To: bug-grep <at> gnu.org
Subject: grep -Pc / grep -P | wc -l inconsistent results
Date: Fri, 27 Nov 2015 06:29:31 -0500 (EST)
[Message part 1 (text/plain, inline)]
Hi,

it seems that for long files that start with non-binary data, when the PCRE
matcher is used, grep works in TEXTBIN_UNKNOWN mode until it finds binary data
and then switches to TEXTBIN_BINARY. But in -Pc mode, once in TEXTBIN_BINARY,
it exits on the next match, causing bogus -Pc results.

Reproducer:
$ grep -P -c 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt
1
$ grep -P 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt | wc -l
2

./filtered.txt is a sufficiently long text file that contains some NULs after
the first 32 kB of text, e.g. https://bugzilla.redhat.com/attachment.cgi?id=1080646
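
For readers without access to the attachment, a stand-in file with the same shape can be generated locally. This is only a minimal sketch, not the reporter's data: the file name sample-filtered.txt, the padding text, and the number of padding lines are assumptions, chosen so that more than 32 KiB of plain text precedes the first NUL. The two counts it prints depend on the installed grep version and on whether grep was built with PCRE support, so no expected output is shown.

```shell
#!/bin/sh
# Hypothetical stand-in for filtered.txt; name and contents are assumptions.
f=sample-filtered.txt
pattern='Blocked by (SpamAssassin|Spamfilter)'

{
  # First match, inside the initial text-only region.
  printf 'Blocked by SpamAssassin\n'
  # 1000 lines of 39 bytes each (39000 bytes), so that more than
  # 32768 bytes of plain ASCII precede the first NUL byte.
  i=0
  while [ "$i" -lt 1000 ]; do
    printf 'padding line %04d: ordinary ASCII text\n' "$i"
    i=$((i + 1))
  done
  # A line containing NUL bytes, which makes the file binary.
  printf 'nul\000\000\000bytes\n'
  # Second match, after the binary data.
  printf 'Blocked by Spamfilter\n'
} > "$f"

# Compare the two ways of counting that the report contrasts.
count_c=$(grep -P -c "$pattern" "$f" || true)
count_wc=$(grep -P "$pattern" "$f" | wc -l)
echo "grep -P -c:      $count_c"
echo "grep -P | wc -l: $count_wc"
```

Note that when grep decides a file is binary it may report a "binary file matches" message instead of the matching line, and depending on the grep version that message goes to standard output or standard error, which affects what wc -l sees.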

Original downstream bugzilla:
https://bugzilla.redhat.com/attachment.cgi?id=1080646

Attached is my attempt at a fix, but it may not be the right way to fix it.
In particular, the question is whether grep should stop when it finds binary
data. But at least grep -Pc and grep -P | wc -l should behave the same.

thanks & regards

Jaroslav
[0001-grep-do-not-stop-on-binary-data-if-counting-in-PCRE.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22028; Package grep. (Sat, 28 Nov 2015 06:17:02 GMT)

Message #8 received at 22028 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 22028 <at> debbugs.gnu.org
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>
Subject: Re: bug#22028: grep -Pc / grep -P | wc -l inconsistent results
Date: Sat, 28 Nov 2015 15:16:30 +0900
[Message part 1 (text/plain, inline)]
On Fri, 27 Nov 2015 06:29:31 -0500 (EST)
Jaroslav Skarvada <jskarvad <at> redhat.com> wrote:

> Hi,
> 
> it seems that for long files that start with non-binary data, when the PCRE
> matcher is used, grep works in TEXTBIN_UNKNOWN mode until it finds binary data
> and then switches to TEXTBIN_BINARY. But in -Pc mode, once in TEXTBIN_BINARY,
> it exits on the next match, causing bogus -Pc results.
> 
> Reproducer:
> $ grep -P -c 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt
> 1
> $ grep -P 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt | wc -l
> 2
> 
> ./filtered.txt is a sufficiently long text file that contains some NULs after
> the first 32 kB of text, e.g. https://bugzilla.redhat.com/attachment.cgi?id=1080646
> 
> Original downstream bugzilla:
> https://bugzilla.redhat.com/attachment.cgi?id=1080646
> 
> Attached is my attempt at a fix, but it may not be the right way to fix it.
> In particular, the question is whether grep should stop when it finds binary
> data. But at least grep -Pc and grep -P | wc -l should behave the same.
> 
> thanks & regards
> 
> Jaroslav

I see that filtered.txt is a binary file, as NULs are included at line 647.
However, the first 32768 bytes are correctly encoded.
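
Both observations can be checked with standard tools. The following is a minimal sketch on a small hypothetical file, nul-demo.txt, used in place of the real attachment. It assumes the file has no 0x01 bytes of its own, since NULs are mapped to 0x01 so that line-oriented tools can locate them.

```shell
#!/bin/sh
# Hypothetical demo file: three text lines, with NUL bytes on line 2.
f=nul-demo.txt
printf 'first line\nsecond\000\000line\nthird line\n' > "$f"

# Count the NUL bytes within the first 32768 bytes; 0 would mean that
# region is NUL-free. The demo file is shorter than 32768 bytes, so
# this inspects all of it.
nul_bytes=$(head -c 32768 "$f" | tr -dc '\000' | wc -c)

# Find the number of the first line that contains a NUL. tr maps NUL
# to 0x01 first, because a NUL cannot be passed in a shell argument.
soh=$(printf '\001')
first_nul_line=$(tr '\000' '\001' < "$f" | grep -n "$soh" | head -n 1 | cut -d: -f1)

echo "NUL bytes in first 32768 bytes: $nul_bytes"
echo "first line containing a NUL: $first_nul_line"
```

Pointing f at the real file instead of the demo file would show whether its first 32768 bytes are NUL-free and on which line the first NUL appears.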

If the first 32768 bytes of a file are correctly encoded, grep -P marks the
file not as TEXTBIN_TEXT but as TEXTBIN_UNKNOWN, and when it finds the first
match, it marks the file as TEXTBIN_TEXT, that is, it treats it as a text
file.  However, grep -P -c does not take that last step.

So you can get the number of matching lines with grep -a -P -c.
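
A minimal sketch of this workaround follows. It assumes a hypothetical file sample-nul.txt with the shape the report describes (more than 32 KiB of text, then NUL bytes, with one matching line on each side) and a grep built with PCRE support. The -a option (--binary-files=text) makes grep process the file as text throughout, so the count does not depend on when binary data is detected.

```shell
#!/bin/sh
# Hypothetical test file; the name and contents are assumptions.
f=sample-nul.txt
pattern='Blocked by (SpamAssassin|Spamfilter)'

{
  # Matching line before the binary data.
  printf 'Blocked by SpamAssassin\n'
  # 1000 lines of 36 bytes each (36000 bytes) of plain text.
  awk 'BEGIN { for (i = 0; i < 1000; i++) print "padding line of ordinary ASCII text" }'
  # A line containing NUL bytes.
  printf 'nul\000\000\000bytes\n'
  # Matching line after the binary data.
  printf 'Blocked by Spamfilter\n'
} > "$f"

# -a treats the file as text, so matching lines on both sides of the
# NUL bytes are counted.
count=$(grep -a -P -c "$pattern" "$f" || true)
echo "grep -a -P -c: $count"
```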

Thanks,
Norihiro
[0001-grep-P-grep-Pc-consistent-results.patch (text/plain, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Thu, 31 Dec 2015 07:28:01 GMT)

Notification sent to Jaroslav Skarvada <jskarvad <at> redhat.com>:
bug acknowledged by developer. (Thu, 31 Dec 2015 07:28:02 GMT)

Message #13 received at 22028-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jaroslav Skarvada <jskarvad <at> redhat.com>
Cc: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 22028-done <at> debbugs.gnu.org
Subject: Re: grep -Pc / grep -P | wc -l inconsistent results
Date: Wed, 30 Dec 2015 23:27:37 -0800
[Message part 1 (text/plain, inline)]
Thanks for the bug report and fix, Jaroslav. And thanks, Norihiro, for the test 
case; I think I independently came up with something similar to your grep.c fix 
in my earlier patches today and so I expect that part of your changes are no 
longer needed. I installed the attached combined patch for this bug and am 
marking it as done.
[0001-grep-c-should-keep-counting-after-binary-data.patch (text/x-diff, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 28 Jan 2016 12:24:04 GMT)

This bug report was last modified 8 years and 111 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.