GNU bug report logs - #17245
GREP BUG: grep -P and binary files

Package: grep

Reported by: damon <dh <at> bug-grep.usrbin.org>

Date: Sat, 12 Apr 2014 00:28:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 17245 in the body.
You can then email your comments to 17245 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Sat, 12 Apr 2014 00:28:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to damon <dh <at> bug-grep.usrbin.org>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 12 Apr 2014 00:28:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: damon <dh <at> bug-grep.usrbin.org>
To: bug-grep <at> gnu.org
Subject: GREP BUG: grep -P and binary files
Date: Fri, 11 Apr 2014 16:47:03 -0700
Hi there -

I recently noticed a bug after upgrading grep and have tracked it
through a few versions now.

I was using grep -P (PCRE grep) in some scripts to grep through a
directory of files, and the process would keep aborting with a
segmentation fault.

The last known good version is grep-2.14.  Every version after that has
failed in a slightly different way, making me think this could be a bug
in grep, not in pcre.

I tried compiling greps 2.14 through 2.18 against the latest pcre
library, pcre-8.33.  Here's what happens when i try each version against
a random binary file, attached to this message as test-image.png.  This
file was just one of many that caused the errors, though not every
binary file does.

Below are some results demonstrating what's going wrong.  Note that all
of these seem to work fine with regular grep or with grep -E.  Please
let me know what else i can do to help track this down!

# grep-2.14/src/grep -P '\[.?max' test-image.png
(works, does not match)

# grep-2.15/src/grep -P '\[.?max' test-image.png
Aborted

# grep-2.16/src/grep -P '\[.?max' test-image.png
Binary file test-image.png matches
(erroneous - should not match)

# grep-2.16/src/grep -P '.?max' test-image.png
Segmentation fault

# grep-2.17/src/grep -P '\[.?max' test-image.png
Segmentation fault

# grep-2.18/src/grep -P '\[.?max' test-image.png
Segmentation fault

# grep-2.18/src/grep -P '.?ma' test-image.png
Segmentation fault

# grep-2.18/src/grep -P '.?m' test-image.png
Binary file test-image.png matches

-damon
[test-image.png (image/png, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Sat, 12 Apr 2014 16:17:02 GMT) Full text and rfc822 format available.

Message #8 received at 17245 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: damon <dh <at> bug-grep.usrbin.org>
Cc: 17245 <at> debbugs.gnu.org
Subject: bug#17245: GREP BUG: grep -P and binary files
Date: Sun, 13 Apr 2014 01:16:25 +0900
This bug is similar to bug#16586.

It seems that the pointer `eptr', which tracks the current position in
the text, moved back past the starting position during backward
searching.  It seems that the PCRE library may assume that the text
contains no invalid UTF-8 sequences.

Could you retry in a non-UTF-8 locale?

Norihiro
[backtrace.log (text/plain, attachment)]
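
As a rough illustration of the point about invalid UTF-8 (a Python sketch, not code from grep or PCRE; the helper name `first_invalid_utf8_offset` is made up here), the following reports where a byte buffer first fails to decode. The standard 8-byte PNG signature already fails at offset 0, because 0x89 is a continuation byte with no lead byte in front of it.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
print(first_invalid_utf8_offset(PNG_SIGNATURE))   # 0
print(first_invalid_utf8_offset(b"plain ASCII"))  # None
```

So a PNG file read under a UTF-8 locale is, from the first byte on, not the kind of text a UTF-8 matcher expects.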

Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Sat, 12 Apr 2014 16:24:02 GMT) Full text and rfc822 format available.

Message #11 received at 17245 <at> debbugs.gnu.org (full text, mbox):

From: damon <dh <at> bug-grep.usrbin.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 17245 <at> debbugs.gnu.org
Subject: Re: bug#17245: GREP BUG: grep -P and binary files
Date: Sat, 12 Apr 2014 09:22:33 -0700
Hi there -

Bingo, that does it.  I have LANG set to en_CA.utf-8.  If i run:

LANG=en_CA grep-2.18/src/grep -P '\[.?max' test-image.png

It works fine, reporting no match.

Same for every other version i have compiled.

So definitely utf-8 related.  Let me know if i can provide anything
else.

-damon

On 13 Apr, Norihiro Tanaka wrote:

> This bug is similar to bug#16586.
>
> It seems that the pointer `eptr', which tracks the current position in
> the text, moved back past the starting position during backward
> searching.  It seems that the PCRE library may assume that the text
> contains no invalid UTF-8 sequences.
>
> Could you retry in a non-UTF-8 locale?
>
> Norihiro

> $ gdb src/grep core.1430
> GNU gdb (GDB) 7.6.2
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "i386-pc-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /home/staff/b/grep-2.18/src/grep...done.
> [New LWP 1430]
>
> warning: Can't read pathname for load map: Input/output error.
> Core was generated by `src/grep -P .?ma test-image.png'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x001612ca in match (eptr=0x9a24fff <Address 0x9a24fff out of bounds>,
> ecode=0x9a25e65 "\035m\035ax",
>     mstart=0x9a26e9d
> "\272\374;\017\233\323\230:\364\005+\373a&\367\032X\304\216
> \342y\274\301\357\361\005",
>     offset_top=2, md=0xbfe18a64, eptrb=0x0, rdepth=0) at pcre_exec.c:5943
> 5943              BACKCHAR(eptr);
> (gdb) bt
> #0  0x001612ca in match (eptr=0x9a24fff <Address 0x9a24fff out of bounds>,
> ecode=0x9a25e65 "\035m\035ax",
>     mstart=0x9a26e9d
> "\272\374;\017\233\323\230:\364\005+\373a&\367\032X\304\216
> \342y\274\301\357\361\005",
>     offset_top=2, md=0xbfe18a64, eptrb=0x0, rdepth=0) at pcre_exec.c:5943
> #1  0x0016308a in pcre_exec (argument_re=0x9a25e28, extra_data=0x9a25e78,
>     subject=0x9a26e9d
> "\272\374;\017\233\323\230:\364\005+\373a&\367\032X\304\216
> \342y\274\301\357\361\005",
>     length=101, start_offset=0, options=8192, offsets=0xbfe18bdc,
> offsetcount=300) at pcre_exec.c:6941
> #2  0x0805a472 in Pexecute (buf=0x9a26000 "\211PNG\r\n\032\n", size=6568,
> match_size=0xbfe19114, start_ptr=0x0)
>     at pcresearch.c:174
> #3  0x0804ba07 in do_execute (buf=0x9a26000 "\211PNG\r\n\032\n", size=6568,
> match_size=0xbfe19114, start_ptr=0x0)
>     at grep.c:1073
> #4  0x0804bc98 in grepbuf (beg=0x9a26000 "\211PNG\r\n\032\n",
>     lim=0x9a279a8
> "\217\222(\016\001c\025R\221c\233S\250\327\177m\002\344Q\022\362$\320\066\3
> 76\327\245{\f\035D\001\260\251\326a\247{T\200_\bj8\274") at grep.c:1109
> #5  0x0804bfb3 in grep (fd=3, st=0xbfe19200) at grep.c:1220
> #6  0x0804c9ab in grepdesc (desc=3, command_line=1) at grep.c:1474
> #7  0x0804c650 in grepfile (dirdesc=-100, name=0xbfe19889 "test-image.png",
> follow=1, command_line=1) at grep.c:1375
> #8  0x0804cc22 in grep_command_line_arg (arg=0xbfe19889 "test-image.png") at
> grep.c:1526
> #9  0x0804e358 in main (argc=4, argv=0xbfe194a4) at grep.c:2362



--
Damon Harper           _/\_    Nothing is as simple as it seems at
damon <at> usrbin.ca      __\  /__  first, as hopeless as it seems in
                     \      /  the middle, or as finished as it
www.usrbin.ca/damon   |/||\|   seems in the end.
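
The backtrace stops in `BACKCHAR(eptr)` with `eptr` out of bounds. The Python sketch below models that kind of backward step. It assumes the step simply moves back while the current byte is a UTF-8 continuation byte (bit pattern 10xxxxxx) and trusts the subject to be valid. The function name and the explicit bounds error are illustrative, not PCRE code. The subject shown in frame #1 begins with \272 (0xBA), which is such a continuation byte.

```python
def backchar(subject: bytes, pos: int) -> int:
    """Move pos back to the first byte of the character containing subject[pos].

    Models an unchecked backward walk: it stops only at a byte that is not a
    continuation byte. A C pointer would silently leave the buffer; this
    sketch raises IndexError in that case instead.
    """
    while True:
        if pos < 0:
            raise IndexError("walked past the start of the subject")
        if subject[pos] & 0xC0 != 0x80:  # not a continuation byte: character start
            return pos
        pos -= 1


valid = "a\u00e9".encode("utf-8")  # b'a\xc3\xa9'
print(backchar(valid, 2))          # 1

try:
    backchar(b"\xba\xfc;", 0)      # subject starts with a continuation byte
except IndexError as err:
    print("error:", err)           # error: walked past the start of the subject
```

On valid UTF-8 the walk always reaches a lead byte. On arbitrary binary data nothing guarantees that one exists before the start of the buffer.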




Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Sun, 13 Apr 2014 19:14:03 GMT) Full text and rfc822 format available.

Message #14 received at 17245 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: damon <dh <at> bug-grep.usrbin.org>
Cc: 17245 <at> debbugs.gnu.org
Subject: Re: bug#17245: GREP BUG: grep -P and binary files
Date: Sun, 13 Apr 2014 12:13:12 -0700
On Fri, Apr 11, 2014 at 4:47 PM, damon <dh <at> bug-grep.usrbin.org> wrote:
> Hi there -
>
> I recently noticed a bug after upgrading grep and have tracked it
> through a few versions now.
>
> I was using grep -P (PCRE grep) in some scripts to grep through a
> directory of files, and the process would keep aborting with a
> segmentation fault.
>
> The last known good version is grep-2.14.  Every version after that has
> failed in a slightly different way, making me think this could be a bug
> in grep, not in pcre.
>
> I tried compiling greps 2.14 through 2.18 against the latest pcre
> library, pcre-8.33.  Here's what happens when i try each version against
> a random binary file, attached to this message as test-image.png.  This
> file was just one of many that caused the errors, though not every
> binary file does.
>
> Below are some results demonstrating what's going wrong.  Note that all
> of these seem to work fine with regular grep or with grep -E.  Please
> let me know what else i can do to help track this down!
>
> # grep-2.14/src/grep -P '\[.?max' test-image.png
> (works, does not match)
...
> # grep-2.18/src/grep -P '\[.?max' test-image.png
> Segmentation fault
>
> # grep-2.18/src/grep -P '.?ma' test-image.png
> Segmentation fault
>
> # grep-2.18/src/grep -P '.?m' test-image.png
> Binary file test-image.png matches

Thank you for the bug report.
That is due to a bug in libpcre.  I've confirmed that it is still
triggered even when using the latest grep.git linked with
the latest from pcre.git (latest commit has "Final tidies for
8.35 release." as the subject).  I built grep as usual, and
then ran this:

  rm src/grep; make LIB_PCRE=$PWD/../pcre/.libs/libpcre.a

Confirm that grep is not using a shared libpcre (this must print nothing):

  ldd src/grep|grep pcre

That presumes I had already built the latest pcre/ in ../pcre.
Then, run this to test it with a non-UTF8 locale, and it is
error-free, correctly finding no match:

  LC_ALL=ja_JP.eucJP valgrind src/grep -P '\[.?max' test-image.png

Repeat using a UTF8 locale, and you see that valgrind reports
numerous buffer overrun and heap-use-after-free errors:

  LC_ALL=en_US.utf8 valgrind src/grep -P '\[.?max' test-image.png

Here is an equivalent but much smaller test case:

  $ printf 'a\201b\r'|LC_ALL=en_US.utf8 valgrind src/grep -P 'a.?XXb'

That segfaults.  Interestingly, if I replace each X with a ".",
grep gets into an infinite loop within libpcre's match function.
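
The locale comparison above can be scripted. The Python sketch below is one way to do it, under a few assumptions. The function name, the 10-second default timeout and the outcome labels are choices made for this sketch. It also assumes a POSIX system, where `subprocess` reports death by signal as a negative return code. The timeout is there because of the infinite-loop variant.

```python
import os
import subprocess
from typing import Sequence


def run_in_locale(cmd: Sequence[str], locale: str, stdin: bytes = b"",
                  timeout: float = 10.0) -> str:
    """Run cmd with LC_ALL set to locale and describe how it ended."""
    env = dict(os.environ, LC_ALL=locale)
    try:
        proc = subprocess.run(cmd, input=stdin, env=env, timeout=timeout,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        return f"killed by signal {-proc.returncode}"
    # grep exit statuses: 0 = line selected, 1 = none selected, else error.
    labels = {0: "match", 1: "no match"}
    return labels.get(proc.returncode, f"error (status {proc.returncode})")


# Example call (not executed here; needs a grep built with -P support):
#   run_in_locale(["src/grep", "-P", "a.?XXb"], "en_US.utf8", b"a\x81b\r")
```

Running the same command once per locale and comparing the returned strings gives a quick pass/fail view without reading valgrind output.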




Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Sun, 13 Apr 2014 23:18:02 GMT) Full text and rfc822 format available.

Message #17 received at 17245 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: damon <dh <at> bug-grep.usrbin.org>
Cc: 17245 <17245 <at> debbugs.gnu.org>
Subject: Re: bug#17245: GREP BUG: grep -P and binary files
Date: Sun, 13 Apr 2014 16:17:25 -0700
On Sun, Apr 13, 2014 at 12:13 PM, Jim Meyering <jim <at> meyering.net> wrote:
> ...
> That segfaults.  Interestingly, if I replace each X with a ".",
> grep gets into an infinite loop within libpcre's match function.

FYI, I'm pushing the attached patch, to add a test for this.
It fails with the latest pcre from git (8.35), but passes with debian
unstable's libpcre3 8.31-3:
[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Tue, 15 Apr 2014 23:49:02 GMT) Full text and rfc822 format available.

Message #20 received at 17245 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 17245 <at> debbugs.gnu.org
Subject: bug#17245: GREP BUG: grep -P and binary files
Date: Wed, 16 Apr 2014 08:48:44 +0900
I confirmed that this bug is also avoided by re-compiling PCRE with
--enable-git option.

PCRE without --enable-git:
$ env LC_ALL=en_US.utf8 src/grep -P '.?ma' test-image.png
Segmentation fault (core dumped)

PCRE with --enable-git:
$ env LC_ALL=en_US.utf8 src/grep -P '.?ma' test-image.png
Binary file ../test-image.png matches





Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Tue, 15 Apr 2014 23:59:02 GMT) Full text and rfc822 format available.

Message #23 received at 17245 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 17245 <at> debbugs.gnu.org
Subject: Re: bug#17245: GREP BUG: grep -P and binary files
Date: Tue, 15 Apr 2014 16:58:22 -0700
Norihiro Tanaka wrote:
> I confirmed that this bug is also avoided by re-compiling PCRE with
> --enable-git option.

Sorry, what's --enable-git?




Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Wed, 16 Apr 2014 00:04:02 GMT) Full text and rfc822 format available.

Message #26 received at 17245 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 17245 <at> debbugs.gnu.org
Subject: Re: bug#17245: GREP BUG: grep -P and binary files
Date: Tue, 15 Apr 2014 17:03:00 -0700
On Tue, Apr 15, 2014 at 4:48 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> I confirmed that this bug is also avoided by re-compiling PCRE with
> --enable-git option.
>
> PCRE without --enable-git:
> $ env LC_ALL=en_US.utf8 src/grep -P '.?ma' test-image.png
> Segmentation fault (core dumped)
>
> PCRE with --enable-git:
> $ env LC_ALL=en_US.utf8 src/grep -P '.?ma' test-image.png
> Binary file ../test-image.png matches

Thank you.
I presume you meant --enable-jit.
However, even when building the latest pcre like this:

  ./configure --enable-unicode-properties --enable-utf8 --enable-jit && make

and linking grep with its resulting .a file, my new pcre-infloop test
still failed.
However, with the attached patch to pcre, it passes:
[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Wed, 16 Apr 2014 12:14:01 GMT) Full text and rfc822 format available.

Message #29 received at 17245 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 17245 <at> debbugs.gnu.org
Subject: bug#17245: GREP BUG: grep -P and binary files
Date: Wed, 16 Apr 2014 21:13:54 +0900
Jim Meyering wrote:
> I presume you meant --enable-jit.

Sorry, you are right. It's --enable-jit.
I reported it to PCRE project.

http://bugs.exim.org/show_bug.cgi?id=1468





Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 21 Apr 2014 18:04:04 GMT) Full text and rfc822 format available.

Notification sent to damon <dh <at> bug-grep.usrbin.org>:
bug acknowledged by developer. (Mon, 21 Apr 2014 18:04:05 GMT) Full text and rfc822 format available.

Message #34 received at 17245-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 17245-done <at> debbugs.gnu.org, 
 16586-done <at> debbugs.gnu.org
Subject: Re: bug#17245: GREP BUG: grep -P and binary files
Date: Mon, 21 Apr 2014 11:03:10 -0700
On 04/16/2014 05:13 AM, Norihiro Tanaka wrote:
> http://bugs.exim.org/show_bug.cgi?id=1468 

Thanks.  The response there makes it clear that if grep passes arbitrary 
binary data to PCRE, and if grep uses PCRE_NO_UTF8_CHECK, undefined 
behavior will result (maybe infinite loop, core dump, etc.).  We can't 
have undefined behavior in grep.  A simple fix is to avoid using 
PCRE_NO_UTF8_CHECK so I installed the attached patch to do that.  
Perhaps we can think of a better way at some point.  In the meantime I'm 
taking the liberty of closing Bug#17245 and Bug#16586.
[0001-grep-P-now-rejects-invalid-input-sequences-in-UTF-8-.patch (text/x-patch, attachment)]
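
To make the user-visible effect of validating before matching concrete, here is a small Python model. It is only a sketch: Python's `re` stands in for PCRE, the function name and return shape are invented, and it says nothing about how the attached patch is implemented. Input that is not valid UTF-8 is refused outright with status 2. Otherwise the status follows grep's usual convention of 0 for a match and 1 for none.

```python
import re
from typing import List, Tuple


def strict_search(pattern: str, data: bytes) -> Tuple[int, List[str]]:
    """Return (exit_status, matching_lines); refuse input that is not UTF-8."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return 2, []  # invalid input: report an error, no matches
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a trailing newline
    rx = re.compile(pattern)
    hits = [line for line in lines if rx.search(line)]
    return (0 if hits else 1), hits


print(strict_search(r"a.?XXb", b"a\x81b\r"))  # (2, [])
print(strict_search(r"max", b"x\nmax=3\n"))   # (0, ['max=3'])
```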

Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Mon, 21 Apr 2014 22:09:01 GMT) Full text and rfc822 format available.

Message #37 received at 17245-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17245-done <at> debbugs.gnu.org
Subject: bug#17245: GREP BUG: grep -P and binary files
Date: Tue, 22 Apr 2014 07:08:45 +0900
Paul Eggert wrote:
> A simple fix is to avoid using PCRE_NO_UTF8_CHECK

Thanks.  I also agree with your thoughts.





Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Thu, 24 Apr 2014 02:32:03 GMT) Full text and rfc822 format available.

Message #40 received at 17245-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: 16586 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>, 
 Santiago <santiago <at> debian.org>
Cc: 17245-done <at> debbugs.gnu.org, Norihiro Tanaka <noritnk <at> kcn.ne.jp>,
 16586-done <at> debbugs.gnu.org
Subject: Re: bug#16586: bug#17245: GREP BUG: grep -P and binary files
Date: Wed, 23 Apr 2014 19:30:46 -0700
On Mon, Apr 21, 2014 at 11:03 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 04/16/2014 05:13 AM, Norihiro Tanaka wrote:
>>
>> http://bugs.exim.org/show_bug.cgi?id=1468
>
>
> Thanks.  The response there makes it clear that if grep passes arbitrary
> binary data to PCRE, and if grep uses PCRE_NO_UTF8_CHECK, undefined behavior
> will result (maybe infinite loop, core dump, etc.).  We can't have undefined
> behavior in grep.  A simple fix is to avoid using PCRE_NO_UTF8_CHECK so I
> installed the attached patch to do that.  Perhaps we can think of a better
> way at some point.  In the meantime I'm taking the liberty of closing
> Bug#17245 and Bug#16586.

Thanks for the patch, but I'm not sure I like the consequences:
that anyone using grep -P to search data that is even a tiny bit
inconsistent with their UTF-8 locale will now get an exit status of
2 rather than the matches they used to get. I would prefer to test for
working PCRE support and disable -P if it is deemed inadequate,
but that may have to wait for the release of a new version of
libpcre.

In any case, I found that this additional change is required,
at least on OS/X, to avoid a test failure:
[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Thu, 24 Apr 2014 05:40:03 GMT) Full text and rfc822 format available.

Message #43 received at 17245 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, 16586 <at> debbugs.gnu.org, 
 Santiago <santiago <at> debian.org>
Cc: 17245 <at> debbugs.gnu.org, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#16586: bug#17245: GREP BUG: grep -P and binary files
Date: Wed, 23 Apr 2014 22:39:10 -0700
Jim Meyering wrote:
> anyone using grep -P to search data that is even a tiny bit
> inconsistent with their UTF-8 locale will now get an exit status of
> 2 rather than the matches they used to get.

Yes, I don't like that either, but <http://bugs.exim.org/1468> says 
libpcre intends to have undefined behavior here.  If so, it wouldn't 
help to wait until the next libpcre release, which may well have a
serious bug of this form in a different area, a bug that's not easy to 
test for.

Perhaps somebody should modify grep -P to discard input lines containing 
non-UTF-8 data instead of presenting them to libpcre.  That way, it
would be safe for grep -P to use PCRE_NO_UTF8_CHECK.  Although grep -P 
should report an error and exit with status 2 if it discards input due 
to encoding errors, it can also report matches in lines that do not 
contain encoding errors, so that users can see both the error messages 
and the matches.
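
A minimal Python sketch of the per-line idea described above, under stated assumptions: lines are split on newline bytes, Python's `re` stands in for PCRE, and the function name and return shape are invented for illustration. Lines that fail UTF-8 decoding are counted and skipped. Matches in the remaining lines are still returned, and the status is 2 whenever anything was skipped.

```python
import re
from typing import List, Tuple


def search_skipping_invalid(pattern: str, data: bytes) -> Tuple[int, List[str], int]:
    """Return (exit_status, matching_lines, skipped_line_count)."""
    rx = re.compile(pattern)
    raw_lines = data.split(b"\n")
    if raw_lines and raw_lines[-1] == b"":
        raw_lines.pop()  # drop the empty piece after a trailing newline
    hits: List[str] = []
    skipped = 0
    for raw in raw_lines:
        try:
            line = raw.decode("utf-8")  # validation pass over this line
        except UnicodeDecodeError:
            skipped += 1                # never hand this line to the matcher
            continue
        if rx.search(line):             # matching pass over the same line
            hits.append(line)
    status = 2 if skipped else (0 if hits else 1)
    return status, hits, skipped


sample = b"max ok\n\x89PNG max\nother\n"
print(search_skipping_invalid(r"max", sample))  # (2, ['max ok'], 1)
```

Each valid line is decoded first and searched second, so the text is traversed twice.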





Information forwarded to bug-grep <at> gnu.org:
bug#17245; Package grep. (Thu, 24 Apr 2014 15:30:03 GMT) Full text and rfc822 format available.

Message #46 received at 17245 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17245 <17245 <at> debbugs.gnu.org>, Santiago <santiago <at> debian.org>,
 16586 <at> debbugs.gnu.org, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#16586: bug#17245: GREP BUG: grep -P and binary files
Date: Thu, 24 Apr 2014 08:29:07 -0700
On Wed, Apr 23, 2014 at 10:39 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Jim Meyering wrote:
>>
>> anyone using grep -P to search data that is even a tiny bit
>> inconsistent with their UTF-8 locale will now get an exit status of
>> 2 rather than the matches they used to get.
>
>
> Yes, I don't like that either, but <http://bugs.exim.org/1468> says libpcre

Oh! I had not read that. That is disappointing.

> intends to have undefined behavior here.  If so, it wouldn't help to wait
> until the next libpcre release, which may well have a serious bug of this
> form in a different area, a bug that's not easy to test for.

Indeed.

> Perhaps somebody should modify grep -P to discard input lines containing
> non-UTF-8 data instead of presenting them to libpcre.  That way, it would be
> safe for grep -P to use PCRE_NO_UTF8_CHECK.  Although grep -P should report
> an error and exit with status 2 if it discards input due to encoding errors,
> it can also report matches in lines that do not contain encoding errors, so
> that users can see both the error messages and the matches.

That sounds reasonable, but I don't like the requirement that
one make two passes over each subject text.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 23 May 2014 11:24:04 GMT) Full text and rfc822 format available.
