GNU bug report logs - #18806
grep -rP getline crashes prematurely (without displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8

Package: grep

Reported by: Shlomi Fish <shlomif <at> shlomifish.org>

Date: Thu, 23 Oct 2014 11:16:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 18806 in the body.
You can then email your comments to 18806 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Thu, 23 Oct 2014 11:16:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Shlomi Fish <shlomif <at> shlomifish.org>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Thu, 23 Oct 2014 11:16:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Shlomi Fish <shlomif <at> shlomifish.org>
To: bug-grep <at> gnu.org
Subject: grep -rP getline crashes prematurely (without displaying all
 results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Thu, 23 Oct 2014 14:15:16 +0300
Hi all,

see:

https://github.com/shlomif/grep-bug-big5-utf8-from-IO-All

You can cd to the directory and run "bash -x REPRODUCE.bash" (after checking
that it does not do anything harmful). I am getting:

shlomif <at> telaviv1:~/GREP-test$ LC_ALL=en_US.UTF-8 grep -rP getline grep-test/

grep-test/round_robin.t:while (my $line = $io->getline || $io->getlinegrep:
internal PCRE error: -32

with the latest git grep.

Regards,

	Shlomi Fish
-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
UNIX Fortune Cookies - http://www.shlomifish.org/humour/fortunes/

Xena the warrior princess can meet King David for breakfast and Julius Caesar
for lunch. Without time travel.

Please reply to list if it's a mailing list post - http://shlom.in/reply .




Information forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Thu, 23 Oct 2014 21:07:02 GMT) Full text and rfc822 format available.

Message #8 received at 18806 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Shlomi Fish <shlomif <at> shlomifish.org>, 18806 <at> debbugs.gnu.org
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Thu, 23 Oct 2014 14:06:32 -0700
On 10/23/2014 04:15 AM, Shlomi Fish wrote:
> internal PCRE error: -32
>
> with the latest git grep.

I am not seeing a problem with that test case on my platform.  I am 
running Fedora 20 x86-64, and compiled grep with GCC 4.9.1 (which I 
built myself) and linked with the standard Fedora package 
pcre-8.33-6.fc20.x86_64.  I also tried building with the Fedora GCC in 
32-bit mode, and couldn't reproduce the bug there either.

Possibly it's a libpcre problem?

I tested with grep commit b2490802defe3c3bf7ef0036a4515d006a08a769 and 
grep-bug-big5-utf8-from-IO-All commit 
9469e6e5be97d631c02bcfdbe814f43d1bb2df56.




Information forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Fri, 24 Oct 2014 09:27:02 GMT) Full text and rfc822 format available.

Message #11 received at 18806 <at> debbugs.gnu.org (full text, mbox):

From: Shlomi Fish <shlomif <at> shlomifish.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18806 <at> debbugs.gnu.org
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Fri, 24 Oct 2014 12:26:32 +0300
Hi Mr. Eggert,

On Thu, 23 Oct 2014 14:06:32 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 10/23/2014 04:15 AM, Shlomi Fish wrote:
> > internal PCRE error: -32
> >
> > with the latest git grep.
> 
> I am not seeing a problem with that test case on my platform.  I am 
> running Fedora 20 x86-64, and compiled grep with GCC 4.9.1 (which I 
> built myself) and linked with the standard Fedora package 
> pcre-8.33-6.fc20.x86_64.  I also tried building with the Fedora GCC in 
> 32-bit mode, and couldn't reproduce the bug there either.
> 
> Possibly it's a libpcre problem?

I discovered a slightly different test case for it. Try running:

«
`which grep` --color -rP getline grep-test
»

From the command line. See this for a screenshot on Fedora:

* http://www.shlomifish.org/Files/files/images/gnu-grep-on-fedora.png

> 
> I tested with grep commit b2490802defe3c3bf7ef0036a4515d006a08a769 and 
> grep-bug-big5-utf8-from-IO-All commit 
> 9469e6e5be97d631c02bcfdbe814f43d1bb2df56.

I tested with grep commit b2490802defe3c3bf7ef0036a4515d006a08a769 .

Regards,

	Shlomi Fish


-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
http://www.shlomifish.org/humour/bits/facts/Summer-Glau/

Tomorrow never dies, unless Chuck Norris volunteers to take it out of its
misery.
    — http://www.shlomifish.org/humour/bits/facts/Chuck-Norris/

Please reply to list if it's a mailing list post - http://shlom.in/reply .




Information forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Fri, 24 Oct 2014 16:46:02 GMT) Full text and rfc822 format available.

Message #14 received at 18806 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Shlomi Fish <shlomif <at> shlomifish.org>
Cc: 18806 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Sat, 25 Oct 2014 01:45:42 +0900
[Message part 1 (text/plain, inline)]
Shlomi Fish <shlomif <at> shlomifish.org> wrote:
> `which grep` --color -rP getline grep-test

If the -o or --color option is specified, line_end may be less than
validated in the longest exact match search.  As a result,
`search_bytes' is set to a negative value.

I improved the validation of the input buffer in order to fix the bug.
However, it may cause a slowdown.
[0001-grep-improvement-of-validation-for-input-buffer-in-g.patch (text/plain, attachment)]
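
The arithmetic behind this failure can be illustrated with a small shell sketch. The offsets below are hypothetical and are not taken from grep's source; only the names line_end, validated and search_bytes come from the message above. In PCRE 8.x, error code -32 is PCRE_ERROR_BADLENGTH, which pcre_exec returns when it is called with a negative subject length, so a negative search_bytes is consistent with the "internal PCRE error: -32" in the original report.

```shell
#!/bin/sh
# Hypothetical illustration: the offsets are made up, not grep's real values.
# Prints the number of bytes that would be searched between two buffer offsets.
search_bytes() {
  line_end=$1    # offset of the end of the line being searched
  validated=$2   # offset up to which the buffer was already validated
  echo $((line_end - validated))
}

# Normal case: validation has not run past the end of the line.
echo "search_bytes=$(search_bytes 40 25)"

# Failing case: validation already ran past line_end, so the length is negative.
n=$(search_bytes 25 40)
echo "search_bytes=$n"
if [ "$n" -lt 0 ]; then
  echo "negative length: the matcher would have to reject this call"
fi
```

Either fix discussed in this thread avoids handing such a length to the matcher: one by validating the buffer more carefully, the other by disabling the optimization that produces it.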

Information forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Fri, 24 Oct 2014 16:51:02 GMT) Full text and rfc822 format available.

Message #17 received at 18806 <at> debbugs.gnu.org (full text, mbox):

From: Shlomi Fish <shlomif <at> shlomifish.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18806 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Fri, 24 Oct 2014 19:50:26 +0300
On Sat, 25 Oct 2014 01:45:42 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> Shlomi Fish <shlomif <at> shlomifish.org> wrote:
> > `which grep` --color -rP getline grep-test
> 
> If the -o or --color option is specified, line_end may be less than
> validated in the longest exact match search.  As a result,
> `search_bytes' is set to a negative value.
> 
> I improved the validation of the input buffer in order to fix the bug.
> However, it may cause a slowdown.

Thanks for the patch!

Regards,

	-- Shlomi Fish

-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
Star Trek: “We, the Living Dead” - http://shlom.in/st-wtld

Yesterday I asked one of my students if she knew what an encyclopedia is, and
she said: “Is it something like Wikipedia?”.
        — http://twitter.com/alisonclement/status/8421314259

Please reply to list if it's a mailing list post - http://shlom.in/reply .




Information forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Fri, 24 Oct 2014 17:24:01 GMT) Full text and rfc822 format available.

Message #20 received at 18806 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 18806 <at> debbugs.gnu.org
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, Shlomi Fish <shlomif <at> shlomifish.org>
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Sat, 25 Oct 2014 02:23:24 +0900
[Message part 1 (text/plain, inline)]
I updated the patch to add a rule that runs the test.
[0001-grep-improvement-of-validation-for-input-buffer-in-g.patch (text/plain, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Fri, 24 Oct 2014 20:39:03 GMT) Full text and rfc822 format available.

Notification sent to Shlomi Fish <shlomif <at> shlomifish.org>:
bug acknowledged by developer. (Fri, 24 Oct 2014 20:39:04 GMT) Full text and rfc822 format available.

Message #25 received at 18806-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 18806-done <at> debbugs.gnu.org
Cc: Shlomi Fish <shlomif <at> shlomifish.org>
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Fri, 24 Oct 2014 13:38:19 -0700
[Message part 1 (text/plain, inline)]
Thanks for looking into this.  I added that test case, but took a 
more-conservative approach to fixing the bug, by disabling the 
optimization that's causing this problem; please see attached patches.  
The optimization was a hack anyway, and these bugs suggest that it's not 
a hack worth keeping.
[0001-grep-fix-grep-P-crash.patch (text/x-patch, attachment)]
[0002-tests-add-test-for-grep-P-fix.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Fri, 24 Oct 2014 23:59:01 GMT) Full text and rfc822 format available.

Message #28 received at 18806-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18806-done <at> debbugs.gnu.org, Shlomi Fish <shlomif <at> shlomifish.org>
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Sat, 25 Oct 2014 08:58:02 +0900
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> The optimization was a hack anyway, and these bugs suggest that it's
> not a hack worth keeping.

Thanks.  I had improved this hack instead of removing it, but that made
little difference in comparison with your patch, even when `execute' runs
for each character.

$ yes j | head -1000000 >k

(Current master): 
$ time -p src/grep -Po j ../k >/dev/null
real 0.46  user 0.40  sys 0.02

(My patch):
$ time -p src/grep -Po j ../k >/dev/null
real 0.46  user 0.41  sys 0.01





Information forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Sat, 25 Oct 2014 07:23:04 GMT) Full text and rfc822 format available.

Message #31 received at 18806-done <at> debbugs.gnu.org (full text, mbox):

From: Shlomi Fish <shlomif <at> shlomifish.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18806-done <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Sat, 25 Oct 2014 10:22:07 +0300
Hi all,

On Sat, 25 Oct 2014 08:58:02 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > The optimization was a hack anyway, and these bugs suggest that it's
> > not a hack worth keeping.
> 
> Thanks.  I had improved this hack instead of removing it, but that made
> little difference in comparison with your patch, even when `execute' runs
> for each character.
> 

Thank you both for the fix. ♥!

Regards,

	Shlomi Fish


-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
Optimising Code for Speed - http://shlom.in/optimise

Chuck Norris has 99 problems including a bitch.
    — http://www.shlomifish.org/humour/bits/facts/Chuck-Norris/

Please reply to list if it's a mailing list post - http://shlom.in/reply .




Information forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Sat, 25 Oct 2014 18:08:02 GMT) Full text and rfc822 format available.

Message #34 received at 18806 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: 18806 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>, 
 Shlomi Fish <shlomif <at> shlomifish.org>
Cc: 18806-done <at> debbugs.gnu.org, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Sat, 25 Oct 2014 11:06:51 -0700
On Fri, Oct 24, 2014 at 1:38 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Thanks for looking into this.  I added that test case, but took a
> more-conservative approach to fixing the bug, by disabling the optimization
> that's causing this problem; please see attached patches.  The optimization
> was a hack anyway, and these bugs suggest that it's not a hack worth
> keeping.

Hi Paul,
At first I thought "ok, either way."  But then I found that after your change,
our pcre-invalid-utf8-input hangs. That happens because the following
infloops (stuck in pcre_exec) on a CentOS6 system:

  printf 'j\202j\nj\nk\202\n' > in; LC_ALL=en_US.utf8 src/grep -P 'k$' in

That binary was linked with the libpcre from this package:

  pcre-7.8-4.el6.x86_64




Information forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Sat, 25 Oct 2014 23:12:02 GMT) Full text and rfc822 format available.

Message #40 received at 18806 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, 18806 <at> debbugs.gnu.org, 
 Shlomi Fish <shlomif <at> shlomifish.org>
Cc: 18806-done <at> debbugs.gnu.org, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Sat, 25 Oct 2014 16:11:33 -0700
[Message part 1 (text/plain, inline)]
Jim Meyering wrote:
> after your change,
> our pcre-invalid-utf8-input hangs. That happens because the following
> infloops (stuck in pcre_exec) on a CentOS6 system:
>
>    printf 'j\202j\nj\nk\202\n' > in; LC_ALL=en_US.utf8 src/grep -P 'k$' in
>
> That binary was linked with the libpcre from this package:
>
>    pcre-7.8-4.el6.x86_64

I'm getting a failure in pcre-invalid-utf8-input both before and after the 
change, with CentOS 6.5 and pcre-7.8-6.el6.x86_64.  In my case the failures are 
segmentation violations; perhaps 7.8-4 has a different failure mode, or perhaps 
there's some other minor change to your platform that causes libpcre to infloop. 
 Either way, this appears to be a PCRE bug that grep can't be expected to work 
around.

Does the attached patch cause the test to fail reliably for you, instead of looping?

By the way, I'm not sure why tests distinguish between require_en_utf8_locale_ 
and require_compiled_in_MB_support; the latter requires the former, and there's 
no point requiring the former unless we also require the latter.

[pcre.diff (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Sun, 26 Oct 2014 01:25:01 GMT) Full text and rfc822 format available.

Message #46 received at 18806 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18806 <18806 <at> debbugs.gnu.org>, 18806-done <18806-done <at> debbugs.gnu.org>,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Shlomi Fish <shlomif <at> shlomifish.org>
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Sat, 25 Oct 2014 18:24:24 -0700
On Sat, Oct 25, 2014 at 4:11 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> I'm getting a failure in pcre-invalid-utf8-input both before and after the
> change, with CentOS 6.5 and pcre-7.8-6.el6.x86_64.  In my case the failures
> are segmentation violations; perhaps 7.8-4 has a different failure mode, or
> perhaps there's some other minor change to your platform that causes libpcre
> to infloop.  Either way, this appears to be a PCRE bug that grep can't be
> expected to work around.
>
> Does the attached patch cause the test to fail reliably for you, instead of
> looping?

Yes.  And a timeout of 3s should be fine.  Thanks.  Please push that.

I've just built grep against the latest pcre from git (an Oct 10 commit with
this hash: cc48a55a5de9c2103f6657147149bcf63ff61579), and then all
of grep's tests pass.

Ideally, we would detect and warn about inadequate versions of pcre,
but that certainly need not block the release.

> By the way, I'm not sure why tests distinguish between
> require_en_utf8_locale_ and require_compiled_in_MB_support; the latter
> requires the former, and there's no point requiring the former unless we
> also require the latter.

It looks like I added the require_compiled_in_MB_support function in
grep commit v2.9-27-g46e5cc6, yet never realized that it subsumed
require_en_utf8_locale_.  You're welcome to clean up after the release.
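
The idea of detecting and warning about inadequate pcre versions might be sketched in shell as follows. This assumes the pcre-config script that ships with PCRE 8.x is installed. The minimum version used here (8.0) is a placeholder chosen for illustration: the thread only reports problems with pcre 7.8 and success with a then-current git build, and does not name a cutoff. The helper name version_at_least is also illustrative.

```shell
#!/bin/sh
# Succeeds if version HAVE (form "MAJOR.MINOR") is at least WANT.
version_at_least() {
  have_major=${1%%.*}; have_rest=${1#*.}
  want_major=${2%%.*}; want_rest=${2#*.}
  # A version with no dot has no minor part; treat it as minor 0.
  [ "$have_rest" = "$1" ] && have_rest=0
  [ "$want_rest" = "$2" ] && want_rest=0
  # Keep only the leading digits of the minor part (drops suffixes like "-RC1").
  have_minor=${have_rest%%[!0-9]*}
  want_minor=${want_rest%%[!0-9]*}
  if [ "$have_major" -gt "$want_major" ]; then return 0; fi
  if [ "$have_major" -lt "$want_major" ]; then return 1; fi
  [ "${have_minor:-0}" -ge "${want_minor:-0}" ]
}

min=8.0   # placeholder threshold, not a value given in the thread
if command -v pcre-config >/dev/null 2>&1; then
  have=$(pcre-config --version)
  if version_at_least "$have" "$min"; then
    echo "pcre $have: ok"
  else
    echo "warning: pcre $have is older than $min" >&2
  fi
else
  echo "pcre-config not found; cannot check the pcre version" >&2
fi
```

A check like this only compares version numbers; it cannot tell whether a distribution has backported fixes into an older package.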




Information forwarded to bug-grep <at> gnu.org:
bug#18806; Package grep. (Sun, 26 Oct 2014 05:50:01 GMT) Full text and rfc822 format available.

Message #52 received at 18806 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: 18806 <18806 <at> debbugs.gnu.org>, 18806-done <18806-done <at> debbugs.gnu.org>,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Shlomi Fish <shlomif <at> shlomifish.org>
Subject: Re: bug#18806: grep -rP getline crashes prematurely (without
 displaying all results) on invalid UTF-8 input with LC_ALL=en_US.UTF-8
Date: Sat, 25 Oct 2014 22:49:32 -0700
[Message part 1 (text/plain, inline)]
Jim Meyering wrote:
> Yes.  And a timeout of 3s should be fine.  Thanks.  Please push that.

Done, with the attached patch.
[0001-tests-work-around-older-libpcre-bugs-when-testing-P-.patch (text/plain, attachment)]
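
A guard of this kind can be sketched in shell as follows. This is not the attached patch. It is a minimal illustration that reuses Jim's reproduction input and the 3-second limit agreed above, and it assumes the timeout command from GNU coreutils and a grep built with -P support. The function names make_input and run_guarded are illustrative.

```shell
#!/bin/sh
# Build the input from the report: lines containing the stray byte 0x82 (octal 202).
make_input() {
  printf 'j\202j\nj\nk\202\n' > "$1"
}

# Run a command under a 3-second limit and describe how it ended.
run_guarded() {
  timeout 3 "$@" >/dev/null 2>&1
  status=$?
  case $status in
    0)   echo "matched" ;;
    1)   echo "no match" ;;
    124) echo "timed out (possible infinite loop)" ;;
    *)   echo "failed with status $status" ;;
  esac
}

input=$(mktemp)
make_input "$input"
# env sets the locale for grep only, without changing the calling shell.
run_guarded env LC_ALL=en_US.utf8 grep -P 'k$' "$input"
rm -f "$input"
```

With a libpcre that loops on this input, the guard reports a timeout instead of hanging the test run. A crash in libpcre shows up as a failure with a status other than 0, 1 or 124.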

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 23 Nov 2014 12:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 9 years and 156 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.