GNU bug report logs - #16586
grep: infinite loop in grep -P on some files with invalid UTF-8 sequences

Package: grep

Reported by: Santiago <santiago <at> debian.org>

Date: Wed, 29 Jan 2014 09:46:02 UTC

Severity: important

Found in version 2.16

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Forwarded to Philip Hazel <ph10@hermes.cam.ac.uk>

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16586 in the body.
You can then email your comments to 16586 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#16586; Package grep. (Wed, 29 Jan 2014 09:46:02 GMT)

Acknowledgement sent to Santiago <santiago <at> debian.org>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 29 Jan 2014 09:46:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Santiago <santiago <at> debian.org>
To: submit <at> debbugs.gnu.org
Subject: grep: infinite loop in grep -P on some files with invalid UTF-8
 sequences
Date: Wed, 29 Jan 2014 10:43:46 +0100
Package: grep
Version: 2.16
Severity: important

Hi there,

I am forwarding this bug from Debian's BTS. The latest changes to -P
introduced another problem. I've confirmed this behavior with the latest
Debian package:

----- Forwarded message from Vincent Lefevre <vincent <at> vinc17.net> -----

[snip]


grep -P loops on some files with invalid UTF-8 sequences, e.g.

$ /usr/bin/printf "\xe9\x65\n\xab\n" | grep -P '.e|.?z' | head
�e
�e
�e
�e
�e
�e
�e
�e
�e
�e

(the infinite loop is interrupted here by a broken pipe due to
the "head").

It seems that the fix for

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=730472

didn't solve all the problems.

-- System Information:
Debian Release: jessie/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 3.12-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=POSIX, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages grep depends on:
ii  dpkg          1.17.6
ii  install-info  5.2.0.dfsg.1-2
ii  libc6         2.17-97
ii  libpcre3      1:8.31-2

grep recommends no packages.

grep suggests no packages.

-- no debconf information


----- End forwarded message -----




Information forwarded to bug-grep <at> gnu.org:
bug#16586; Package grep. (Mon, 03 Feb 2014 21:35:02 GMT)

Message #8 received at 16586 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Santiago <santiago <at> debian.org>
Cc: 16586 <at> debbugs.gnu.org
Subject: Re: bug#16586: grep: infinite loop in grep -P on some files with
 invalid UTF-8 sequences
Date: Mon, 3 Feb 2014 13:34:14 -0800
On Wed, Jan 29, 2014 at 1:43 AM, Santiago <santiago <at> debian.org> wrote:
> Package: grep
> Version: 2.16
> Severity: important
>
> Hi there,
>
> I am forwarding this bug from Debian's BTS. The latest changes to -P
> introduced another problem. I've confirmed this behavior with the latest
> Debian package:
>
> ----- Forwarded message from Vincent Lefevre <vincent <at> vinc17.net> -----
>
> [snip]
>
>
> grep -P loops on some files with invalid UTF-8 sequences, e.g.
>
> $ /usr/bin/printf "\xe9\x65\n\xab\n" | grep -P '.e|.?z' | head
> �e
> �e
> �e
> �e
> �e
> �e
> �e
> �e
> �e
> �e
>
> (the infinite loop is interrupted here by a broken pipe due to
> the "head").
>
> It seems that the fix for
>
>   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=730472

Thanks for the heads-up.  That appears to be a problem with pcre.
I've just built grep (git head) against pcre (git head) using gcc's
address sanitizer mode, and adjusted your example slightly.
Now, libpcre gets an internal segfault:

$ printf "\xe9\n\xab\n" > k; src/grep -P 'e|.?z' k
ASAN:SIGSEGV
=================================================================
==11821==ERROR: AddressSanitizer: SEGV on unknown address 0x62cfffffffff (pc 0x0000004f0743 sp 0x7fff6b32f4a0 bp 0x7fff6b32f760 T0)
    #0 0x4f0742 in match /w/co/pcre/pcre_exec.c:5943
    #1 0x4f26d5 in pcre_exec /w/co/pcre/pcre_exec.c:6941
    #2 0x46f421 in Pexecute /w/co/grep/src/pcresearch.c:178
    #3 0x4717a3 in do_execute /w/co/grep/src/main.c:1075
    #4 0x4717a3 in grepbuf /w/co/grep/src/main.c:1111
    #5 0x472249 in grep /w/co/grep/src/main.c:1222
    #6 0x472249 in grepdesc /w/co/grep/src/main.c:1476
    #7 0x4073ca in main /w/co/grep/src/main.c:2396
    #8 0x7f6f21a53cdc in __libc_start_main (/lib64/libc.so.6+0x1ecdc)
    #9 0x408a54 (/w/u/w/co/grep/src/grep+0x408a54)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /w/co/pcre/pcre_exec.c:5943 match
==11821==ABORTING

Sorry, but I don't have time to debug further.  A quick glance suggests
it is backing up too far:

(gdb) b __asan_report_error
Breakpoint 1 at 0x448c40: file ../../.././libsanitizer/asan/asan_report.cc, line 711.
(gdb) r
Starting program: /w/u/w/co/grep/src/grep -P e\|.\?z k
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00000000004f0743 in match (eptr=0x62cfffffffff "", ecode=0x60700000df8a "\035zx",
    mstart=0x62d00000b002 "\253\n", '\276' <repeats 198 times>..., offset_top=2, md=0x7fffffffce30, eptrb=0x0, rdepth=0)
    at pcre_exec.c:5943
5943              BACKCHAR(eptr);
(gdb) l
5938              {
5939              if (eptr == pp) goto TAIL_RECURSE;
5940              RMATCH(eptr, ecode, offset_top, md, eptrb, RM46);
5941              if (rrc != MATCH_NOMATCH) RRETURN(rrc);
5942              eptr--;
5943              BACKCHAR(eptr);
5944              if (ctype == OP_ANYNL && eptr > pp  && UCHAR21(eptr) == CHAR_NL &&
5945                  UCHAR21(eptr - 1) == CHAR_CR) eptr--;
5946              }
5947            }
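
One plausible reading of the listing above, offered as an assumption rather
than a diagnosis: if BACKCHAR steps back over UTF-8 continuation bytes
(binary 10xxxxxx) and the byte at pp is itself a stray continuation byte
such as \xab, eptr can end up below pp. After that, the test eptr == pp on
line 5939 never succeeds and the loop keeps decrementing until it leaves
the buffer. The sketch below models a back-up step with an explicit lower
bound. The names is_continuation and backchar_bounded are hypothetical, and
this is neither PCRE's code nor the patch later posted for this bug.

```c
#include <stddef.h>

/* Is this byte a UTF-8 continuation byte (binary 10xxxxxx)? */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical bounded back-up: move p back to the first byte of the
   character that contains it, but never below `floor`.  Without the
   `p > floor` test, a run of continuation bytes reaching back to the
   start of the range would carry p out of bounds. */
static const unsigned char *
backchar_bounded(const unsigned char *floor, const unsigned char *p)
{
    while (p > floor && is_continuation(*p))
        p--;
    return p;
}
```

With well-formed input the bound is never reached, because every
continuation byte is preceded by a lead byte. The bound only matters for
malformed input such as the two-line test file used above.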




Information forwarded to bug-grep <at> gnu.org:
bug#16586; Package grep. (Sat, 08 Mar 2014 23:08:02 GMT)

Message #11 received at 16586 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 16586 <at> debbugs.gnu.org
Subject: Re: grep: infinite loop in grep -P on some files with invalid UTF-8
 sequences
Date: Sat, 08 Mar 2014 15:07:00 -0800
For what it's worth, I can't reproduce this bug on Fedora 20 x86-64, even
with valgrind and/or GCC -fsanitize=address.  I'm using Fedora
pcre-8.33-4.fc20.x86_64.




Information forwarded to bug-grep <at> gnu.org:
bug#16586; Package grep. (Tue, 15 Apr 2014 14:11:01 GMT)

Message #14 received at 16586 <at> debbugs.gnu.org:

From: Santiago <santiago <at> debian.org>
To: 736919 <at> bugs.debian.org, Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16586 <at> debbugs.gnu.org
Subject: Re: grep: infinite loop in grep -P on some files with invalid UTF-8
 sequences
Date: Tue, 15 Apr 2014 16:10:29 +0200
On Sat, Mar 08, 2014 at 03:07:00PM -0800, Paul Eggert wrote:
> For what it's worth, I can't reproduce this bug on Fedora 20 x86-64,
> even with valgrind and/or GCC -fsanitize=address.  I'm using Fedora
> pcre-8.33-4.fc20.x86_64.
> 

Indeed, it was a Debian-pcre-specific bug. The new pcre package (1:8.31-3)
enables JIT regex compilation and solves the issue.

I'm updating grep's dependencies to close this bug in Debian.

Regards,

Santiago




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Tue, 15 Apr 2014 14:50:02 GMT)

Notification sent to Santiago <santiago <at> debian.org>:
bug acknowledged by developer. (Tue, 15 Apr 2014 14:50:03 GMT)

Message #19 received at 16586-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Santiago <santiago <at> debian.org>, 736919 <at> bugs.debian.org
Cc: 16586-done <at> debbugs.gnu.org
Subject: Re: bug#16586: grep: infinite loop in grep -P on some files with
 invalid UTF-8 sequences
Date: Tue, 15 Apr 2014 07:48:51 -0700
Santiago wrote:
> it was a debian-pcre-specific bug.

Thanks, closing the bug upstream.




Information forwarded to bug-grep <at> gnu.org:
bug#16586; Package grep. (Wed, 16 Apr 2014 16:26:01 GMT)

Message #22 received at 16586 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: 16586 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>, 
 Santiago <santiago <at> debian.org>, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 736919 <at> bugs.debian.org
Subject: Re: bug#16586: grep: infinite loop in grep -P on some files with
 invalid UTF-8 sequences
Date: Wed, 16 Apr 2014 09:24:33 -0700
On Tue, Apr 15, 2014 at 7:48 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Santiago wrote:
>> it was a debian-pcre-specific bug.
>
> Thanks, closing the bug upstream.

This bug is still present in upstream libpcre version 8.35.
I wrote a patch for it, posted at http://debbugs.gnu.org/17245#26
and Norihiro forwarded it on to the libpcre bug tracker here:
http://bugs.exim.org/show_bug.cgi?id=1468




Set bug forwarded-to-address to 'Philip Hazel <ph10 <at> hermes.cam.ac.uk>'. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Wed, 16 Apr 2014 17:44:02 GMT)

Did not alter fixed versions and reopened. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 16 Apr 2014 17:48:01 GMT)

Information forwarded to bug-grep <at> gnu.org:
bug#16586; Package grep. (Wed, 16 Apr 2014 17:51:02 GMT)

Message #29 received at 16586 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, 16586 <at> debbugs.gnu.org, 
 Santiago <santiago <at> debian.org>, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 736919 <at> bugs.debian.org
Subject: Re: bug#16586: grep: infinite loop in grep -P on some files with
 invalid UTF-8 sequences
Date: Wed, 16 Apr 2014 10:50:20 -0700
Jim Meyering wrote:
> This bug is still present in upstream libpcre version 8.35.

Ah, sorry, I thought it was Debian-specific.  I've reopened grep bug 
16586 <http://bugs.gnu.org/16586>, and have forwarded it to Philip 
Hazel, who currently has the PCRE bug assigned, according to 
<http://bugs.exim.org/1468>.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 21 Apr 2014 18:04:02 GMT)

Notification sent to Santiago <santiago <at> debian.org>:
bug acknowledged by developer. (Mon, 21 Apr 2014 18:04:03 GMT)

Message #34 received at 16586-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 17245-done <at> debbugs.gnu.org, 
 16586-done <at> debbugs.gnu.org
Subject: Re: bug#17245: GREP BUG: grep -P and binary files
Date: Mon, 21 Apr 2014 11:03:10 -0700
On 04/16/2014 05:13 AM, Norihiro Tanaka wrote:
> http://bugs.exim.org/show_bug.cgi?id=1468 

Thanks.  The response there makes it clear that if grep passes arbitrary 
binary data to PCRE, and if grep uses PCRE_NO_UTF8_CHECK, undefined 
behavior will result (maybe infinite loop, core dump, etc.).  We can't 
have undefined behavior in grep.  A simple fix is to avoid using 
PCRE_NO_UTF8_CHECK so I installed the attached patch to do that.  
Perhaps we can think of a better way at some point.  In the meantime I'm 
taking the liberty of closing Bug#17245 and Bug#16586.
[0001-grep-P-now-rejects-invalid-input-sequences-in-UTF-8-.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16586; Package grep. (Thu, 24 Apr 2014 02:32:01 GMT)

Message #37 received at 16586 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: 16586 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>, 
 Santiago <santiago <at> debian.org>
Cc: 17245-done <at> debbugs.gnu.org, Norihiro Tanaka <noritnk <at> kcn.ne.jp>,
 16586-done <at> debbugs.gnu.org
Subject: Re: bug#16586: bug#17245: GREP BUG: grep -P and binary files
Date: Wed, 23 Apr 2014 19:30:46 -0700
On Mon, Apr 21, 2014 at 11:03 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 04/16/2014 05:13 AM, Norihiro Tanaka wrote:
>>
>> http://bugs.exim.org/show_bug.cgi?id=1468
>
>
> Thanks.  The response there makes it clear that if grep passes arbitrary
> binary data to PCRE, and if grep uses PCRE_NO_UTF8_CHECK, undefined behavior
> will result (maybe infinite loop, core dump, etc.).  We can't have undefined
> behavior in grep.  A simple fix is to avoid using PCRE_NO_UTF8_CHECK so I
> installed the attached patch to do that.  Perhaps we can think of a better
> way at some point.  In the meantime I'm taking the liberty of closing
> Bug#17245 and Bug#16586.

Thanks for the patch, but I'm not sure I like the consequences:
that anyone using grep -P to search data that is even a tiny bit
inconsistent with their UTF-8 locale will now get an exit status of
2 rather than the matches they used to get. I would prefer to test for
working PCRE support and disable -P if it is deemed inadequate,
but that may have to wait for the release of a new version of
libpcre.

In any case, I found that this additional change is required,
at least on OS/X, to avoid a test failure:
[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16586; Package grep. (Thu, 24 Apr 2014 02:32:02 GMT)

Information forwarded to bug-grep <at> gnu.org:
bug#16586; Package grep. (Thu, 24 Apr 2014 05:40:02 GMT)

Message #43 received at 16586 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, 16586 <at> debbugs.gnu.org, 
 Santiago <santiago <at> debian.org>
Cc: 17245 <at> debbugs.gnu.org, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#16586: bug#17245: GREP BUG: grep -P and binary files
Date: Wed, 23 Apr 2014 22:39:10 -0700
Jim Meyering wrote:
> anyone using grep -P to search data that is even a tiny bit
> inconsistent with their UTF-8 locale will now get an exit status of
> 2 rather than the matches they used to get.

Yes, I don't like that either, but <http://bugs.exim.org/1468> says
libpcre intends to have undefined behavior here.  If so, it wouldn't
help to wait until the next libpcre release, which may well have a
serious bug of this form in a different area, a bug that's not easy to
test for.

Perhaps somebody should modify grep -P to discard input lines containing
non-UTF-8 data instead of presenting them to libpcre.  That way, it
would be safe for grep -P to use PCRE_NO_UTF8_CHECK.  Although grep -P
should report an error and exit with status 2 if it discards input due
to encoding errors, it can also report matches in lines that do not
contain encoding errors, so that users can see both the error messages
and the matches.
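
The per-line check proposed above could look roughly like the following.
This is a minimal sketch, not code from grep or libpcre: the function name
line_is_valid_utf8 is hypothetical, and the rules it enforces (no stray or
missing continuation bytes, no overlong forms, no surrogates, nothing above
U+10FFFF) are the usual well-formedness rules for UTF-8, assumed here to be
sufficient for input passed with PCRE_NO_UTF8_CHECK.

```c
#include <stddef.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
static int line_is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        size_t need;          /* continuation bytes that must follow */
        unsigned long min;    /* smallest code point legal at this length */
        unsigned long cp;     /* code point being assembled */

        if (b < 0x80) {       /* ASCII: one byte, always valid */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; min = 0x80;    cp = b & 0x1F;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; min = 0x800;   cp = b & 0x0F;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; min = 0x10000; cp = b & 0x07;
        } else {
            return 0;         /* stray continuation byte or invalid lead */
        }

        if (len - i <= need)
            return 0;         /* sequence truncated by end of line */

        for (size_t k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return 0;     /* expected 10xxxxxx */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return 0;         /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;         /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return 0;         /* beyond the Unicode range */

        i += need + 1;
    }
    return 1;
}
```

Both lines of the original reproducer fail this check: in "\xe9\x65" the
lead byte \xe9 announces a three-byte sequence that is never completed, and
"\xab" is a continuation byte with no lead byte. A caller would skip such
lines, record that an encoding error was seen, and pass only the remaining
lines to the matcher.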





Information forwarded to bug-grep <at> gnu.org:
bug#16586; Package grep. (Thu, 24 Apr 2014 15:30:02 GMT)

Message #46 received at 16586 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17245 <17245 <at> debbugs.gnu.org>, Santiago <santiago <at> debian.org>,
 16586 <at> debbugs.gnu.org, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#16586: bug#17245: GREP BUG: grep -P and binary files
Date: Thu, 24 Apr 2014 08:29:07 -0700
On Wed, Apr 23, 2014 at 10:39 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Jim Meyering wrote:
>>
>> anyone using grep -P to search data that is even a tiny bit
>> inconsistent with their UTF-8 locale will now get an exit status of
>> 2 rather than the matches they used to get.
>
>
> Yes, I don't like that either, but <http://bugs.exim.org/1468> says libpcre

Oh! I had not read that. That is disappointing.

> intends to have undefined behavior here.  If so, it wouldn't help to wait
> until the next libpcre release, which may well have a serious bug of this
> form in a different area, a bug that's not easy to test for.

Indeed.

> Perhaps somebody should modify grep -P to discard input lines containing
> non-UTF-8 data instead of presenting them to libpcre.  That way, it would be
> safe for grep -P to use PCRE_NO_UTF8_CHECK.  Although grep -P should report
> an error and exit with status 2 if it discards input due to encoding errors,
> it can also report matches in lines that do not contain encoding errors, so
> that users can see both the error messages and the matches.

That sounds reasonable, but I don't like the requirement that
one make two passes over each subject text.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 23 May 2014 11:24:05 GMT)


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.