GNU bug report logs - #16586
grep: infinite loop in grep -P on some files with invalid UTF-8 sequences
Reported by: Santiago <santiago <at> debian.org>
Date: Wed, 29 Jan 2014 09:46:02 UTC
Severity: important
Found in version 2.16
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Forwarded to Philip Hazel <ph10@hermes.cam.ac.uk>
Report forwarded to bug-grep <at> gnu.org: bug#16586; Package grep. (Wed, 29 Jan 2014 09:46:02 GMT)
Acknowledgement sent to Santiago <santiago <at> debian.org>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 29 Jan 2014 09:46:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Package: grep
Version: 2.16
Severity: important
Hi there,
I am forwarding this bug from Debian's BTS. The latest changes to -P introduced
another problem. I've confirmed this behavior with the latest Debian package:
----- Forwarded message from Vincent Lefevre <vincent <at> vinc17.net> -----
[snip]
grep -P loops on some files with invalid UTF-8 sequences, e.g.
$ /usr/bin/printf "\xe9\x65\n\xab\n" | grep -P '.e|.?z' | head
�e
�e
�e
�e
�e
�e
�e
�e
�e
�e
(the infinite loop is interrupted here by a broken pipe due to
the "head").
It seems that the fix of
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=730472
didn't solve all the problems.
-- System Information:
Debian Release: jessie/sid
APT prefers unstable
APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 3.12-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=POSIX, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages grep depends on:
ii dpkg 1.17.6
ii install-info 5.2.0.dfsg.1-2
ii libc6 2.17-97
ii libpcre3 1:8.31-2
grep recommends no packages.
grep suggests no packages.
-- no debconf information
----- End forwarded message -----
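For reference, a minimal Python sketch (not part of the original report) of why both lines of the test input count as invalid UTF-8. It uses Python's strict UTF-8 decoder as a stand-in; grep and libpcre use their own checks, and the helper name invalid_utf8_lines is hypothetical.

```python
# Illustrative only: Python's strict decoder stands in for the UTF-8
# checks that grep and libpcre perform (an assumption of this sketch).

def invalid_utf8_lines(buf: bytes) -> list:
    """Return the 1-based numbers of lines that are not valid UTF-8."""
    lines = buf.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after a trailing newline
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            line.decode("utf-8")  # strict mode raises on encoding errors
        except UnicodeDecodeError:
            bad.append(number)
    return bad


if __name__ == "__main__":
    # Same bytes as the printf command in the report.
    data = b"\xe9\x65\n\xab\n"
    print(invalid_utf8_lines(data))
```

In line 1, 0xE9 is the lead byte of a three-byte sequence but is followed by the ASCII letter "e"; in line 2, 0xAB is a continuation byte with no lead byte before it.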
Information forwarded to bug-grep <at> gnu.org: bug#16586; Package grep. (Mon, 03 Feb 2014 21:35:02 GMT)
Message #8 received at 16586 <at> debbugs.gnu.org (full text, mbox):
On Wed, Jan 29, 2014 at 1:43 AM, Santiago <santiago <at> debian.org> wrote:
> Package: grep
> Version: 2.16
> Severity: important
>
> Hi there,
>
> I am forwarding this bug from Debian's BTS. The latest changes to -P introduced
> another problem. I've confirmed this behavior with the latest Debian package:
>
> ----- Forwarded message from Vincent Lefevre <vincent <at> vinc17.net> -----
>
> [snip]
>
>
> grep -P loops on some files with invalid UTF-8 sequences, e.g.
>
> $ /usr/bin/printf "\xe9\x65\n\xab\n" | grep -P '.e|.?z' | head
> �e
> �e
> �e
> �e
> �e
> �e
> �e
> �e
> �e
> �e
>
> (the infinite loop is interrupted here by a broken pipe due to
> the "head").
>
> It seems that the fix of
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=730472
Thanks for the heads-up. That appears to be a problem with pcre.
I've just built grep (git head) against pcre (git head) with gcc's
address sanitizer mode enabled, and adjusted your example slightly.
Now, libpcre gets an internal segfault:
$ printf "\xe9\n\xab\n" > k; src/grep -P 'e|.?z' k
ASAN:SIGSEGV
=================================================================
==11821==ERROR: AddressSanitizer: SEGV on unknown address 0x62cfffffffff (pc 0x0000004f0743 sp 0x7fff6b32f4a0 bp 0x7fff6b32f760 T0)
#0 0x4f0742 in match /w/co/pcre/pcre_exec.c:5943
#1 0x4f26d5 in pcre_exec /w/co/pcre/pcre_exec.c:6941
#2 0x46f421 in Pexecute /w/co/grep/src/pcresearch.c:178
#3 0x4717a3 in do_execute /w/co/grep/src/main.c:1075
#4 0x4717a3 in grepbuf /w/co/grep/src/main.c:1111
#5 0x472249 in grep /w/co/grep/src/main.c:1222
#6 0x472249 in grepdesc /w/co/grep/src/main.c:1476
#7 0x4073ca in main /w/co/grep/src/main.c:2396
#8 0x7f6f21a53cdc in __libc_start_main (/lib64/libc.so.6+0x1ecdc)
#9 0x408a54 (/w/u/w/co/grep/src/grep+0x408a54)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /w/co/pcre/pcre_exec.c:5943 match
==11821==ABORTING
Sorry, but I don't have time to debug further. A quick glance suggests
it is backing up too far:
(gdb) b __asan_report_error
Breakpoint 1 at 0x448c40: file ../../.././libsanitizer/asan/asan_report.cc, line 711.
(gdb) r
Starting program: /w/u/w/co/grep/src/grep -P e\|.\?z k
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00000000004f0743 in match (eptr=0x62cfffffffff "",
ecode=0x60700000df8a "\035zx",
mstart=0x62d00000b002 "\253\n", '\276' <repeats 198 times>...,
offset_top=2, md=0x7fffffffce30, eptrb=0x0, rdepth=0)
at pcre_exec.c:5943
5943 BACKCHAR(eptr);
(gdb) l
5938 {
5939 if (eptr == pp) goto TAIL_RECURSE;
5940 RMATCH(eptr, ecode, offset_top, md, eptrb, RM46);
5941 if (rrc != MATCH_NOMATCH) RRETURN(rrc);
5942 eptr--;
5943 BACKCHAR(eptr);
5944 if (ctype == OP_ANYNL && eptr > pp && UCHAR21(eptr) == CHAR_NL &&
5945 UCHAR21(eptr - 1) == CHAR_CR) eptr--;
5946 }
5947 }
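To illustrate the "backing up too far" hypothesis, here is a hedged Python sketch of the general technique: stepping backwards over UTF-8 continuation bytes to find the start of a character. The loop is modelled on the usual UTF-8 idiom and is an assumption, not a quotation of pcre_exec.c; the function names are hypothetical. It shows how a back-up loop that trusts its input can walk off the front of the buffer when it meets a stray continuation byte such as the \xab (octal \253) seen in mstart above, and how a bounded variant avoids that.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def back_up_trusting(buf: bytes, pos: int) -> int:
    """Step back to a character start, trusting buf to be valid UTF-8.

    Returns -1 when the walk falls off the front of the buffer. In C the
    equivalent loop has no such guard and would keep reading memory that
    lies before the buffer.
    """
    while pos >= 0 and is_continuation(buf[pos]):
        pos -= 1
    return pos


def back_up_bounded(buf: bytes, pos: int, start: int) -> int:
    """Same walk, but never move before index `start`."""
    while pos > start and is_continuation(buf[pos]):
        pos -= 1
    return pos


if __name__ == "__main__":
    valid = "\u00e9".encode("utf-8")       # b"\xc3\xa9", a two-byte character
    print(back_up_trusting(valid, 1))      # lands on the lead byte
    stray = b"\xab\n"                      # lone continuation byte at index 0
    print(back_up_trusting(stray, 0))      # walks off the front
    print(back_up_bounded(stray, 0, 0))    # stays inside the buffer
```

With valid UTF-8 the trusting loop always stops at a lead byte, so it only misbehaves on input that was never validated.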
Information forwarded to bug-grep <at> gnu.org: bug#16586; Package grep. (Sat, 08 Mar 2014 23:08:02 GMT)
Message #11 received at 16586 <at> debbugs.gnu.org (full text, mbox):
For what it's worth I can't reproduce this bug on Fedora 20 x86-64, even
with valgrind and/or GCC -fsanitize=address. I'm using Fedora
pcre-8.33-4.fc20.x86_64.
Information forwarded to bug-grep <at> gnu.org: bug#16586; Package grep. (Tue, 15 Apr 2014 14:11:01 GMT)
Message #14 received at 16586 <at> debbugs.gnu.org (full text, mbox):
On Sat, Mar 08, 2014 at 03:07:00PM -0800, Paul Eggert wrote:
> For what it's worth I can't reproduce this bug on Fedora 20 x86-64,
> even with valgrind and/or GCC -fsanitize=address. I'm using Fedora
> pcre-8.33-4.fc20.x86_64.
>
Indeed, it was a Debian-pcre-specific bug. The new pcre package (1:8.31-3)
enables JIT regex compilation and solves the issue.
I'm updating grep's dependencies to close this bug in Debian.
Regards,
Santiago
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Tue, 15 Apr 2014 14:50:02 GMT)
Notification sent to Santiago <santiago <at> debian.org>: bug acknowledged by developer. (Tue, 15 Apr 2014 14:50:03 GMT)
Message #19 received at 16586-done <at> debbugs.gnu.org (full text, mbox):
Santiago wrote:
> it was a debian-pcre-specific bug.
Thanks, closing the bug upstream.
Information forwarded to bug-grep <at> gnu.org: bug#16586; Package grep. (Wed, 16 Apr 2014 16:26:01 GMT)
Message #22 received at 16586 <at> debbugs.gnu.org (full text, mbox):
On Tue, Apr 15, 2014 at 7:48 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Santiago wrote:
>> it was a debian-pcre-specific bug.
>
> Thanks, closing the bug upstream.
This bug is still present in upstream libpcre version 8.35.
I wrote a patch for it, posted at http://debbugs.gnu.org/17245#26
and Norihiro forwarded it on to the libpcre bug tracker here:
http://bugs.exim.org/show_bug.cgi?id=1468
Set bug forwarded-to-address to 'Philip Hazel <ph10 <at> hermes.cam.ac.uk>'. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Wed, 16 Apr 2014 17:44:02 GMT)
Did not alter fixed versions and reopened. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 16 Apr 2014 17:48:01 GMT)
Information forwarded to bug-grep <at> gnu.org: bug#16586; Package grep. (Wed, 16 Apr 2014 17:51:02 GMT)
Message #29 received at 16586 <at> debbugs.gnu.org (full text, mbox):
Jim Meyering wrote:
> This bug is still present in upstream libpcre version 8.35.
Ah, sorry, I thought it was Debian-specific. I've reopened grep bug
16586 <http://bugs.gnu.org/16586>, and have forwarded it to Philip
Hazel, who currently has the PCRE bug assigned, according to
<http://bugs.exim.org/1468>.
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Mon, 21 Apr 2014 18:04:02 GMT)
Notification sent to Santiago <santiago <at> debian.org>: bug acknowledged by developer. (Mon, 21 Apr 2014 18:04:03 GMT)
Message #34 received at 16586-done <at> debbugs.gnu.org (full text, mbox):
On 04/16/2014 05:13 AM, Norihiro Tanaka wrote:
> http://bugs.exim.org/show_bug.cgi?id=1468
Thanks. The response there makes it clear that if grep passes arbitrary
binary data to PCRE, and if grep uses PCRE_NO_UTF8_CHECK, undefined
behavior will result (maybe infinite loop, core dump, etc.). We can't
have undefined behavior in grep. A simple fix is to avoid using
PCRE_NO_UTF8_CHECK so I installed the attached patch to do that.
Perhaps we can think of a better way at some point. In the meantime I'm
taking the liberty of closing Bug#17245 and Bug#16586.
[0001-grep-P-now-rejects-invalid-input-sequences-in-UTF-8-.patch (text/x-patch, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#16586; Package grep. (Thu, 24 Apr 2014 02:32:01 GMT)
Message #37 received at 16586 <at> debbugs.gnu.org (full text, mbox):
On Mon, Apr 21, 2014 at 11:03 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 04/16/2014 05:13 AM, Norihiro Tanaka wrote:
>>
>> http://bugs.exim.org/show_bug.cgi?id=1468
>
>
> Thanks. The response there makes it clear that if grep passes arbitrary
> binary data to PCRE, and if grep uses PCRE_NO_UTF8_CHECK, undefined behavior
> will result (maybe infinite loop, core dump, etc.). We can't have undefined
> behavior in grep. A simple fix is to avoid using PCRE_NO_UTF8_CHECK so I
> installed the attached patch to do that. Perhaps we can think of a better
> way at some point. In the meantime I'm taking the liberty of closing
> Bug#17245 and Bug#16586.
Thanks for the patch, but I'm not sure I like the consequences:
that anyone using grep -P to search data that is even a tiny bit
inconsistent with their UTF-8 locale will now get an exit status of
2 rather than the matches they used to get. I would prefer to test for
working PCRE support and disable -P if it is deemed inadequate,
but that may have to wait for the release of a new version of
libpcre.
In any case, I found that this additional change is required,
at least on OS/X, to avoid a test failure:
[k.txt (text/plain, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#16586; Package grep. (Thu, 24 Apr 2014 02:32:02 GMT)
Information forwarded to bug-grep <at> gnu.org: bug#16586; Package grep. (Thu, 24 Apr 2014 05:40:02 GMT)
Message #43 received at 16586 <at> debbugs.gnu.org (full text, mbox):
Jim Meyering wrote:
> anyone using grep -P to search data that is even a tiny bit
> inconsistent with their UTF-8 locale will now get an exit status of
> 2 rather than the matches they used to get.
Yes, I don't like that either, but <http://bugs.exim.org/1468> says
libpcre intends to have undefined behavior here. If so, it wouldn't
help to wait until the next libpcre release, which may well have a
serious bug of this form in a different area, a bug that's not easy to
test for.
Perhaps somebody should modify grep -P to discard input lines containing
non-UTF-8 data instead of presenting them to libpcre. That way, it
would be safe for grep -P to use PCRE_NO_UTF8_CHECK. Although grep -P
should report an error and exit with status 2 if it discards input due
to encoding errors, it can also report matches in lines that do not
contain encoding errors, so that users can see both the error messages
and the matches.
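The proposal above can be sketched as follows. This is a hedged Python illustration, not grep's implementation: Python's re module stands in for libpcre, strict UTF-8 decoding stands in for grep's validity check, and the name grep_valid_lines is hypothetical. Lines with encoding errors are discarded and reported, matches in the remaining lines are still returned, and the status is 2 whenever anything was discarded.

```python
import re
import sys


def grep_valid_lines(pattern: str, data: bytes):
    """Match only lines that decode as UTF-8; return (matches, exit_status).

    The status follows grep's convention: 0 = a line matched, 1 = no match,
    2 = error (here: at least one line was discarded as invalid UTF-8).
    """
    regex = re.compile(pattern)  # stand-in for the PCRE matcher (assumption)
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after a trailing newline
    matches = []
    discarded = 0
    for number, raw in enumerate(lines, start=1):
        try:
            text = raw.decode("utf-8")  # strict: raises on encoding errors
        except UnicodeDecodeError:
            discarded += 1
            print(f"line {number}: invalid UTF-8, skipped", file=sys.stderr)
            continue
        if regex.search(text):
            matches.append(text)
    if discarded:
        return matches, 2
    return matches, (0 if matches else 1)


if __name__ == "__main__":
    found, status = grep_valid_lines(r".e|.?z", b"\xe9\x65\n\xab\nhello\nabc\n")
    for line in found:
        print(line)
    sys.exit(status)
```

Note that each valid line is scanned twice in this sketch, once by the decoder and once by the matcher.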
Information forwarded to bug-grep <at> gnu.org: bug#16586; Package grep. (Thu, 24 Apr 2014 15:30:02 GMT)
Message #46 received at 16586 <at> debbugs.gnu.org (full text, mbox):
On Wed, Apr 23, 2014 at 10:39 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Jim Meyering wrote:
>>
>> anyone using grep -P to search data that is even a tiny bit
>> inconsistent with their UTF-8 locale will now get an exit status of
>> 2 rather than the matches they used to get.
>
>
> Yes, I don't like that either, but <http://bugs.exim.org/1468> says libpcre
Oh! I had not read that. That is disappointing.
> intends to have undefined behavior here. If so, it wouldn't help to wait
> until the next libpcre release, which may well have a serious bug of this
> form in a different area, a bug that's not easy to test for.
Indeed.
> Perhaps somebody should modify grep -P to discard input lines containing
> non-UTF-8 data instead of presenting them to libpcre. That way, it would be
> safe for grep -P to use PCRE_NO_UTF8_CHECK. Although grep -P should report
> an error and exit with status 2 if it discards input due to encoding errors,
> it can also report matches in lines that do not contain encoding errors, so
> that users can see both the error messages and the matches.
That sounds reasonable, but I don't like the requirement that
one make two passes over each subject text.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 23 May 2014 11:24:05 GMT)