GNU bug report logs - #44351
Bug in grep v3.2 onwards in regular expression matching

Previous Next

Package: grep;

Reported by: Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>

Date: Sat, 31 Oct 2020 16:17:02 UTC

Severity: normal

Merged with 44352

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 44351 in the body.
You can then email your comments to 44351 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sat, 31 Oct 2020 16:17:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 31 Oct 2020 16:17:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Bug in grep v3.2 onwards in regular expression matching
Date: Sat, 31 Oct 2020 14:14:55 +0100
[Message part 1 (text/plain, inline)]
Hello,

  While using GNU grep v3.4 in an Ubuntu 20.04 userspace running on top of
Win10 WSL (yeah, i know... but also checked in other envs) i discovered
what seems like an obvious bug (if i'm not mistaken).
  The bug:
-----
me <at> host:~$  echo 'xxxxy' |grep -E '^x+x+x+x+y$'
xxxxy
me <at> host:~$  echo 'xxxy' |grep -E '^x+x+x+x+y$'
xxxy
me <at> host:~$  echo 'xxy' |grep -E '^x+x+x+x+y$'
xxy
me <at> host:~$  echo 'xy' |grep -E '^x+x+x+x+y$'

----
...the terminal supports ansi color escapes, and what's really weird is
that only the result from the first command is colored in red. First and
fourth commands yield correct results; the second and third do not, as they
should not match it's input.

  I've tested releases from v3.1 to latest v3.5 and found the anomalous
behaviour in version v3.2 through v3.5. A (quick and clunky) git bisect led
me to believe it was introduced about two years ago, possibly in commit
123620af88f55c3e0cc9f0aed7311c72f625bc82 (
https://git.savannah.gnu.org/cgit/grep.git/commit/?id=123620af88f55c3e0cc9f0aed7311c72f625bc82).
If this is true, it would mean either the bug is in gnulib, or maybe grep
needed to do some kind of extra handling on it's side.

Kind regards. Gonzalo Padrino.

P.S.: I had to patch some things in order to successfully compile the code
after checking out some problematic commits (pragmas to avoid warnings
about "pure" and "noreturn" function attributes, a missing configmake
dependency in bootstrap.conf, etc ).
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sat, 31 Oct 2020 16:48:02 GMT) Full text and rfc822 format available.

Message #8 received at 44351 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: grimalg.on+gnu <at> gmail.com, 44351 <at> debbugs.gnu.org
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Sat, 31 Oct 2020 10:47:22 -0600
I can reproduce this is gawk. :-(

It's a bug somewhere in the dfa matcher. When I export GAWK_NO_DFA=1
to bypass the dfa matcher, only xxxxy matches.

Hope this helps,

Arnold

Gonzalo Padrino <grimalg.on+gnu <at> gmail.com> wrote:

> Hello,
>
>   While using GNU grep v3.4 in an Ubuntu 20.04 userspace running on top of
> Win10 WSL (yeah, i know... but also checked in other envs) i discovered
> what seems like an obvious bug (if i'm not mistaken).
>   The bug:
> -----
> me <at> host:~$  echo 'xxxxy' |grep -E '^x+x+x+x+y$'
> xxxxy
> me <at> host:~$  echo 'xxxy' |grep -E '^x+x+x+x+y$'
> xxxy
> me <at> host:~$  echo 'xxy' |grep -E '^x+x+x+x+y$'
> xxy
> me <at> host:~$  echo 'xy' |grep -E '^x+x+x+x+y$'
>
> ----
> ...the terminal supports ansi color escapes, and what's really weird is
> that only the result from the first command is colored in red. First and
> fourth commands yield correct results; the second and third do not, as they
> should not match it's input.
>
>   I've tested releases from v3.1 to latest v3.5 and found the anomalous
> behaviour in version v3.2 through v3.5. A (quick and clunky) git bisect led
> me to believe it was introduced about two years ago, possibly in commit
> 123620af88f55c3e0cc9f0aed7311c72f625bc82 (
> https://git.savannah.gnu.org/cgit/grep.git/commit/?id=123620af88f55c3e0cc9f0aed7311c72f625bc82).
> If this is true, it would mean either the bug is in gnulib, or maybe grep
> needed to do some kind of extra handling on it's side.
>
> Kind regards. Gonzalo Padrino.
>
> P.S.: I had to patch some things in order to successfully compile the code
> after checking out some problematic commits (pragmas to avoid warnings
> about "pure" and "noreturn" function attributes, a missing configmake
> dependency in bootstrap.conf, etc ).




Information forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sat, 31 Oct 2020 17:31:02 GMT) Full text and rfc822 format available.

Message #11 received at 44351 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>
Cc: 44351 <at> debbugs.gnu.org
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Sat, 31 Oct 2020 10:30:34 -0700
On Sat, Oct 31, 2020 at 9:17 AM Gonzalo Padrino
<grimalg.on+gnu <at> gmail.com> wrote:
> While using GNU grep v3.4 in an Ubuntu 20.04 userspace running on top of
> Win10 WSL (yeah, i know... but also checked in other envs) i discovered
> what seems like an obvious bug (if i'm not mistaken).
>   The bug:
> -----
> me <at> host:~$  echo 'xxxxy' |grep -E '^x+x+x+x+y$'
> xxxxy
> me <at> host:~$  echo 'xxxy' |grep -E '^x+x+x+x+y$'
> xxxy
> me <at> host:~$  echo 'xxy' |grep -E '^x+x+x+x+y$'
> xxy
> me <at> host:~$  echo 'xy' |grep -E '^x+x+x+x+y$'
>
> ----
> ...the terminal supports ansi color escapes, and what's really weird is
> that only the result from the first command is colored in red. First and
> fourth commands yield correct results; the second and third do not, as they
> should not match it's input.
>
>   I've tested releases from v3.1 to latest v3.5 and found the anomalous
> behaviour in version v3.2 through v3.5. A (quick and clunky) git bisect led
> me to believe it was introduced about two years ago, possibly in commit
> 123620af88f55c3e0cc9f0aed7311c72f625bc82 (
> https://git.savannah.gnu.org/cgit/grep.git/commit/?id=123620af88f55c3e0cc9f0aed7311c72f625bc82).
> If this is true, it would mean either the bug is in gnulib, or maybe grep
> needed to do some kind of extra handling on it's side.

Thank you for reporting that. I confirm this is a bug in the very latest.
This mistakenly matches:
  $ echo xxy |grep -E '^x+x+x+y$'
  xxy

That regular expression requires that any match have at least three
leading 'x's.

This is indeed due to a bug in gnulib's lib/dfa.c.

So far, I've found that we can band-aid fix it by disabling part of
merge_nfa_state's optimizations with this patch, but I do not propose
to make this change. This is just to show where the problem lies. I'm
pretty sure we can retain and correct the optimization.

diff --git a/lib/dfa.c b/lib/dfa.c
index 74aafa2ee..087c266c5 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -2459,7 +2459,7 @@ merge_nfa_state (struct dfa *d, idx_t tindex, char *flags,
                 continue;

               if (flags[sindex] & OPT_REPEAT)
-                delete (sindex, &follows[sindex]);
+                continue;

               merge2 (&follows[dindex], &follows[sindex], merged);




Merged 44351 44352. Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Sat, 31 Oct 2020 18:51:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sun, 01 Nov 2020 07:43:02 GMT) Full text and rfc822 format available.

Message #16 received at 44351 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 44351 <at> debbugs.gnu.org, Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Sun, 01 Nov 2020 16:42:21 +0900
[Message part 1 (text/plain, inline)]
Example,

  a+a+a
  1 2 3

position 1 has a repetition of "a" and other transition with "a".
position 2 has a repetition of "a" and other transition with "a", too.
Then DFA was merging the two nodes, but it is wrong.

Now similar nodes in series are not merged.
[0001-dfa-remain-similar-nodes-in-series-in-optimization.patch (text/plain, attachment)]
[0001-tests-add-the-test-for-bugfix-in-gnulib-s-dfa.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sun, 01 Nov 2020 15:00:02 GMT) Full text and rfc822 format available.

Message #19 received at 44351 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: noritnk <at> kcn.ne.jp, jim <at> meyering.net
Cc: 44351 <at> debbugs.gnu.org, grimalg.on+gnu <at> gmail.com
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Sun, 01 Nov 2020 07:59:08 -0700
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> Example,
>
>   a+a+a
>   1 2 3
>
> position 1 has a repetition of "a" and other transition with "a".
> position 2 has a repetition of "a" and other transition with "a", too.
> Then DFA was merging the two nodes, but it is wrong.
>
> Now similar nodes in series are not merged.

This patch works for me in gawk.

Looking forward to seeing it in Gnulib so that I can pull it in

Thanks!

Arnold




Information forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sun, 01 Nov 2020 15:21:02 GMT) Full text and rfc822 format available.

Message #22 received at 44351 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 44351 <at> debbugs.gnu.org, Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Sun, 1 Nov 2020 07:19:49 -0800
On Sun, Nov 1, 2020 at 12:42 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> Example,
>
>   a+a+a
>   1 2 3
>
> position 1 has a repetition of "a" and other transition with "a".
> position 2 has a repetition of "a" and other transition with "a", too.
> Then DFA was merging the two nodes, but it is wrong.
>
> Now similar nodes in series are not merged.

Thank you for the quick work.
Would you please send a revised test patch? That one appears to be a
tiny delta added to a file that only you have locally. I.e., it
requires the new file tests/ere.tests, with 200+ lines.




Information forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sun, 01 Nov 2020 15:23:02 GMT) Full text and rfc822 format available.

Message #25 received at 44351 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 44351 <at> debbugs.gnu.org, Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Sun, 1 Nov 2020 07:22:39 -0800
On Sun, Nov 1, 2020 at 7:19 AM Jim Meyering <jim <at> meyering.net> wrote:
>
> On Sun, Nov 1, 2020 at 12:42 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > Example,
> >
> >   a+a+a
> >   1 2 3
> >
> > position 1 has a repetition of "a" and other transition with "a".
> > position 2 has a repetition of "a" and other transition with "a", too.
> > Then DFA was merging the two nodes, but it is wrong.
> >
> > Now similar nodes in series are not merged.
>
> Thank you for the quick work.
> Would you please send a revised test patch? That one appears to be a
> tiny delta added to a file that only you have locally. I.e., it
> requires the new file tests/ere.tests, with 200+ lines.

Oops. My mistake. I thought you were adding the test in gnulib and
hadded a new framework there.
The tests/ere.tests file is in grep. No problem. Thanks again.




Information forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sun, 01 Nov 2020 15:32:02 GMT) Full text and rfc822 format available.

Message #28 received at 44351 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 44351 <at> debbugs.gnu.org, Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Mon, 02 Nov 2020 00:30:55 +0900
Hi,

By the way, I was wondering whether to add the test to ere.tests or
spencer1.tests or to a new file.  How should they be used properly?





Information forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sun, 01 Nov 2020 16:04:01 GMT) Full text and rfc822 format available.

Message #31 received at 44351 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 44351 <at> debbugs.gnu.org, Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Sun, 1 Nov 2020 08:02:46 -0800
[Message part 1 (text/plain, inline)]
On Sun, Nov 1, 2020 at 7:31 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> Hi,
> By the way, I was wondering whether to add the test to ere.tests or
> spencer1.tests or to a new file.  How should they be used properly?

Adding the new test in either place is fine, but there should be a comment.

Also, we need a NEWS entry. I'll add that separately.

More importantly, there must be a test in gnulib. I'm adding one with
the attached.
[0001-dfa-tests-test-for-invalid-merge-fix.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sun, 01 Nov 2020 16:06:01 GMT) Full text and rfc822 format available.

Message #34 received at 44351 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 44351 <at> debbugs.gnu.org, Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Sun, 1 Nov 2020 08:05:12 -0800
On Sun, Nov 1, 2020 at 8:02 AM Jim Meyering <jim <at> meyering.net> wrote:
>
> On Sun, Nov 1, 2020 at 7:31 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > Hi,
> > By the way, I was wondering whether to add the test to ere.tests or
> > spencer1.tests or to a new file.  How should they be used properly?
>
> Adding the new test in either place is fine, but there should be a comment.
>
> Also, we need a NEWS entry. I'll add that separately.

My mention of NEWS here is ambiguous.
When I wrote that, I was thinking of grep's NEWS file.
Just after sending, I realized I must also mention the fix in gnulib's
NEWS file.
Amending shortly.

> More importantly, there must be a test in gnulib. I'm adding one with
> the attached.




Information forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sun, 01 Nov 2020 16:11:01 GMT) Full text and rfc822 format available.

Message #37 received at 44351 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 44351 <at> debbugs.gnu.org, Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Sun, 1 Nov 2020 08:09:51 -0800
On Sun, Nov 1, 2020 at 8:05 AM Jim Meyering <jim <at> meyering.net> wrote:
> On Sun, Nov 1, 2020 at 8:02 AM Jim Meyering <jim <at> meyering.net> wrote:
> >
> > On Sun, Nov 1, 2020 at 7:31 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > > Hi,
> > > By the way, I was wondering whether to add the test to ere.tests or
> > > spencer1.tests or to a new file.  How should they be used properly?
> >
> > Adding the new test in either place is fine, but there should be a comment.
> >
> > Also, we need a NEWS entry. I'll add that separately.
>
> My mention of NEWS here is ambiguous.
> When I wrote that, I was thinking of grep's NEWS file.
> Just after sending, I realized I must also mention the fix in gnulib's
> NEWS file.
> Amending shortly.

I hadn't read gnulib's NEWS file in too long. It's for things like API
changes, not bug fixes.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sun, 01 Nov 2020 18:04:01 GMT) Full text and rfc822 format available.

Notification sent to Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>:
bug acknowledged by developer. (Sun, 01 Nov 2020 18:04:02 GMT) Full text and rfc822 format available.

Message #42 received at 44351-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 44351-done <at> debbugs.gnu.org, Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Sun, 1 Nov 2020 10:03:37 -0800
Thanks to all for the bug report and quick fix. Closing the bug report.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sun, 01 Nov 2020 18:04:02 GMT) Full text and rfc822 format available.

Notification sent to Gonzalo Padrino <grimalgon <at> gmail.com>:
bug acknowledged by developer. (Sun, 01 Nov 2020 18:04:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#44351; Package grep. (Sun, 01 Nov 2020 18:20:02 GMT) Full text and rfc822 format available.

Message #50 received at 44351-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 44351-done <at> debbugs.gnu.org, Gonzalo Padrino <grimalg.on+gnu <at> gmail.com>,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#44351: Bug in grep v3.2 onwards in regular expression matching
Date: Sun, 1 Nov 2020 10:19:30 -0800
On Sun, Nov 1, 2020 at 10:03 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Thanks to all for the bug report and quick fix. Closing the bug report.

Thanks for closing that. I've pushed the gnulib changes and am about
to push those for grep, too.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 30 Nov 2020 12:24:07 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 173 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.