GNU bug report logs - #26864
Clarification on obscure regular expressions mentioned in known bugs

Previous Next

Package: grep;

Reported by: Sundeep Agarwal <learnbyexample.net <at> gmail.com>

Date: Wed, 10 May 2017 15:31:02 UTC

Severity: normal

Merged with 36148

To reply to this bug, email your comments to 26864 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#26864; Package grep. (Wed, 10 May 2017 15:31:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Sundeep Agarwal <learnbyexample.net <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 10 May 2017 15:31:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Sundeep Agarwal <learnbyexample.net <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Clarification on obscure regular expressions mentioned in known bugs
Date: Wed, 10 May 2017 14:18:37 +0530
[Message part 1 (text/plain, inline)]
Hello,

From the man page, version 'grep (GNU grep) 2.25'

--------------------------------
   Known Bugs
       Large repetition counts in the {n,m} construct may cause grep to use
lots of  memory.   In  addition,  certain other obscure regular expressions
require exponential time and space, and may cause grep to run out of memory.
       Back-references are very slow, and may require exponential time.
--------------------------------

I was trying a regular expression to find words from dictionary that have
two different instances of repeated letters, for example the words:
misspellings, chilliness, woodcutter etc

$ # gives no output
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words

$ # works as expected with PCRE
$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed


I asked regarding this on
https://stackoverflow.com/questions/43572924/ere-adding-quantifier-to-group-with-inner-group-and-back-reference
and other forums. It helped to identify some more cases like

$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){2}'
aazbbycc
$ # no output
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){3}'

and

$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]{0,3}){3}'
aazbbycc
$ # no output
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]{0,4}){3}'

$ # seems dependent on character class clashing with back reference
characters and quantifier count
$ # a, b and c are the characters matching back reference
$ echo 'aazbbycc' | grep -E '(([a-z])\2[abcyz]{0,4}){2}'
aazbbycc
$ echo 'aazbbycc' | grep -E '(([a-z])\2[abcyz]{0,4}){3}'
$ echo 'aazbbycc' | grep -E '(([a-z])\2[abyz]{0,4}){3}'
$ echo 'aazbbycc' | grep -E '(([a-z])\2[byz]{0,4}){3}'
$ echo 'aazbbycc' | grep -E '(([a-z])\2[acyz]{0,4}){3}'
aazbbycc
$ echo 'aazbbycc' | grep -E '(([a-z])\2[ayz]{0,4}){3}'
aazbbycc
$ echo 'aazbbycc' | grep -E '(([a-z])\2[cyz]{0,4}){3}'
aazbbycc
$ echo 'aazbbycc' | grep -E '(([a-z])\2[yz]{0,4}){3}'
aazbbycc

The same behavior is seen with 'sed (GNU sed) 4.2.2' as well. For ex:

$ echo 'aazbbycc' | sed -nE '/(([a-z])\2[a-z]{0,3}){3}/p'
aazbbycc
$ # no output
$ echo 'aazbbycc' | sed -nE '/(([a-z])\2[a-z]{0,4}){3}/p'


So, my question is whether these regular expression examples come under
'obscure regular expressions' mentioned in the man page. If so, I feel
there should be an error message displayed instead of no output

Regards,
Sundeep
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#26864; Package grep. (Mon, 30 Dec 2019 09:02:01 GMT) Full text and rfc822 format available.

Message #8 received at 26864 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Sundeep Agarwal <learnbyexample.net <at> gmail.com>
Cc: 26864 <at> debbugs.gnu.org
Subject: Re: bug#26864: Clarification on obscure regular expressions mentioned
 in known bugs
Date: Mon, 30 Dec 2019 01:01:21 -0800
[Message part 1 (text/plain, inline)]
On 5/10/17 1:48 AM, Sundeep Agarwal wrote:

> my question is whether these regular expression examples come under
> 'obscure regular expressions' mentioned in the man page.

No, they're bugs in grep's regular expression implementation, which is taken
from glibc. I filed a bug report against glibc here:

https://sourceware.org/bugzilla/show_bug.cgi?id=25322

and installed the attached patches to GNU grep to try to document this mess
better. The first patch is the important one; the other two merely standardize
spelling and fix a typo in the first patch. Thanks for reporting the problem.
[0001-doc-mention-back-reference-bugs.patch (text/x-patch, attachment)]
[0002-doc-spell-back-reference-more-consistently.patch (text/x-patch, attachment)]
[0003-doc-fix-bug-typo.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#26864; Package grep. (Mon, 30 Dec 2019 18:52:02 GMT) Full text and rfc822 format available.

Message #11 received at 26864 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Sundeep Agarwal <learnbyexample.net <at> gmail.com>
Cc: 26864 <at> debbugs.gnu.org
Subject: Re: bug#26864: Clarification on obscure regular expressions mentioned
 in known bugs
Date: Mon, 30 Dec 2019 10:51:16 -0800
[Message part 1 (text/plain, inline)]
On further thought, we shouldn't be encouraging palindromish REs in the manual,
so that naive users aren't drawn into this mess. So I installed the attached
further patch to the documentation.

Of course it would be better to fix the back-reference bugs but this is low
priority and I doubt whether anybody has the time.
[0001-doc-don-t-encourage-back-references.patch (text/x-patch, attachment)]

Merged 26864 36148. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Thu, 02 Jan 2020 09:35:01 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 86 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.