GNU bug report logs - #39678
'grep --ignore-case --color' does not always color the matches

Previous Next

Package: grep;

Reported by: Benno Schulenberg <bensberg <at> telfort.nl>

Date: Wed, 19 Feb 2020 15:28:01 UTC

Severity: normal

Merged with 51255, 51256, 51257

To reply to this bug, email your comments to 39678 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Wed, 19 Feb 2020 15:28:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Benno Schulenberg <bensberg <at> telfort.nl>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 19 Feb 2020 15:28:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Benno Schulenberg <bensberg <at> telfort.nl>
To: bug-grep <at> gnu.org
Cc: bpaddis <at> gmail.com
Subject: 'grep --ignore-case --color' does not always color the matches
Date: Wed, 19 Feb 2020 11:18:18 +0100
[Message part 1 (text/plain, inline)]
  echo a | grep -i --color '\a'

  echo a | grep -i --color '\A'

Of the above two commands, only the second colorizes the printed "a".

(An old GNU grep on NetBSD (grep 2.5.1a nb1) does the opposite of
a modern grep: it will show color for \a, but none for \A.)

Bug was found in GNU grep 3.1 while checking whether it understands
\d as a shorthand for [0-9].  Still present in 3.4. The locale does
not appear to matter.


In a report against glibc [1] that seems to be related, a comment
says that "Unknown backslash escapes invoke undefined behaviour."
But where in the documentation does it say so?


When searching for a regular expression in GNU nano (^W M-R), nano says
it cannot find any \a.  But for \A, it will find all "a"s and "A"s.
Nano's default search is case insensitive.  See [2] for the original
report by Ben Addis.  I suppose this is the same or a related bug in
the regex module of gnulib, which nano uses.


[1] https://sourceware.org/bugzilla/show_bug.cgi?id=22425

[2] https://savannah.gnu.org/bugs/?57852

Benno

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 04:00:02 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: bug-grep <at> gnu.org, bensberg <at> telfort.nl
Subject: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Sun, 17 Oct 2021 22:38:27 +0200
[Message part 1 (text/plain, inline)]
Hello,
I did some investigation and it seems the problem with "\a" is not that the
output is not colored but that it's printed at all while it should not be.
It's being printed even when re_search in EGexecute doesn't return a match.
As a result the for loop in print_line_middle is not being run - this means
the "a" output that one can see is not from the matched string but from the
rest of the line. To fix this I suggest to return NULL from
print_line_middle if `b` variable equals NULL - which means the loop was
not executed.

I'm sending the patch with the fix in the attachment.

Best regards,
Tomasz Dziendzielski
[Message part 2 (text/html, inline)]
[0001-grep-Don-t-print-line-if-matcher-returned-nothing.patch (application/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 04:00:05 GMT) Full text and rfc822 format available.

Message #11 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: bensberg <at> telfort.nl, bug-grep <at> gnu.org
Cc: 39678 <at> debbugs.gnu.org
Subject: #39678: grep --ignore-case --color does not always color the matches
Date: Sun, 17 Oct 2021 23:37:30 +0200
[Message part 1 (text/plain, inline)]
Hello,
I did some investigation and it seems the problem with "\a" is not that the
output is not colored but that it's printed at all while it should not be.
It's being printed even when re_search in EGexecute doesn't return a match.
As a result the for loop in print_line_middle is not being run - this means
the "a" output that you see is not from the matched string but from the
rest of the line. To fix this I suggest to return NULL from
print_line_middle if `b` variable equals NULL - which means the loop was
not executed.

I'm sending the patch with the fix in the attachment.

Best regards,
Tomasz Dziendzielski
[Message part 2 (text/html, inline)]
[0001-grep-Don-t-print-line-if-matcher-returned-nothing.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 04:00:06 GMT) Full text and rfc822 format available.

Message #14 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: bug-grep <at> gnu.org
Cc: 39678 <at> debbugs.gnu.org
Subject: grep --ignore-case --color does not always color the matches
Date: Mon, 18 Oct 2021 00:02:36 +0200
[Message part 1 (text/plain, inline)]
Hello,
I did some investigation and it seems the problem with "\a" is not that the
output is not colored but that it's printed at all while it should not be.
It's being printed even when re_search in EGexecute doesn't return a match.
As a result the for loop in print_line_middle is not being run - this means
the "a" output that you see is not from the matched string but from the
rest of the line. To fix this I suggest to return NULL from
print_line_middle if `b` variable equals NULL - which means the loop was
not executed.

I'm sending the patch with the fix in the attachment.

Best regards,
Tomasz Dziendzielski
[Message part 2 (text/html, inline)]
[0001-grep-Don-t-print-line-if-matcher-returned-nothing.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 04:00:06 GMT) Full text and rfc822 format available.

Message #17 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: 39678 <at> debbugs.gnu.org
Subject: 'grep --ignore-case --color' does not always color the matches
Date: Mon, 18 Oct 2021 00:15:05 +0200
[Message part 1 (text/plain, inline)]
Hello,
I did some investigation and it seems the problem with "\a" is not that the
output is not colored but that it's printed at all while it should not be.
It's being printed even when re_search in EGexecute doesn't return a match.
As a result the for loop in print_line_middle is not being run - this means
the "a" output that you see is not from the matched string but from the
rest of the line. To fix this I suggest to return NULL from
print_line_middle if `b` variable equals NULL - which means the loop was
not executed.

I'm sending the patch with the fix in the attachment.

Best regards,
Tomasz Dziendzielski
[Message part 2 (text/html, inline)]
[0001-grep-Don-t-print-line-if-matcher-returned-nothing.patch (application/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 04:00:07 GMT) Full text and rfc822 format available.

Message #20 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: bug-grep <at> gnu.org
Cc: 39678 <at> debbugs.gnu.org
Subject: grep --ignore-case --color does not always color the matches
Date: Mon, 18 Oct 2021 00:26:41 +0200
[Message part 1 (text/plain, inline)]
Hello,
I did some investigation and it seems the problem with '\a' is not that the output is not colored but that it's printed at all while it should not be. It's being printed even when re_search in EGexecute doesn't return a match. As a result the for loop in print_line_middle is not being run - this means the 'a' output that you see is not from the matched string but from the rest of the line. To fix this I suggest to return NULL from print_line_middle if 'b' variable equals NULL - which means the loop was not executed.

I'm sending the patch with the fix in the attachment.

Best regards,
Tomasz Dziendzielski
[Message part 2 (text/html, inline)]
[0001-grep-Don-t-print-line-if-matcher-returned-nothing.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 04:09:01 GMT) Full text and rfc822 format available.

Message #23 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
Cc: 39678 <at> debbugs.gnu.org
Subject: Re: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Sun, 17 Oct 2021 21:08:30 -0700
On 10/17/21 15:15, Tomasz Dziendzielski wrote:
> It's being printed even when re_search in EGexecute doesn't return a match.

Gexecute should return a match, though, right? This is because the 
pattern '\a' matches the data 'a'. So it sounds like the bug is in 
EGexecute somewhere, not in its caller.




Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 04:46:02 GMT) Full text and rfc822 format available.

Message #26 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: 39678 <at> debbugs.gnu.org
Subject: Fwd: 'grep --ignore-case --color' does not always color the matches
Date: Mon, 18 Oct 2021 06:44:48 +0200
[Message part 1 (text/plain, inline)]
Hi,
sorry for producing such a spam, there was huge 7h delay between my mail
and the reply from the mailing list, so I thought something was not sent
correctly on my side.

>EGexecute should return a match, though,
>right? The pattern '\a' matches
>the data 'a'. So the bug is in EGexecute >somewhere, not in its caller.
Thanks for the reply. In my opinion the bug is not in the EGexecute, since
it uses re_search (from gnulib) and the re_search is not returning a match.
So there is no problem in EGexecute. I also compared python re.search with
\a also don't return a match and I found \a can be interpreted a bell or
alarm.
It's good to mention that the lack of match happens for all lowercase
characters that are not defined in regex (for example echo "j" | grep -i
--color '\j'. Comparing to python, they don't allow us to escape any
incorrect lowercase characters, for example:
>>>re.search("\j","j", re.IGNORECASE)
re.error: bad escape \j at position 0
Due to that I think grep should also ignore the non-maching/bad character
instead of printing them.

Best regards,
Tomasz Dziendzielski
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 04:52:01 GMT) Full text and rfc822 format available.

Message #29 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 39678 <at> debbugs.gnu.org
Subject: Re: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Mon, 18 Oct 2021 06:53:01 +0200
[Message part 1 (text/plain, inline)]
>EGexecute should return a match, though,
>right? The pattern '\a' matches
>the data 'a'. So the bug is in EGexecute >somewhere, not in its caller.
Thanks for the reply. In my opinion the bug is not in the EGexecute, since
it uses re_search (from gnulib) and the re_search is not returning a match.
So there is no problem in EGexecute. I also compared python re.search with
\a also don't return a match and I found \a can be interpreted a bell or
alarm.
It's good to mention that the lack of match happens for all lowercase
characters that are not defined in regex (for example echo "j" | grep -i
--color '\j'. Comparing to python, they don't allow us to escape any
incorrect lowercase characters, for example:
>>>re.search("\j","j", re.IGNORECASE)
re.error: bad escape \j at position 0
Due to that I think grep should also ignore the non-maching/bad character
instead of printing them.

Best regards,
Tomasz Dziendzielski

pon., 18 paź 2021 o 06:08 Paul Eggert <eggert <at> cs.ucla.edu> napisał(a):

> On 10/17/21 15:15, Tomasz Dziendzielski wrote:
> > It's being printed even when re_search in EGexecute doesn't return a
> match.
>
> Gexecute should return a match, though, right? This is because the
> pattern '\a' matches the data 'a'. So it sounds like the bug is in
> EGexecute somewhere, not in its caller.
>
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 05:12:02 GMT) Full text and rfc822 format available.

Message #32 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 39678 <at> debbugs.gnu.org
Subject: Re: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Mon, 18 Oct 2021 07:10:44 +0200
[Message part 1 (text/plain, inline)]
Also, even if we assume it should return a match (which I think it
shouldn't), then there's a second bug in print_line_middle, because why
would it print anything if there was no match?

Best regards,
Tomasz Dziendzielski
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 13:38:01 GMT) Full text and rfc822 format available.

Message #35 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 39678 <at> debbugs.gnu.org
Subject: Re: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Mon, 18 Oct 2021 15:39:03 +0200
[Message part 1 (text/plain, inline)]
Additional note, according to POSIX regex standard:
"An ordinary character is a BRE that matches itself: any character in the
supported character set, except for the BRE special characters listed in
BRE Special Characters.
The interpretation of an ordinary character preceded by an unescaped
<backslash> ( '\\' ) is undefined, except for: [...]"
which means that escaping ordinary characters (like \a) has undefined
behaviour, thus I think it should not be matched.
So having EGexecute match 'a' to '\a' is incorrect, and instead we should
remove the matching and make sure it's not printed. That would make my
patch correct.

Reference:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_03

Looking forward to feedback.

Best regards,
Tomasz Dziendzielski

pon., 18 paź 2021 o 07:10 Tomasz Dziendzielski <
tomasz.dziendzielski <at> gmail.com> napisał(a):

> Also, even if we assume it should return a match (which I think it
> shouldn't), then there's a second bug in print_line_middle, because why
> would it print anything if there was no match?
>
> Best regards,
> Tomasz Dziendzielski
>
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 15:41:02 GMT) Full text and rfc822 format available.

Message #38 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 39678 <at> debbugs.gnu.org
Subject: Re: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Mon, 18 Oct 2021 17:39:49 +0200
[Message part 1 (text/plain, inline)]
Now I noticed that it introduces inconsistency, but due to a second problem
- that grep prints lines even when there were no matching strings. For
example if I run:
$ echo "a" | grep -E "{1,3}"
$ echo "a" | grep -i "\a"
then it returns the string "a" even though it doesn't match anything. With
my patch it correctly does not print the string but I think for consistency
it also requires adaptations in case --color was not passed. I'll try to
work on that.

Best regards,
Tomasz Dziendzielski
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 16:12:02 GMT) Full text and rfc822 format available.

Message #41 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 39678 <at> debbugs.gnu.org
Subject: Re: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Mon, 18 Oct 2021 18:11:34 +0200
[Message part 1 (text/plain, inline)]
A different inconsistency is in gnulib regex also, which matches 'a' to
'\a' but do not match the same if case insensitive is set. I'll still try
to work on that.

Best regards,
Tomasz Dziendzielski

pon., 18 paź 2021, 17:39 użytkownik Tomasz Dziendzielski <
tomasz.dziendzielski <at> gmail.com> napisał:

> Now I noticed that it introduces inconsistency, but due to a second
> problem - that grep prints lines even when there were no matching strings.
> For example if I run:
> $ echo "a" | grep -E "{1,3}"
> $ echo "a" | grep -i "\a"
> then it returns the string "a" even though it doesn't match anything. With
> my patch it correctly does not print the string but I think for consistency
> it also requires adaptations in case --color was not passed. I'll try to
> work on that.
>
> Best regards,
> Tomasz Dziendzielski
>
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 18 Oct 2021 17:00:02 GMT) Full text and rfc822 format available.

Message #44 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
Cc: 39678 <at> debbugs.gnu.org
Subject: Re: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Mon, 18 Oct 2021 09:59:19 -0700
On 10/17/21 21:53, Tomasz Dziendzielski wrote:
> Thanks for the reply. In my opinion the bug is not in the EGexecute, since
> it uses re_search (from gnulib) and the re_search is not returning a match.
> So there is no problem in EGexecute.

It seems to me that there is a bug, because the search in question is 
done to a line for which EGexecute has already reported a match. 
Therefore, if you search again with the same pattern, you should find at 
least one match. However, in this buggy case, EGexecute is not finding a 
match, and this messes up the caller.

> even if we assume it should return a match (which I think it
> shouldn't), then there's a second bug in print_line_middle, because why
> would it print anything if there was no match?

It should be able to assume that there is at least one match, so it 
shouldn't need to worry about the case where there are zero matches.

> which means that escaping ordinary characters (like \a) has undefined
> behaviour,

Yes, and that means this bug is not a POSIX-conformance issue. However, 
it's still a bug, since '\a' should match 'a' in GNU grep, at least 
unless and until we decide that \a matches something else.

> So having EGexecute match 'a' to '\a' is incorrect

Not at all. Undefined behavior means 'grep' can do anything it likes, as 
far as POSIX is concerned.





Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Tue, 19 Oct 2021 18:38:01 GMT) Full text and rfc822 format available.

Message #47 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 39678 <at> debbugs.gnu.org
Subject: Re: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Tue, 19 Oct 2021 20:36:43 +0200
[Message part 1 (text/plain, inline)]
I found that for optimization purposes common cases in re_string_peek_byte
are missed if pstr->mbs_allocated is true, thus skipping them also for case
insensitive.

My solution would be to return re_string_peek_byte also if pstr->icase is
true. mbs_allocated is changed also depending on icase but I don't think we
want to change it globally, so I think my patch should be fine and not
affect optimization.

Please check the attachment. If the patch is fine should I also send this
to gnulib mailing list to have it submitted or is this one enough?

Best regards,
Tomasz Dziendzielski
[Message part 2 (text/html, inline)]
[0001-regex-Handle-common-easiest-cases-if-case-insensitiv.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Tue, 19 Oct 2021 23:24:02 GMT) Full text and rfc822 format available.

Message #50 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
Cc: 39678 <at> debbugs.gnu.org
Subject: Re: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Tue, 19 Oct 2021 16:23:41 -0700
On 10/19/21 11:36, Tomasz Dziendzielski wrote:
> I found that for optimization purposes common cases in re_string_peek_byte
> are missed if pstr->mbs_allocated is true, thus skipping them also for case
> insensitive.

Unfortunately I've lost context here. Is this patch being proposed to 
improve performance, or to fix a bug?

I don't see how the patch preserves correctness: if pstr->icase is true, 
surely in general we need to look at the translated string, not the 
original.




Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Wed, 20 Oct 2021 01:11:02 GMT) Full text and rfc822 format available.

Message #53 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 39678 <at> debbugs.gnu.org
Subject: Re: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Wed, 20 Oct 2021 03:10:25 +0200
[Message part 1 (text/plain, inline)]
>> I found that for optimization purposes common cases in
re_string_peek_byte
>> are missed if pstr->mbs_allocated is true, thus skipping them also for
case
>> insensitive.

>Unfortunately I've lost context here. Is this patch being proposed to
>improve performance, or to fix a bug?

>I don't see how the patch preserves correctness: if pstr->icase is true,
>surely in general we need to look at the translated string, not the
>original.

It's to fix the bug. Without the patch if we use icase this
re_string_peek_byte was not executed and as a consequence match is not
found.
If we apply the patch grep is able to match the '\a' correctly. The patch
has nothing to do with improving performance, if was just an additional
comment from my side that the fix should not affect the performance (in my
opinion). Sorry for confusion.

Best regards,
Tomasz Dziendzielski
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Tue, 02 Nov 2021 19:26:01 GMT) Full text and rfc822 format available.

Message #56 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 39678 <at> debbugs.gnu.org
Subject: Re: bug#39678: 'grep --ignore-case --color' does not always color the
 matches
Date: Tue, 2 Nov 2021 20:24:51 +0100
[Message part 1 (text/plain, inline)]
Hi,
so is my patch correct? Without my patch re_string_peek_byte() is not
executed - and this is the function that is executed (and where the match
is found) when "--ignore-case" is not enabled. With my patch grep is able
to match the '\a' while still preserving the correctness for other cases.
For sure some tests would be helpful but from what I checked my patch is
not breaking anything else while adding the correct match for '\a' case.

If you still think the patch is wrong, could you please point me in the
right direction? Maybe you could point me where it's not preserving
correctness.

Best regards,
Tomasz Dziendzielski

śr., 20 paź 2021 o 03:10 Tomasz Dziendzielski <
tomasz.dziendzielski <at> gmail.com> napisał(a):

> >> I found that for optimization purposes common cases in
> re_string_peek_byte
> >> are missed if pstr->mbs_allocated is true, thus skipping them also for
> case
> >> insensitive.
>
> >Unfortunately I've lost context here. Is this patch being proposed to
> >improve performance, or to fix a bug?
>
> >I don't see how the patch preserves correctness: if pstr->icase is true,
> >surely in general we need to look at the translated string, not the
> >original.
>
> It's to fix the bug. Without the patch if we use icase this
> re_string_peek_byte was not executed and as a consequence match is not
> found.
> If we apply the patch grep is able to match the '\a' correctly. The patch
> has nothing to do with improving performance, if was just an additional
> comment from my side that the fix should not affect the performance (in my
> opinion). Sorry for confusion.
>
> Best regards,
> Tomasz Dziendzielski
>
[Message part 2 (text/html, inline)]

Merged 39678 51255 51256 51257. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Wed, 24 Nov 2021 02:52:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Sat, 21 May 2022 10:06:02 GMT) Full text and rfc822 format available.

Message #61 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 39678 <at> debbugs.gnu.org
Cc: Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>,
 Jim Meyering <jim <at> meyering.net>
Subject: POSIXLY_CORRECT removal, and oddball regex doc
Date: Sat, 21 May 2022 03:05:38 -0700
Looking again at grep bug 39678 <https://bugs.gnu.org/39678> I noticed 
that the bug occurs even when grep is not coloring:

echo a | grep -oi --color=never '\a'

This outputs nothing and exits with status 0, which is clearly wrong.

I tracked this down to a bug buried deep in the bowels of glibc regex, a 
bug that Tomasz also spotted. It's not trivial to fix (the fix that 
Tomasz sent in doesn't feel right, at least for \X where X is a 
multibyte letter) and any fix would be low priority since the bug occurs 
only in regular expressions like '\a' that have unspecified behavior - 
which means the behavior though wrong nevertheless conforms to POSIX.

I'm inclined to address this by having GNU 'grep' diagnose unspecified 
regexps like '\a' and exit with status 2, much as it already diagnoses 
unspecified regexps like '[:alpha:]'. If this approach sounds too 
drastic, a gentler approach would be for 'grep' to warn about '\a' 
without changing the exit status for now, and escalate the warning to 
exit with status 2 in a later 'grep' release.




Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Sat, 21 May 2022 18:42:01 GMT) Full text and rfc822 format available.

Message #64 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 39678 <at> debbugs.gnu.org,
 Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
Subject: Re: POSIXLY_CORRECT removal, and oddball regex doc
Date: Sat, 21 May 2022 11:40:58 -0700
On Sat, May 21, 2022 at 3:05 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Looking again at grep bug 39678 <https://bugs.gnu.org/39678> I noticed
> that the bug occurs even when grep is not coloring:
>
> echo a | grep -oi --color=never '\a'
>
> This outputs nothing and exits with status 0, which is clearly wrong.

Wrong, indeed!

> I tracked this down to a bug buried deep in the bowels of glibc regex, a
> bug that Tomasz also spotted. It's not trivial to fix (the fix that
> Tomasz sent in doesn't feel right, at least for \X where X is a
> multibyte letter) and any fix would be low priority since the bug occurs
> only in regular expressions like '\a' that have unspecified behavior -
> which means the behavior though wrong nevertheless conforms to POSIX.
>
> I'm inclined to address this by having GNU 'grep' diagnose unspecified
> regexps like '\a' and exit with status 2, much as it already diagnoses
> unspecified regexps like '[:alpha:]'. If this approach sounds too
> drastic, a gentler approach would be for 'grep' to warn about '\a'
> without changing the exit status for now, and escalate the warning to
> exit with status 2 in a later 'grep' release.

In my experience, there are many lurking uses of things like '\a', and
would like to ease into this gently, so I much prefer your latter
approach: warn now, and change grep's exit status later -- at least 6
months to a year later, to give people
at least a chance to fix their offending uses before they break.

Thanks for all the improvements!




Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Sun, 22 May 2022 22:25:02 GMT) Full text and rfc822 format available.

Message #67 received at 39678 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: 39678 <at> debbugs.gnu.org,
 Tomasz Dziendzielski <tomasz.dziendzielski <at> gmail.com>
Subject: Re: bug#39678: POSIXLY_CORRECT removal, and oddball regex doc
Date: Sun, 22 May 2022 15:22:59 -0700
[Message part 1 (text/plain, inline)]
On 5/21/22 11:40, Jim Meyering wrote:
> In my experience, there are many lurking uses of things like '\a', and
> would like to ease into this gently, so I much prefer your latter
> approach: warn now, and change grep's exit status later

Sounds good.

When I started looking into that, I discovered that the grep manual 
doesn't cover these lurkers well. And although I installed a patch 
yesterday about this, after looking at the POSIX spec again today I 
discovered that I'd missed quite a few lurkers. So I just now installed 
the attached documentation fix, which attempts to cover all the 
remaining problem regexps, and to give us room to add warnings for some 
of them soon.

We shouldn't warn about all these problems, not without a --pedantic 
flag or something like that (something I'm probably too busy to add). 
But I expect it'd be good to warn about areas where grep's semantics 
don't match any reasonable expectation.

We've already uncovered one area, where \a doesn't work as expected and 
where a warning diagnostic would be helpful. Here's another one, where 
an oddly-placed '*' doesn't work as one would expect:

$ printf '*\na\n*a\n' | grep '\(*\)'
*
*a
$ printf '*\na\n*a\n' | grep -E '(*)'
grep: Unmatched ( or \(
$ printf '*\na\n*a\n' | grep '\(*a\)'
*a
$ printf '*\na\n*a\n' | grep -E '(*a)'
a
*a

Although not a POSIX violation, here 'grep -E' is "wrong" for any 
reasonable definition of "wrong" that I can think of. The attached patch 
changes the doc to say that this regular expression has unspecified 
behavior (something that POSIX allows).

(Who would have thought regular expressions were so complicated? :-)
[0001-doc-document-regex-corner-cases-better.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Mon, 23 May 2022 20:04:02 GMT) Full text and rfc822 format available.

Message #70 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: bug-grep <at> gnu.org
Subject: Re: bug#39678: warn about stray backslashes
Date: Mon, 23 May 2022 13:03:00 -0700
[Message part 1 (text/plain, inline)]
On 5/22/22 15:22, Paul Eggert wrote:

> We've already uncovered one area, where \a doesn't work as expected and 
> where a warning diagnostic would be helpful.

I installed the attached patch to cause 'grep' to warn about these. 
Comments welcome. Most of the changes were in Gnulib's dfa module, which 
see.


> Here's another one, where an oddly-placed '*' doesn't work as one would expect:
> 
> $ printf '*\na\n*a\n' | grep '\(*\)'
> *
> *a
> $ printf '*\na\n*a\n' | grep -E '(*)'
> grep: Unmatched ( or \(
> $ printf '*\na\n*a\n' | grep '\(*a\)'
> *a
> $ printf '*\na\n*a\n' | grep -E '(*a)'
> a
> *a 

I plan to look at this next. We shouldn't warn about BREs like '\(*\)' 
and '\(*a\)' as these conform to POSIX and work fine. But it makes sense 
to warn about EREs like '(*)', '(*a)', '(+)', '(+a)', '({1})', '({1}a)' 
as POSIX does not specify their behavior, their semantics are 
unpredictable with GNU grep, and it's plausible that people are making 
mistakes in this area.
[0001-build-update-gnulib-submodule-to-latest.patch (text/x-patch, attachment)]
[0002-grep-warn-about-stray-backslashes.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#39678; Package grep. (Wed, 25 May 2022 00:53:02 GMT) Full text and rfc822 format available.

Message #73 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: bug-grep <at> gnu.org
Subject: Re: bug#39678: warn about stray backslashes
Date: Tue, 24 May 2022 17:52:48 -0700
[Message part 1 (text/plain, inline)]
On 5/23/22 13:03, Paul Eggert wrote:
> it makes sense to warn about EREs like '(*)', '(*a)', '(+)', '(+a)', 
> '({1})', '({1}a)' as POSIX does not specify their behavior, their 
> semantics are unpredictable with GNU grep, and it's plausible that 
> people are making mistakes in this area.

I installed the attached to do that. As before, most of the changes were 
in Gnulib's dfa module.

With all these changes, we now see behavior like this:

$ echo a | src/grep -oi '\a'
src/grep: warning: stray \ before a

which is not ideal (so I'll leave the bug report open), but at least 
this gives the user a warning that the pattern is not on the up-and-up.
[0001-build-update-gnulib-submodule-to-latest.patch (text/x-patch, attachment)]
[0002-grep-warn-about-x-etc.patch (text/x-patch, attachment)]

This bug report was last modified 1 year and 338 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.