GNU bug report logs - #24975
Matching issues with characters whose encoding ends in some other character

Package: grep

Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>

Date: Sun, 20 Nov 2016 21:51:01 UTC

Severity: normal

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

Report forwarded to bug-grep <at> gnu.org:
bug#24975; Package grep. (Sun, 20 Nov 2016 21:51:02 GMT)

Acknowledgement sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 20 Nov 2016 21:51:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Matching issues with characters whose encoding ends in some other
 character
Date: Sun, 20 Nov 2016 21:50:28 +0000

$ locale charmap
GB18030
$ printf '\uC9\n' | grep  '.*7'  | hd
00000000  81 30 87 37 0a                                    |.0.7.|
00000005

U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
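
The byte sequence in the dump above can be cross-checked outside grep. This is a minimal sketch, assuming Python 3 and its built-in "gb18030" codec as a stand-in for the C library's GB18030 locale used in the report:

```python
# Encode U+00C9 with Python's built-in "gb18030" codec (an assumption:
# the report used the C library's GB18030 locale, not Python).
encoded = "\u00c9".encode("gb18030")

# Hex dump of the encoded character, without the trailing newline.
print(" ".join(f"{b:02x}" for b in encoded))   # 81 30 87 37

# The second and fourth bytes are also valid ASCII characters on their own.
print(encoded[1:2], encoded[3:4])              # b'0' b'7'
```

A matcher that looks at bytes rather than characters can therefore mistake the tail of this one character for a literal 7.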

$ printf '\uC9\n' | grep  '.*0'

fails.

$ printf '\uC9\n' | grep  -o '.*7'

returns with a zero exit status but outputs nothing. It's as if
.*7 matched an empty string somewhere.

printf '\uC9\n' | grep  '\(.*7\)\1'

fails.

so do:

grep 7
grep '7$'
grep '.7'
grep '[^x]*7'
printf 'x\uC9\n' | grep -E '.+7'

These match:

grep '.\{0,1\}7'
grep -E '.?7'
printf '\uC9x\n' | grep  '.*7x' # still outputs nothing with -o

That's not confined to GB18030. You get similar issues with
BIG5-HKSCS, BIG5 or GBK.

$ locale charmap
BIG5-HKSCS
$ printf '\ue9\n' | grep  '.*m'  | hd
00000000  88 6d 0a                                          |.m.|
00000003
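
The encodings named above share one property: the non-leading bytes of a multibyte character may fall in the ASCII range. The sketch below illustrates this with Python's built-in codecs, which are assumed here to agree with the C library charmaps for these samples. The helper name ascii_bytes_inside and the GBK and BIG5 sample characters are illustrative choices, not taken from the report. UTF-8 is included for contrast, since none of its multibyte sequences contain ASCII bytes.

```python
def ascii_bytes_inside(ch: str, encoding: str) -> bytes:
    """Return the ASCII-range bytes found in a multibyte encoding of ch.

    Returns b"" when ch encodes to a single byte, since a lone ASCII
    byte is not ambiguous.
    """
    raw = ch.encode(encoding)
    if len(raw) == 1:
        return b""
    return bytes(b for b in raw if b < 0x80)


samples = [
    ("\u00c9", "gb18030"),  # from the report: 81 30 87 37
    ("\u4e02", "gbk"),      # illustrative sample: 81 40
    ("\u4e00", "big5"),     # illustrative sample: a4 40
    ("\u00c9", "utf-8"),    # c3 89, no ASCII bytes
]

for ch, enc in samples:
    print(enc, ch.encode(enc).hex(), ascii_bytes_inside(ch, enc))
```

Any encoding for which this helper returns a non-empty result is one where byte-wise matching can land in the middle of a character.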

Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.

-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#24975; Package grep. (Sun, 20 Nov 2016 23:00:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Re: Matching issues with characters whose encoding ends in some
 other character
Date: Sun, 20 Nov 2016 22:59:10 +0000

2016-11-20 21:50:28 +0000, Stephane Chazelas:
> $ locale charmap
> GB18030
> $ printf '\uC9\n' | grep  '.*7'  | hd
> 00000000  81 30 87 37 0a                                    |.0.7.|
> 00000005
> 
> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
[...]
> Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
[...]

Same behaviour with 2.26 on Solaris 11.

-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#24975; Package grep. (Mon, 21 Nov 2016 05:54:02 GMT)

Message #11 received at 24975 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 24975 <at> debbugs.gnu.org
Subject: Re: bug#24975: Matching issues with characters whose encoding ends in
 some other character
Date: Sun, 20 Nov 2016 21:53:29 -0800

On Sun, Nov 20, 2016 at 2:59 PM, Stephane Chazelas
<stephane.chazelas <at> gmail.com> wrote:
> 2016-11-20 21:50:28 +0000, Stephane Chazelas:
>> $ locale charmap
>> GB18030
>> $ printf '\uC9\n' | grep  '.*7'  | hd
>> 00000000  81 30 87 37 0a                                    |.0.7.|
>> 00000005
>>
>> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
> [...]
>> Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
> [...]
>
> Same behaviour with 2.26 on Solaris 11.

Thank you for the report.
I can reproduce that error on Fedora 25 with this:

  $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep '.*7' k)|wc -c
  5

I confirmed that the problem does not arise (i.e., no match, with exit
status of 1) when we force the use of glibc's regex matcher by
inserting a trivial back-reference:

  $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep -E '()\1.*7' k); echo $?
  1

This bisected to v2.18-54-g3ef4c8e, but that commit was just the
messenger: it exposed the latent bug by making it so this case was no
longer handled by glibc's regexp matcher, but rather by grep's dfa.c.

Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Mon, 28 Nov 2016 00:00:02 GMT)

Notification sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>:
bug acknowledged by developer. (Mon, 28 Nov 2016 00:00:02 GMT)

Message #16 received at 24975-done <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: "bug-gnulib <at> gnu.org List" <bug-gnulib <at> gnu.org>, 24975-done <at> debbugs.gnu.org
Subject: Re: bug#24975: Matching issues with characters whose encoding ends in
 some other character
Date: Sun, 27 Nov 2016 15:59:05 -0800

On Sun, Nov 20, 2016 at 9:53 PM, Jim Meyering <jim <at> meyering.net> wrote:
> On Sun, Nov 20, 2016 at 2:59 PM, Stephane Chazelas
> <stephane.chazelas <at> gmail.com> wrote:
>> 2016-11-20 21:50:28 +0000, Stephane Chazelas:
>>> $ locale charmap
>>> GB18030
>>> $ printf '\uC9\n' | grep  '.*7'  | hd
>>> 00000000  81 30 87 37 0a                                    |.0.7.|
>>> 00000005
>>>
>>> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
>> [...]
>>> Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
>> [...]
>>
>> Same behaviour with 2.26 on Solaris 11.
>
> Thank you for the report.
> I can reproduce that error on Fedora 25 with this:
>
>   $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep '.*7' k)|wc -c
>   5
>
> I confirmed that the problem does not arise (i.e., no match, with exit
> status of 1) when we force the use of glibc's regex matcher by
> inserting a trivial back-reference:
>
>   $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep -E '()\1.*7' k); echo $?
>   1
>
> This bisected to v2.18-54-g3ef4c8e, but that commit was just the
> messenger: it exposed the latent bug by making it so this case was no
> longer handled by glibc's regexp matcher, but rather by grep's dfa.c.

I've fixed this with the following patches, which force any non-UTF8
multibyte locale to use the regex matcher rather than the DFA matcher.
The gnulib/dfa patch makes that change; the grep patch updates to the
latest gnulib and adds tests and a NEWS entry.
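
The rule behind this workaround can be modelled roughly as follows. This is only a sketch of the decision as described in this message, not the code of the attached gnulib patch; the function name use_dfa and its two parameters are hypothetical.

```python
def use_dfa(charmap: str, mb_cur_max: int) -> bool:
    """Hypothetical predicate: may the fast DFA matcher be used?

    charmap    -- locale character map name, e.g. "GB18030" or "UTF-8"
    mb_cur_max -- maximum number of bytes per character in the locale
    """
    is_utf8 = charmap.upper().replace("-", "") == "UTF8"
    is_multibyte = mb_cur_max > 1
    # Workaround as described: a multibyte locale that is not UTF-8
    # falls back to the slower regex matcher.
    return is_utf8 or not is_multibyte


for name, width in [("GB18030", 4), ("BIG5-HKSCS", 2), ("UTF-8", 6), ("ISO-8859-1", 1)]:
    print(name, "DFA" if use_dfa(name, width) else "regex")
```

The cost of this approach is that every pattern in such locales goes through the slower matcher, which is the efficiency concern raised in the next paragraph.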

I suspect this won't be the last word in this area, because it feels
like we should be able to adjust DFA's tables so that people using
such locales can retain DFA's efficiency without the bug in the
current implementation.

[gnulib-dfa-mb-non-UTF8-fix.diff (text/plain, attachment)]

[grep-fix-false-matches-mb-non-UTF8.diff (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#24975; Package grep. (Mon, 28 Nov 2016 13:50:01 GMT)

Message #19 received at 24975 <at> debbugs.gnu.org:

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: jim <at> meyering.net
Cc: 24975 <at> debbugs.gnu.org, stephane.chazelas <at> gmail.com
Subject: Re: bug#24975: Matching issues with characters whose encoding ends in
 some other character
Date: Mon, 28 Nov 2016 22:49:27 +0900

Jim Meyering <jim <at> meyering.net> wrote:

> I suspect this won't be the last word in this area, because it feels
> like we should be able to adjust DFA's tables so that people using
> such locales can retain DFA's efficiency without the bug in the
> current implementation.

Hi Jim,

This is a bug in how dfa handles the period expression (".") in
non-UTF8 locales.  dfa calculates the transitions for single-byte
characters and for a multibyte character separately, then merges both
results.  However, if the transition for single-byte characters goes
back to the initial state, we should stop matching single-byte
characters.
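
The visible symptom, a match that starts or ends in the middle of a multibyte character, can be reproduced in miniature. The sketch below does not model dfa.c or its transition tables; it uses Python's re module as a stand-in and only contrasts byte-oriented matching with character-oriented matching on the line from the original report.

```python
import re

# The input line from the report, encoded as GB18030: 81 30 87 37 0a.
line = "\u00c9\n".encode("gb18030")

# Byte-oriented matching: every byte counts as its own character, so the
# trailing 0x37 of the four-byte sequence is taken for a literal '7'.
bytewise = re.search(rb".*7", line) is not None

# Character-oriented matching: decode first, so the four bytes form one
# character and the line contains no '7' at all.
charwise = re.search(r".*7", line.decode("gb18030")) is not None

print(bytewise, charwise)   # True False
```

The first result corresponds to the false match grep reported; the second is the expected behaviour in a GB18030 locale.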

Thanks,
Norihiro

[0001-dfa-avoid-match-middle-in-multibyte-character.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#24975; Package grep. (Mon, 28 Nov 2016 14:49:02 GMT)

Message #22 received at 24975 <at> debbugs.gnu.org:

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: jim <at> meyering.net
Cc: bug-gnulib <at> gnu.org, 24975 <at> debbugs.gnu.org, stephane.chazelas <at> gmail.com
Subject: Re: bug#24975: Matching issues with characters whose encoding ends in
 some other character
Date: Mon, 28 Nov 2016 23:47:57 +0900

Jim Meyering <jim <at> meyering.net> wrote:

> I suspect this won't be the last word in this area, because it feels
> like we should be able to adjust DFA's tables so that people using
> such locales can retain DFA's efficiency without the bug in the
> current implementation.

Hi Jim,

This is a bug in how dfa handles the period expression (".") in
non-UTF8 locales.  dfa calculates the transitions for single-byte
characters and for a multibyte character separately, then merges both
results.  However, if the transition for single-byte characters goes
back to the initial state, we should stop matching single-byte
characters.

Thanks,
Norihiro

[0001-dfa-avoid-match-middle-in-multibyte-character.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#24975; Package grep. (Mon, 28 Nov 2016 16:49:01 GMT)

Message #25 received at 24975 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, jim <at> meyering.net
Cc: 24975 <at> debbugs.gnu.org, bug-gnulib <at> gnu.org, stephane.chazelas <at> gmail.com
Subject: Re: bug#24975: Matching issues with characters whose encoding ends in
 some other character
Date: Mon, 28 Nov 2016 08:48:29 -0800

Thanks for that DFA fix, which should be much better than the previous 
workaround. I installed it into gnulib and installed the attached patch 
into grep.

[0001-build-update-gnulib-submodule-to-latest.patch (application/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#24975; Package grep. (Mon, 28 Nov 2016 17:13:01 GMT)

Message #28 received at 24975 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 24975 <at> debbugs.gnu.org, Stephane Chazelas <stephane.chazelas <at> gmail.com>
Subject: Re: bug#24975: Matching issues with characters whose encoding ends in
 some other character
Date: Mon, 28 Nov 2016 09:11:55 -0800

On Mon, Nov 28, 2016 at 5:49 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> Jim Meyering <jim <at> meyering.net> wrote:
>
>> I suspect this won't be the last word in this area, because it feels
>> like we should be able to adjust DFA's tables so that people using
>> such locales can retain DFA's efficiency without the bug in the
>> current implementation.
>
> Hi Jim,
>
> This is a bug in how dfa handles the period expression (".") in
> non-UTF8 locales.  dfa calculates the transitions for single-byte
> characters and for a multibyte character separately, then merges both
> results.  However, if the transition for single-byte characters goes
> back to the initial state, we should stop matching single-byte
> characters.

Nice work. Thank you.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 27 Dec 2016 12:24:05 GMT)
