GNU bug report logs - #24975
Matching issues with characters whose encoding ends in some other character
Report forwarded to bug-grep <at> gnu.org: bug#24975; Package grep. (Sun, 20 Nov 2016 21:51:02 GMT)
Acknowledgement sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 20 Nov 2016 21:51:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
$ locale charmap
GB18030
$ printf '\uC9\n' | grep '.*7' | hd
00000000 81 30 87 37 0a |.0.7.|
00000005
U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
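The byte layout in the hd output above can be checked without a GB18030 locale. What follows is a minimal Python sketch, assuming only the `gb18030` codec from Python's standard library; the variable names are illustrative and are not part of the report. It shows that the second and fourth bytes of U+00C9's encoding are the same bytes that encode the ASCII digits `0` and `7`.

```python
# Encode U+00C9 with Python's built-in gb18030 codec (no locale needed).
encoded = "\u00c9".encode("gb18030")

# Print the bytes in hex: 81 30 87 37, matching the hd output.
print(" ".join(f"{b:02x}" for b in encoded))

# The second byte is 0x30 (ASCII "0") and the last byte is 0x37 (ASCII "7").
second_byte = encoded[1:2]
last_byte = encoded[-1:]
print(second_byte == b"0", last_byte == b"7")
```

A matcher that looks at the input one byte at a time therefore sees a `0` and a `7` in a line that contains neither character.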
$ printf '\uC9\n' | grep '.*0'
fails.
$ printf '\uC9\n' | grep -o '.*7'
returns with a zero exit status but outputs nothing. It's as if
.*7 matched an empty string somewhere.
$ printf '\uC9\n' | grep '\(.*7\)\1'
fails.
So do:
grep 7
grep '7$'
grep '.7'
grep '[^x]*7'
printf 'x\uC9\n' | grep -E '.+7'
These match:
grep '.\{0,1\}7'
grep -E '.?7'
printf '\uC9x\n' | grep '.*7x' # still outputs nothing with -o
That's not confined to GB18030. You get similar issues with
BIG5-HKSCS, BIG5 or GBK.
$ locale charmap
BIG5-HKSCS
$ printf '\ue9\n' | grep '.*m' | hd
00000000 88 6d 0a |.m.|
00000003
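To see how widespread the precondition is, the sketch below counts, for several encodings, the characters in the Basic Multilingual Plane whose multibyte encoding ends in a byte below 0x80. It is a rough illustration in Python, assuming the standard-library codecs `gb18030`, `gbk`, `big5`, `big5hkscs` and `utf-8`; the function name `ascii_trail_count` is hypothetical, and Python's codec tables may differ in detail from the glibc charmaps used in the report.

```python
def ascii_trail_count(codec):
    """Count BMP characters whose multibyte encoding ends in an ASCII byte."""
    count = 0
    for cp in range(0x80, 0x10000):
        try:
            encoded = chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue  # not representable in this codec (or a lone surrogate)
        # Multibyte sequence whose final byte falls in the ASCII range.
        if len(encoded) > 1 and encoded[-1] < 0x80:
            count += 1
    return count


for codec in ("gb18030", "gbk", "big5", "big5hkscs", "utf-8"):
    print(codec, ascii_trail_count(codec))
```

The count for UTF-8 is zero, because every byte of a UTF-8 multibyte sequence is 0x80 or above; the other encodings allow trail bytes in the ASCII range, which is what makes a false match on `7` or `m` possible.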
Reproduced with 2.25, 2.26 and the current git head on Ubuntu 16.04 amd64.
--
Stephane
Information forwarded to bug-grep <at> gnu.org: bug#24975; Package grep. (Sun, 20 Nov 2016 23:00:02 GMT)
Message #8 received at submit <at> debbugs.gnu.org:
2016-11-20 21:50:28 +0000, Stephane Chazelas:
> $ locale charmap
> GB18030
> $ printf '\uC9\n' | grep '.*7' | hd
> 00000000 81 30 87 37 0a |.0.7.|
> 00000005
>
> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
[...]
> Reproduced with 2.25, 2.26 and the current git head on Ubuntu 16.04 amd64.
[...]
Same behaviour with 2.26 on Solaris 11.
--
Stephane
Information forwarded to bug-grep <at> gnu.org: bug#24975; Package grep. (Mon, 21 Nov 2016 05:54:02 GMT)
Message #11 received at 24975 <at> debbugs.gnu.org:
On Sun, Nov 20, 2016 at 2:59 PM, Stephane Chazelas
<stephane.chazelas <at> gmail.com> wrote:
> 2016-11-20 21:50:28 +0000, Stephane Chazelas:
>> $ locale charmap
>> GB18030
>> $ printf '\uC9\n' | grep '.*7' | hd
>> 00000000 81 30 87 37 0a |.0.7.|
>> 00000005
>>
>> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
> [...]
>> Reproduced with 2.25, 2.26 and the current git head on Ubuntu 16.04 amd64.
> [...]
>
> Same behaviour with 2.26 on Solaris 11.
Thank you for the report.
I can reproduce that error on Fedora 25 with this:
$ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep '.*7' k)|wc -c
5
I confirmed that the problem does not arise (i.e., no match, with exit
status of 1) when we force the use of glibc's regex matcher by
inserting a trivial back-reference:
$ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep -E '()\1.*7' k); echo $?
1
This bisected to v2.18-54-g3ef4c8e, but that commit was just the
messenger: it exposed the latent bug by making it so this case was no
longer handled by glibc's regexp matcher, but rather by grep's dfa.c.
Reply sent to Jim Meyering <jim <at> meyering.net>: You have taken responsibility. (Mon, 28 Nov 2016 00:00:02 GMT)
Notification sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>: bug acknowledged by developer. (Mon, 28 Nov 2016 00:00:02 GMT)
Message #16 received at 24975-done <at> debbugs.gnu.org:
On Sun, Nov 20, 2016 at 9:53 PM, Jim Meyering <jim <at> meyering.net> wrote:
> On Sun, Nov 20, 2016 at 2:59 PM, Stephane Chazelas
> <stephane.chazelas <at> gmail.com> wrote:
>> 2016-11-20 21:50:28 +0000, Stephane Chazelas:
>>> $ locale charmap
>>> GB18030
>>> $ printf '\uC9\n' | grep '.*7' | hd
>>> 00000000 81 30 87 37 0a |.0.7.|
>>> 00000005
>>>
>>> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
>> [...]
>>> Reproduced with 2.25, 2.26 and the current git head on Ubuntu 16.04 amd64.
>> [...]
>>
>> Same behaviour with 2.26 on Solaris 11.
>
> Thank you for the report.
> I can reproduce that error on Fedora 25 with this:
>
> $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep '.*7' k)|wc -c
> 5
>
> I confirmed that the problem does not arise (i.e., no match, with exit
> status of 1) when we force the use of glibc's regex matcher by
> inserting a trivial back-reference:
>
> $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep -E '()\1.*7' k); echo $?
> 1
>
> This bisected to v2.18-54-g3ef4c8e, but that commit was just the
> messenger: it exposed the latent bug by making it so this case was no
> longer handled by glibc's regexp matcher, but rather by grep's dfa.c.
I've fixed this by forcing any non-UTF8 multibyte locale to use the
regex matcher rather than the DFA matcher, with the following patches.
The gnulib/dfa patch makes that change, and the grep change updates to
latest gnulib, adds tests and NEWS.
I suspect this won't be the last word in this area, because it feels
like we should be able to adjust DFA's tables so that people using
such locales can retain DFA's efficiency without the bug in the
current implementation.
[gnulib-dfa-mb-non-UTF8-fix.diff (text/plain, attachment)]
[grep-fix-false-matches-mb-non-UTF8.diff (text/plain, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#24975; Package grep. (Mon, 28 Nov 2016 13:50:01 GMT)
Message #19 received at 24975 <at> debbugs.gnu.org:
Jim Meyering <jim <at> meyering.net> wrote:
> I suspect this won't be the last word in this area, because it feels
> like we should be able to adjust DFA's tables so that people using
> such locales can retain DFA's efficiency without the bug in the
> current implementation.
Hi Jim,
It is a bug in dfa's handling of the period expression in non-UTF8
locales. dfa calculates the transitions for single-byte characters and
for a multibyte character separately, and merges both results. However,
if the transition for single-byte characters goes back to an initial
state, we should stop matching single-byte characters.
Thanks,
Norihiro
[0001-dfa-avoid-match-middle-in-multibyte-character.patch (text/plain, attachment)]
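The failure mode described in the message above, a match that lands in the middle of a multibyte character, can be imitated outside grep. The following Python sketch is not the dfa.c algorithm and not the attached patch; it only contrasts a byte-oriented regex with a character-oriented one on the reporter's input, assuming Python's standard `re` module and `gb18030` codec. The variable names are illustrative.

```python
import re

line = "\u00c9".encode("gb18030")  # bytes 81 30 87 37

# Byte-oriented matching: every byte counts as a "character", so the
# trailing 0x37 byte is taken for an ASCII "7" and the pattern matches.
byte_match = re.search(rb".*7", line)

# Character-oriented matching: decode first, so the four bytes form a
# single character (U+00C9) and there is no "7" to find.
char_match = re.search(r".*7", line.decode("gb18030"))

print(byte_match is not None)  # True: the false match
print(char_match is None)      # True: no match, the expected result
```

The second search corresponds to the behaviour the reporter expected from `grep '.*7'` in a GB18030 locale: no match, exit status 1.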
Information forwarded to bug-grep <at> gnu.org: bug#24975; Package grep. (Mon, 28 Nov 2016 16:49:01 GMT)
Message #25 received at 24975 <at> debbugs.gnu.org:
Thanks for that DFA fix, which should be much better than the previous
workaround. I installed it into gnulib and installed the attached patch
into grep.
[0001-build-update-gnulib-submodule-to-latest.patch (application/x-patch, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#24975; Package grep. (Mon, 28 Nov 2016 17:13:01 GMT)
Message #28 received at 24975 <at> debbugs.gnu.org:
On Mon, Nov 28, 2016 at 5:49 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> Jim Meyering <jim <at> meyering.net> wrote:
>
>> I suspect this won't be the last word in this area, because it feels
>> like we should be able to adjust DFA's tables so that people using
>> such locales can retain DFA's efficiency without the bug in the
>> current implementation.
>
> Hi Jim,
>
> It is a bug in dfa's handling of the period expression in non-UTF8
> locales. dfa calculates the transitions for single-byte characters and
> for a multibyte character separately, and merges both results. However,
> if the transition for single-byte characters goes back to an initial
> state, we should stop matching single-byte characters.
Nice work. Thank you.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 27 Dec 2016 12:24:05 GMT)