GNU bug report logs - #16867
[bug #37600] grep -w cuts words on non-ascii
Reported by: Jim Meyering <jim <at> meyering.net>
Date: Mon, 24 Feb 2014 16:54:01 UTC
Severity: normal
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16867 in the body.
You can then email your comments to 16867 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org: bug#16867; Package grep. (Mon, 24 Feb 2014 16:54:01 GMT)
Acknowledgement sent to Jim Meyering <jim <at> meyering.net>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 24 Feb 2014 16:54:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Re the savannah bug report, http://savannah.gnu.org/bugs/?37600
[Let's continue on the mailing list -- now our preferred medium]
On Mon, Feb 24, 2014 at 6:57 AM, Stephane Chazelas wrote:
[...]
Thanks for the report.
I confirm it is still a problem with the latest, grep-2.18:
[Note that there's nothing special about the following multi-byte
character or about the locale. ]
$ printf 'x\nx\xc3\xa5\n' |LC_ALL=en_US.utf8 grep --color 'x\b'
x
xå
This is pretty serious:
$ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p
père
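One way a false word boundary like this can arise is when the neighbour test looks at single bytes instead of decoded characters. The Python sketch below only illustrates that idea under that assumption; it is not grep's code, and the helper names `is_whole_word_bytes` and `is_whole_word_chars` are made up for this example.

```python
# Byte values that count as word constituents in an ASCII-only view.
ASCII_WORD = set(b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_")


def is_whole_word_bytes(line: bytes, start: int, end: int) -> bool:
    """Byte-wise check: neighbours of line[start:end] must not be ASCII word bytes."""
    before_ok = start == 0 or line[start - 1] not in ASCII_WORD
    after_ok = end == len(line) or line[end] not in ASCII_WORD
    return before_ok and after_ok


def is_word_char(ch: str) -> bool:
    """Character-wise notion of a word constituent (Unicode alphanumerics and '_')."""
    return ch.isalnum() or ch == "_"


def is_whole_word_chars(line: str, start: int, end: int) -> bool:
    """Character-wise check: neighbours of line[start:end] must not be word characters."""
    before_ok = start == 0 or not is_word_char(line[start - 1])
    after_ok = end == len(line) or not is_word_char(line[end])
    return before_ok and after_ok


raw = b"p\xc3\xa8re"          # UTF-8 encoding of "père", as in the printf above
text = raw.decode("utf-8")

# The byte after "p" is 0xC3, which is not an ASCII word byte.
byte_result = is_whole_word_bytes(raw, 0, 1)
# The character after "p" is "è", which is a letter.
char_result = is_whole_word_chars(text, 0, 1)

print(byte_result, char_result)
```

The byte-wise check accepts `p` as a whole word because it sees the lead byte of `è` rather than the letter itself, while the character-wise check rejects it.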
Information forwarded to bug-grep <at> gnu.org: bug#16867; Package grep. (Mon, 24 Feb 2014 21:39:02 GMT)
Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):
2014-02-24 08:53:17 -0800, Jim Meyering:
[...]
> This is pretty serious:
>
> $ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p
> père
It gets more complicated with combining characters:
$ printf 'pe\314\200re\n' | grep -w pe
père
You can't expect \w to match U+0300 alone. You can't expect \w to
match two characters (e with U+0300) either.
It feels wrong that grep finds a word boundary inside a single
grapheme though (between e and its grave accent).
I suppose one way to address the problem would be an option that
turns anything that matches a single character (., [xy], \w,
\s...) into something that matches a grapheme, or if not maybe a
"combining character sequence". See
http://www.unicode.org/faq/char_combmark.html for more details.
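As a rough illustration of the combining-character case, here is a Python sketch in which a combining mark either counts as a non-word character or extends the word it follows. This is an assumption-laden toy, not a proposal for grep's behaviour: `has_whole_word`, `is_word_char`, and the `marks_extend` switch are hypothetical names, and the check does not implement full grapheme cluster segmentation (for example, it does not look at what a mark is attached to).

```python
import unicodedata


def is_word_char(ch: str, marks_extend: bool) -> bool:
    """Word constituent test; optionally treat combining marks as constituents."""
    if ch.isalnum() or ch == "_":
        return True
    # Unicode general categories Mn, Mc and Me are combining or enclosing marks.
    return marks_extend and unicodedata.category(ch).startswith("M")


def has_whole_word(text: str, word: str, marks_extend: bool) -> bool:
    """Return True if `word` occurs in `text` with non-word characters on both sides."""
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        before_ok = start == 0 or not is_word_char(text[start - 1], marks_extend)
        after_ok = end == len(text) or not is_word_char(text[end], marks_extend)
        if before_ok and after_ok:
            return True
        start = text.find(word, start + 1)
    return False


# 'e' followed by U+0300 COMBINING GRAVE ACCENT (bytes \314\200 in UTF-8).
decomposed = "pe\u0300re"

naive = has_whole_word(decomposed, "pe", marks_extend=False)
mark_aware = has_whole_word(decomposed, "pe", marks_extend=True)

# Normalizing to NFC folds the pair into the single code point U+00E8.
composed = unicodedata.normalize("NFC", decomposed)

print(naive, mark_aware, composed == "p\u00e8re")
```

With `marks_extend=False` the sketch finds `pe` as a word, splitting the grapheme; with `marks_extend=True` it does not. NFC normalization helps only when a precomposed code point exists.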
That's not a grep-only problem though.
I suppose it gets even more complicated with non-latin alphabets
or non-alphabetic languages.
\w, -w, \b, \<, \> are not "standard" features, so GNU may
decide what they want to do with them. Restricting them to ASCII
a-zA-Z0-9_ (which are not even the word constituents of English, but
appear to match C identifiers, which is probably what they were
designed for in the first place) is as good a choice as any I
would say.
Changing it might break things. Adding other ways to match
Unicode character properties (like PCRE's \p{...}) may be a
better approach.
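To make the two options concrete, the Python sketch below contrasts an ASCII-only `\w` with a Unicode-aware one, and uses `unicodedata.category` as a rough stand-in for a property test such as PCRE's `\p{L}`. It says nothing about how grep would implement either; Python's standard `re` module has no `\p{...}` syntax, and `is_letter` is a hypothetical helper written for this example.

```python
import re
import unicodedata

text = "p\u00e8re"  # "père" with a precomposed è

# ASCII-only \w (a-zA-Z0-9_): the accented letter splits the word in two.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware \w (the default for str patterns): one word.
unicode_words = re.findall(r"\w+", text)


def is_letter(ch: str) -> bool:
    # Approximates a "letter" property: general category Lu, Ll, Lt, Lm or Lo.
    return unicodedata.category(ch).startswith("L")


letters_only = all(is_letter(ch) for ch in text)

print(ascii_words, unicode_words, letters_only)
```

The ASCII-only pattern is predictable for C identifiers but splits `père` into `p` and `re`, which is the trade-off described above.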
--
Stephane
Information forwarded to bug-grep <at> gnu.org: bug#16867; Package grep. (Tue, 06 May 2014 07:03:02 GMT)
Message #11 received at 16867 <at> debbugs.gnu.org (full text, mbox):
> This is pretty serious:
>
> $ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p
> père
I installed a fix for this just now. This doesn't fix all of Bug#16867,
just this particular issue.
Here's the fix:
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=94555dd281cdcd7530bc2c4466f0bbfd8d47d5c0
Information forwarded to bug-grep <at> gnu.org: bug#16867; Package grep. (Wed, 07 May 2014 01:16:02 GMT)
Message #14 received at 16867 <at> debbugs.gnu.org (full text, mbox):
On Tue, May 6, 2014 at 12:02 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>> This is pretty serious:
>>
>> $ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p
>> père
>
> I installed a fix for this just now. This doesn't fix all of Bug#16867,
> just this particular issue.
>
> Here's the fix:
>
> http://git.savannah.gnu.org/cgit/grep.git/commit/?id=94555dd281cdcd7530bc2c4466f0bbfd8d47d5c0
It's a pleasure to read yet another bug-fix patch that also
makes the code cleaner.
Thank you.
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Sat, 10 May 2014 23:31:02 GMT)
Notification sent to Jim Meyering <jim <at> meyering.net>: bug acknowledged by developer. (Sat, 10 May 2014 23:31:03 GMT)
Message #19 received at 16867-done <at> debbugs.gnu.org (full text, mbox):
I've installed the attached patch, which fixes the bug for me, and am
marking this bug report as done.
[0001-dfa-fix-bug-with-etc-in-multibyte-locales.patch (text/plain, attachment)]
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 08 Jun 2014 11:24:04 GMT)
This bug report was last modified 9 years and 324 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.