GNU bug report logs - #16867
[bug #37600] grep -w cuts words on non-ascii


Package: grep

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Mon, 24 Feb 2014 16:54:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Report forwarded to bug-grep <at> gnu.org:
bug#16867; Package grep. (Mon, 24 Feb 2014 16:54:01 GMT)

Acknowledgement sent to Jim Meyering <jim <at> meyering.net>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 24 Feb 2014 16:54:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: bug-grep <at> gnu.org
Cc: kasal <at> ucw.cz, Flammie Pirinen <flammie <at> iki.fi>,
 Paul Eggert <eggert <at> cs.ucla.edu>,
 Stephane Chazelas <stephane.chazelas <at> gmail.com>
Subject: Re: [bug #37600] grep -w cuts words on non-ascii
Date: Mon, 24 Feb 2014 08:53:17 -0800

Re the savannah bug report, http://savannah.gnu.org/bugs/?37600
[Let's continue on the mailing list -- now our preferred medium]

On Mon, Feb 24, 2014 at 6:57 AM, Stephane Chazelas wrote:
[...]

Thanks for the report.
I confirm it is still a problem with the latest, grep-2.18:
[Note that there's nothing special about the following multi-byte
character or about the locale.]

    $ printf 'x\nx\xc3\xa5\n' |LC_ALL=en_US.utf8 grep --color 'x\b'
    x
    xå

This is pretty serious:

    $ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p
    père
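
[Editorial illustration, not part of the original message. The thread
does not state the root cause, so the following is only a sketch of one
way such a result can arise: testing word constituents byte by byte
instead of character by character. Python's re module is used as a
stand-in here; it is not grep's matcher, and the variable names are
hypothetical.]

```python
import re

line = "père"
raw = line.encode("utf-8")  # same bytes as printf 'p\xc3\xa8re'

# Byte-oriented matching: bytes patterns in Python use ASCII-only word
# characters, so the lead byte 0xC3 of "è" counts as a non-word byte and
# a word boundary appears right after "p".
byte_hit = re.search(rb"\bp\b", raw) is not None

# Character-oriented matching: "è" is a word character, so there is no
# boundary between "p" and "è" and the lone "p" is not a whole word.
char_hit = re.search(r"\bp\b", line) is not None

print("byte-wise match:", byte_hit)
print("character-wise match:", char_hit)
```

The byte-wise search reports a match and the character-wise search does
not, which mirrors the difference between the reported behaviour and the
expected one.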




Information forwarded to bug-grep <at> gnu.org:
bug#16867; Package grep. (Mon, 24 Feb 2014 21:39:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: kasal <at> ucw.cz, Flammie Pirinen <flammie <at> iki.fi>, bug-grep <at> gnu.org,
 Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: [bug #37600] grep -w cuts words on non-ascii
Date: Mon, 24 Feb 2014 21:38:11 +0000

2014-02-24 08:53:17 -0800, Jim Meyering:
[...]
> This is pretty serious:
> 
>     $ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p
>     père

It gets more complicated with combining characters:

    $ printf 'pe\314\200re\n' | grep -w pe
    père

You can't expect \w to match U+0300 alone. You can't expect \w to
match two characters (e with U+0300) either.

It feels wrong that grep finds a word boundary inside a single
grapheme though (between e and its grave accent).
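
[Editorial illustration, not part of the original message. The same
class of behaviour can be observed in Python's re module, used here only
as a stand-in for grep: U+0300 is a nonspacing mark rather than a
letter, so it is not a word character and a boundary falls between "e"
and its accent. Variable names are hypothetical.]

```python
import re
import unicodedata

# "père" spelled with a decomposed accent: e followed by U+0300
# COMBINING GRAVE ACCENT (the bytes \314\200 in UTF-8).
decomposed = "pe\u0300re"

accent_category = unicodedata.category("\u0300")   # general category of the mark
accent_is_word = re.fullmatch(r"\w", "\u0300") is not None

# Because the mark is not a word character, "pe" looks like a whole word.
boundary_hit = re.search(r"\bpe\b", decomposed) is not None

print("category of U+0300:", accent_category)
print("U+0300 matches \\w:", accent_is_word)
print("'pe' found as a whole word:", boundary_hit)
```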

I suppose one way to address the problem would be an option that
turns anything that matches a single character (., [xy], \w,
\s...) into something that matches a grapheme, or if not maybe a
"combining character sequence".

See http://www.unicode.org/faq/char_combmark.html for more details.
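
[Editorial illustration, not part of the original message. A minimal
sketch of what matching on combining character sequences could look
like, assuming a simplified rule: a base character plus every following
character with a non-zero canonical combining class forms one unit. This
is not full grapheme segmentation and not anything grep implements; the
function names combining_sequences, is_word_unit and whole_word_match
are hypothetical.]

```python
import unicodedata


def combining_sequences(text):
    """Group each base character with the combining marks that follow it."""
    units = []
    for ch in text:
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch      # attach the mark to the previous base
        else:
            units.append(ch)     # start a new unit
    return units


def is_word_unit(unit):
    """Classify a unit by its base character only (simplifying assumption)."""
    base = unit[0]
    return base.isalnum() or base == "_"


def whole_word_match(word, text):
    """Return True if word occurs in text delimited by non-word units."""
    target = combining_sequences(word)
    units = combining_sequences(text)
    n = len(target)
    if n == 0:
        return False
    for i in range(len(units) - n + 1):
        if units[i:i + n] != target:
            continue
        before_ok = i == 0 or not is_word_unit(units[i - 1])
        after_ok = i + n == len(units) or not is_word_unit(units[i + n])
        if before_ok and after_ok:
            return True
    return False


print(combining_sequences("pe\u0300re"))
print(whole_word_match("pe", "pe\u0300re"))
print(whole_word_match("pe\u0300re", "le pe\u0300re ici"))
```

Comparing whole units means "pe" no longer matches inside the decomposed
"père", because the second unit is "e" plus its accent rather than a
bare "e".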

That's not a grep-only problem though.

I suppose it gets even more complicated with non-Latin alphabets
or non-alphabetic languages.

\w, -w, \b, \<, \> are not "standard" features, so GNU may
decide what it wants to do with them. Restricting them to ASCII
a-zA-Z0-9_ (which does not even cover the word constituents of
English, but appears to match C identifiers, which is probably what
they were designed for in the first place) is as good a choice as
any, I would say.

Changing it might break things. Adding other ways to match
Unicode character properties (like PCRE's \p{...}) may be a
better approach.
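
[Editorial illustration, not part of the original message. A small
sketch contrasting an ASCII-only \w with a Unicode-aware one, plus a
rough stand-in for a property test. Python's re module does not support
\p{...}, so the helper is_letter_or_mark below is a hypothetical
approximation built on unicodedata, not PCRE's or grep's behaviour.]

```python
import re
import unicodedata

text = "père"

# ASCII-only word characters (a-zA-Z0-9_): "è" splits the word in two.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware word characters: the whole word is one run.
unicode_words = re.findall(r"\w+", text)


def is_letter_or_mark(ch):
    # Approximates a test like \p{L} or \p{M} by looking at the first
    # letter of the Unicode general category (L* = letter, M* = mark).
    return unicodedata.category(ch)[0] in ("L", "M")


# With marks accepted, the decomposed spelling is a single run as well.
decomposed_ok = all(is_letter_or_mark(ch) for ch in "pe\u0300re")

print(ascii_words)
print(unicode_words)
print(decomposed_ok)
```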

-- 
Stephane




Information forwarded to bug-grep <at> gnu.org:
bug#16867; Package grep. (Tue, 06 May 2014 07:03:02 GMT)

Message #11 received at 16867 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 16867 <at> debbugs.gnu.org
Subject: Re: [bug #37600] grep -w cuts words on non-ascii
Date: Tue, 06 May 2014 00:02:23 -0700

> This is pretty serious:
>
>     $ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p
>     père

I installed a fix for this just now.  This doesn't fix all of Bug#16867, 
just this particular issue.

Here's the fix:

http://git.savannah.gnu.org/cgit/grep.git/commit/?id=94555dd281cdcd7530bc2c4466f0bbfd8d47d5c0




Information forwarded to bug-grep <at> gnu.org:
bug#16867; Package grep. (Wed, 07 May 2014 01:16:02 GMT)

Message #14 received at 16867 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16867 <at> debbugs.gnu.org
Subject: Re: bug#16867: [bug #37600] grep -w cuts words on non-ascii
Date: Tue, 6 May 2014 18:14:52 -0700

On Tue, May 6, 2014 at 12:02 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>> This is pretty serious:
>>
>>     $ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p
>>     père
>
> I installed a fix for this just now.  This doesn't fix all of Bug#16867,
> just this particular issue.
>
> Here's the fix:
>
> http://git.savannah.gnu.org/cgit/grep.git/commit/?id=94555dd281cdcd7530bc2c4466f0bbfd8d47d5c0

It's a pleasure to read yet another bug-fix patch that also
makes the code cleaner.

Thank you.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sat, 10 May 2014 23:31:02 GMT)

Notification sent to Jim Meyering <jim <at> meyering.net>:
bug acknowledged by developer. (Sat, 10 May 2014 23:31:03 GMT)

Message #19 received at 16867-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 16867-done <at> debbugs.gnu.org
Subject: Re: [bug #37600] grep -w cuts words on non-ascii
Date: Sat, 10 May 2014 16:29:48 -0700

I've installed the attached patch, which fixes the bug for me, and am 
marking this bug report as done.
[0001-dfa-fix-bug-with-etc-in-multibyte-locales.patch (text/plain, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 08 Jun 2014 11:24:04 GMT)
