GNU bug report logs - #16919
[PATCH] fix mismatch between dfa and regex for treatment of titlecase

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sun, 2 Mar 2014 00:34:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16919 in the body.
You can then email your comments to 16919 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#16919; Package grep. (Sun, 02 Mar 2014 00:34:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 02 Mar 2014 00:34:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: submit <at> debbugs.gnu.org
Subject: [PATCH] fix mismatch between dfa and regex for treatment of titlecase
Date: Sun, 02 Mar 2014 09:32:38 +0900
[Message part 1 (text/plain, inline)]
Package: grep
Tags: patch

I found difference between dfa and regex (glibc) treatment of titlecase.
In case-insensitive matching in UTF8 locale,  U+01C7 (LATIN CAPITAL LETTER
LJ) matches with U+01C8 (LATIN CAPITAL LETTER L WITH SMALL LETTER J on
regex, but it doesn't on dfa.

The patch fixes mismatch between dfa and regex for treatment of titlecase.

Norihiro
[patch.txt (application/octet-stream, attachment)]
[tests.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16919; Package grep. (Sun, 02 Mar 2014 20:38:02 GMT) Full text and rfc822 format available.

Message #8 received at 16919 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 16919 <at> debbugs.gnu.org
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Sun, 02 Mar 2014 12:37:09 -0800
Norihiro Tanaka wrote:

> I found difference between dfa and regex (glibc) treatment of titlecase.

Thanks for bringing this up, but I'm afraid that it appears that regex 
is buggy in this area.  The regex code does the match by converting 
pattern and text to uppercase, and then trying a match with uppercase. 
But this is incorrect for an example like the following, which uses 
'\(\)\1' to force using the regex code:

echo 'ς' | grep -i '\(\)\1σ'

This should output nothing, because terminal sigma is not the same as 
lowercase sigma even when case is ignored.  But since the uppercase 
counterpart of both characters is capital sigma, grep incorrectly 
outputs the terminal sigma.  The dfa code gets it right.

POSIX is muddy in this area, unfortunately, but I don't see any 
interpretation whereby ς and σ should match when case is ignored.




Information forwarded to bug-grep <at> gnu.org:
bug#16919; Package grep. (Mon, 03 Mar 2014 05:06:02 GMT) Full text and rfc822 format available.

Message #11 received at 16919 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 16919 <at> debbugs.gnu.org
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Sun, 02 Mar 2014 21:05:10 -0800
Paul Eggert wrote:
> POSIX is muddy in this area, unfortunately, but I don't see any
> interpretation whereby ς and σ should match when case is ignored.

On second thought, I may have been too strict here.  I suppose one could 
interpret POSIX to say that since 'σ' == tolower (toupper ('ς')), that 
it should be OK for the pattern 'σ' to match the string 'ς' when 
ignoring case, even though the characters differ and are both lowercase.

Still, there's a problem that the dfa code behaves differently than the 
regex code.  One way to make them behave the same would be for the dfa 
code to set *backref to 1 when ignoring case and when an alphabetic 
character in the pattern is matched in a non-POSIX locale.  This would 
have performance consequences, though.  Perhaps we can think of a better 
way.




Information forwarded to bug-grep <at> gnu.org:
bug#16919; Package grep. (Mon, 03 Mar 2014 23:58:02 GMT) Full text and rfc822 format available.

Message #14 received at 16919 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16919 <at> debbugs.gnu.org
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Tue, 04 Mar 2014 08:57:03 +0900
Paul Eggert wrote:
> On second thought, I may have been too strict here.  I suppose one
> could interpret POSIX to say that since 'σ' == tolower (toupper ('?')),
> that it should be OK for the pattern 'σ' to match the string '?' when
> ignoring case, even though the characters differ and are both lowercase.

I don't think such.  Final sigma is neither uppwercase nor lowercase of
sigma, as you have said firstly.  In addition, it isn't even titlecase
of sigma.

Even if that behavior of regex is right, it won't work correctly for a
character that only one lowercase is assigned to two uppercases. (Though,
I don't know such a character.)

Norihiro





Information forwarded to bug-grep <at> gnu.org:
bug#16919; Package grep. (Tue, 04 Mar 2014 07:02:01 GMT) Full text and rfc822 format available.

Message #17 received at 16919 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16919 <at> debbugs.gnu.org
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Mon, 03 Mar 2014 23:01:35 -0800
Norihiro Tanaka wrote:
> Final sigma is neither uppwercase nor lowercase of
> sigma, as you have said firstly.  In addition, it isn't even titlecase
> of sigma.

That's what I thought, too.  Certainly an implementation should be free 
to say that stigma and sigma do not match when ignoring case.  That is 
how Solaris 11 /usr/xpg4/bin/grep -i behaves, for example.  But it's not 
clear that POSIX requires such an implementation.

> Even if that behavior of regex is right, it won't work correctly for a
> character that only one lowercase is assigned to two uppercases. (Though,
> I don't know such a character.)

Here's an example: 'İ' (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE) 
along with ASCII 'i' and 'I'.  On GNU and Solaris platforms in the 
en_US.UTF-8 locale, towlower (L'İ') == L'i', so 'i' has two uppercase 
counterparts.  Solaris 11 /usr/xpg4/bin/grep -i says that 'i' and 'I' 
match each other but do not match 'İ', and reports a syntax error when 
attempting to match 'İ' to itself:

$ echo 'İ' | /usr/xpg4/bin/grep -i 'İ'
grep: input file "(standard input)": line 1: syntax error

so Solaris 'grep' is clearly busted here.  The GNU regex code says that 
'i' and 'I' match only each other, and that 'İ' matches only itself. 
But the dfa code in the current trunk behaves differently: it says that 
the patterns 'i' and 'I' match each other, whereas the pattern 'İ' 
matches itself and 'i'.  So we have inconsistent behavior with the grep 
current master 5aa1413:

$ printf 'İ\ni\nI\n' | grep -i 'İ' # Use the DFA code.
İ
i
$ printf 'İ\ni\nI\n' | grep -i '\(\)\1İ' # Use the regex code.
İ

Here the DFA behavior is more plausible; but the DFA code is supposed to 
accelerate the regex code without changing behavior, so there is a problem.

After writing the above, I discovered a web page that talks about this 
issue, written about 10 years ago and committed by Charles Levert:

http://www.gnu.org/software/grep/devel.html

Look for the string "POSIX and --ignore-case".  The regex code takes the 
"uc(input_wchar) == uc(pattern_wchar)" approach, whereas the DFA code 
takes an approach not mentioned in that web page, namely "input_wchar == 
pattern_wchar || input_wchar == lc(pattern_wchar) || input_wchar == 
uc(pattern_wchar)".

I'm inclined to fix the DFA code so that it behaves like the regex code 
even if the regex code isn't that nice, because arguably the regex code 
does conform to POSIX and anyway its behavior is longstanding and I 
doubt whether people care all that much about the details so long as the 
two are consistent.




Information forwarded to bug-grep <at> gnu.org:
bug#16919; Package grep. (Tue, 04 Mar 2014 19:38:03 GMT) Full text and rfc822 format available.

Message #20 received at 16919 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 16919 <at> debbugs.gnu.org
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Tue, 4 Mar 2014 11:36:57 -0800
On Mon, Mar 3, 2014 at 11:01 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
...
> I'm inclined to fix the DFA code so that it behaves like the regex code even
> if the regex code isn't that nice, because arguably the regex code does
> conform to POSIX and anyway its behavior is longstanding and I doubt whether
> people care all that much about the details so long as the two are
> consistent.

That is my preference, too.  Thanks for digging into this.




Information forwarded to bug-grep <at> gnu.org:
bug#16919; Package grep. (Wed, 05 Mar 2014 14:12:02 GMT) Full text and rfc822 format available.

Message #23 received at 16919 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro TANAKA <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16919 <at> debbugs.gnu.org
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Wed, 05 Mar 2014 23:11:40 +0900
[Message part 1 (text/plain, inline)]
Hi Paul,

Thanks for a lot of investigation.  I have understood that we cannot
generally tell whether DFA's or regex's behavior is right.

I have tested the behavior of sereral regex engines.  What's interesting
is that most of results differ from others.  And nobody will understand
which is right.

--
GNU grep (DFA):

$ env LANG=en_US.utf8 ./test.sh "src/grep -i" 2>/dev/null | nl -ba
     1   c7 87 | c7 89
     2   c7 87 | c7 88 | c7 89
     3   c7 87 | c7 89
     4   49 | 69
     5   49 | 69
     6   69 | c4 b0
     7   49 | c4 b1

GNU grep (regex):

$ env LANG=en_US.utf8 ./test.sh "src/grep -i" '\(\)\1' 2>/dev/null | nl -ba
     1   c7 87 | c7 88 | c7 89
     2   c7 87 | c7 88 | c7 89
     3   c7 87 | c7 88 | c7 89
     4   49 | 69 | c4 b1
     5   49 | 69 | c4 b1
     6   c4 b0
     7   49 | 69 | c4 b1

pcregrep:

$ env LANG=en_US.utf8 ./test.sh "pcregrep -iu" 2>/dev/null | nl -ba
     1   c7 87 | c7 88 | c7 89
     2   c7 87 | c7 88 | c7 89
     3   c7 87 | c7 88 | c7 89
     4   49 | 69
     5   49 | 69
     6   c4 b0
     7   c4 b1

Solaris grep (xpg4):

$ env LANG=ja_JP.UTF-8 ./test.sh  "/usr/xpg4/bin/grep -i" 2>/dev/null | nl -ba
     1           c7 87 | c7 89
     2           c7 88
     3           c7 87 | c7 89
     4           49 | 69
     5           49 | 69
     6           error
     7           error

HP-UX grep:

$ env LANG=en_US.utf8 ./test.sh  "/bin/grep -i" 2>/dev/null | nl -ba
     1              c7    87     |    c7    88     |    c7    89
     2              c7    87     |    c7    88     |    c7    89
     3              c7    87     |    c7    88     |    c7    89
     4              49     |    69
     5              49     |    69
     6              c4    b0
     7              c4    b1
--

Norihiro
[test.sh (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16919; Package grep. (Wed, 05 Mar 2014 15:12:02 GMT) Full text and rfc822 format available.

Message #26 received at 16919 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Norihiro TANAKA <noritnk <at> kcn.ne.jp>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16919 <at> debbugs.gnu.org
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Thu, 06 Mar 2014 00:11:18 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> And nobody will understand which is right.

However, I still believe that upper or lower case of a character should
also match title case, because I think that title case is extension of
cases (such as upper or lower), and furthermore they also matches title
case (though it may be by chance) in regex.

Norihiro





Information forwarded to bug-grep <at> gnu.org:
bug#16919; Package grep. (Wed, 05 Mar 2014 18:51:03 GMT) Full text and rfc822 format available.

Message #29 received at 16919 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16919 <at> debbugs.gnu.org
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Wed, 05 Mar 2014 10:50:54 -0800
On 03/05/2014 07:11 AM, Norihiro Tanaka wrote:
> I still believe that upper or lower case of a character should
> also match title case

The (soon-to-be-fixed) gnulib regex code agrees with you, assuming that 
towupper (X) agrees for all three values of X, because it uses (towupper 
(input) == towupper (pattern)). However, the most-plausible reading of 
POSIX does not agree with you, as it would require (input == pattern || 
towlower (input) == pattern || towupper (input) == pattern), which means 
a titlecase pattern will match only itself.

It seems pretty clear to me that the most-plausible reading of POSIX is 
buggy, for this reason.  No wonder so many implementations fail to 
conform to it.

I thought of a different way where gnulib/glibc regex does not conform 
to POSIX, and here there doesn't seem to be any ambiguity about it.  In 
the POSIX locale when ignoring case, the pattern '[Z-a]' matches the 
data 'Z', 'z', 'A', 'a', and the nonalphabetic characters like '^' that 
collate between 'Z' and 'a'.  But the glibc regex code rejects that 
pattern entirely.  Conversely, in the same situation the glibc regex 
code says '[A-z]' matches only alphabetic characters, whereas POSIX says 
it should also match the nonalphabetic characters like '^' that collate 
between 'Z' and 'a'.  It appears that nobody cares, as this 
incompatibility has been present for years and I don't recall anyone 
complaining.  Though it is weird that this means "grep PAT" can match 
some lines that "grep -i PAT" doesn't.

Here POSIX is not merely ambiguous, it's clearly disagreeing with common 
practice.  It's not clear whether the bug is in POSIX or in the 
implementation.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sat, 08 Mar 2014 02:34:02 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Sat, 08 Mar 2014 02:34:02 GMT) Full text and rfc822 format available.

Message #34 received at 16919-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16919-done <at> debbugs.gnu.org, Aharon Robbins <arnold <at> skeeve.com>,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Fri, 07 Mar 2014 18:33:50 -0800
[Message part 1 (text/plain, inline)]
Jim Meyering wrote:
> That is my preference, too.

OK, thanks, I installed the attached patch, which should do just that, 
and I'm marking this bug as done.  I'll CC: this to Aharon since it 
might affect Gawk.

The code still needs cleanup but so far I'm just trying to fix bugs.
[0001-grep-fix-case-fold-mismatches-between-DFA-and-regex.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16919; Package grep. (Sat, 08 Mar 2014 03:24:02 GMT) Full text and rfc822 format available.

Message #37 received at 16919-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16919-done <at> debbugs.gnu.org, Aharon Robbins <arnold <at> skeeve.com>,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Fri, 7 Mar 2014 19:22:56 -0800
On Fri, Mar 7, 2014 at 6:33 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Jim Meyering wrote:
>>
>> That is my preference, too.
>
>
> OK, thanks, I installed the attached patch, which should do just that, and
> I'm marking this bug as done.

Impressive.  After a first pass, I'd say it's perfect.
Thanks a lot for all that fine work!




Information forwarded to bug-grep <at> gnu.org:
bug#16919; Package grep. (Sat, 08 Mar 2014 03:44:02 GMT) Full text and rfc822 format available.

Message #40 received at 16919-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16919-done <at> debbugs.gnu.org, Aharon Robbins <arnold <at> skeeve.com>,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Fri, 07 Mar 2014 19:42:56 -0800
[Message part 1 (text/plain, inline)]
I noticed that fgrep disagreed with the regex code as well, so while I'm 
in the neighborhood I applied the attached patch.
[0001-fgrep-fix-case-fold-incompatibility-with-plain-grep.patch (text/plain, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 05 Apr 2014 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 10 years and 29 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.