GNU bug report logs -
#16919
[PATCH] fix mismatch between dfa and regex for treatment of titlecase
Previous Next
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Sun, 2 Mar 2014 00:34:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16919 in the body.
You can then email your comments to 16919 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#16919
; Package
grep
.
(Sun, 02 Mar 2014 00:34:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Sun, 02 Mar 2014 00:34:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Package: grep
Tags: patch
I found difference between dfa and regex (glibc) treatment of titlecase.
In case-insensitive matching in UTF8 locale, U+01C7 (LATIN CAPITAL LETTER
LJ) matches with U+01C8 (LATIN CAPITAL LETTER L WITH SMALL LETTER J on
regex, but it doesn't on dfa.
The patch fixes mismatch between dfa and regex for treatment of titlecase.
Norihiro
[patch.txt (application/octet-stream, attachment)]
[tests.txt (application/octet-stream, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16919
; Package
grep
.
(Sun, 02 Mar 2014 20:38:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 16919 <at> debbugs.gnu.org (full text, mbox):
Norihiro Tanaka wrote:
> I found difference between dfa and regex (glibc) treatment of titlecase.
Thanks for bringing this up, but I'm afraid that it appears that regex
is buggy in this area. The regex code does the match by converting
pattern and text to uppercase, and then trying a match with uppercase.
But this is incorrect for an example like the following, which uses
'\(\)\1' to force using the regex code:
echo 'ς' | grep -i '\(\)\1σ'
This should output nothing, because terminal sigma is not the same as
lowercase sigma even when case is ignored. But since the uppercase
counterpart of both characters is capital sigma, grep incorrectly
outputs the terminal sigma. The dfa code gets it right.
POSIX is muddy in this area, unfortunately, but I don't see any
interpretation whereby ς and σ should match when case is ignored.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16919
; Package
grep
.
(Mon, 03 Mar 2014 05:06:02 GMT)
Full text and
rfc822 format available.
Message #11 received at 16919 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> POSIX is muddy in this area, unfortunately, but I don't see any
> interpretation whereby ς and σ should match when case is ignored.
On second thought, I may have been too strict here. I suppose one could
interpret POSIX to say that since 'σ' == tolower (toupper ('ς')), that
it should be OK for the pattern 'σ' to match the string 'ς' when
ignoring case, even though the characters differ and are both lowercase.
Still, there's a problem that the dfa code behaves differently than the
regex code. One way to make them behave the same would be for the dfa
code to set *backref to 1 when ignoring case and when an alphabetic
character in the pattern is matched in a non-POSIX locale. This would
have performance consequences, though. Perhaps we can think of a better
way.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16919
; Package
grep
.
(Mon, 03 Mar 2014 23:58:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 16919 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> On second thought, I may have been too strict here. I suppose one
> could interpret POSIX to say that since 'σ' == tolower (toupper ('?')),
> that it should be OK for the pattern 'σ' to match the string '?' when
> ignoring case, even though the characters differ and are both lowercase.
I don't think such. Final sigma is neither uppwercase nor lowercase of
sigma, as you have said firstly. In addition, it isn't even titlecase
of sigma.
Even if that behavior of regex is right, it won't work correctly for a
character that only one lowercase is assigned to two uppercases. (Though,
I don't know such a character.)
Norihiro
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16919
; Package
grep
.
(Tue, 04 Mar 2014 07:02:01 GMT)
Full text and
rfc822 format available.
Message #17 received at 16919 <at> debbugs.gnu.org (full text, mbox):
Norihiro Tanaka wrote:
> Final sigma is neither uppwercase nor lowercase of
> sigma, as you have said firstly. In addition, it isn't even titlecase
> of sigma.
That's what I thought, too. Certainly an implementation should be free
to say that stigma and sigma do not match when ignoring case. That is
how Solaris 11 /usr/xpg4/bin/grep -i behaves, for example. But it's not
clear that POSIX requires such an implementation.
> Even if that behavior of regex is right, it won't work correctly for a
> character that only one lowercase is assigned to two uppercases. (Though,
> I don't know such a character.)
Here's an example: 'İ' (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE)
along with ASCII 'i' and 'I'. On GNU and Solaris platforms in the
en_US.UTF-8 locale, towlower (L'İ') == L'i', so 'i' has two uppercase
counterparts. Solaris 11 /usr/xpg4/bin/grep -i says that 'i' and 'I'
match each other but do not match 'İ', and reports a syntax error when
attempting to match 'İ' to itself:
$ echo 'İ' | /usr/xpg4/bin/grep -i 'İ'
grep: input file "(standard input)": line 1: syntax error
so Solaris 'grep' is clearly busted here. The GNU regex code says that
'i' and 'I' match only each other, and that 'İ' matches only itself.
But the dfa code in the current trunk behaves differently: it says that
the patterns 'i' and 'I' match each other, whereas the pattern 'İ'
matches itself and 'i'. So we have inconsistent behavior with the grep
current master 5aa1413:
$ printf 'İ\ni\nI\n' | grep -i 'İ' # Use the DFA code.
İ
i
$ printf 'İ\ni\nI\n' | grep -i '\(\)\1İ' # Use the regex code.
İ
Here the DFA behavior is more plausible; but the DFA code is supposed to
accelerate the regex code without changing behavior, so there is a problem.
After writing the above, I discovered a web page that talks about this
issue, written about 10 years ago and committed by Charles Levert:
http://www.gnu.org/software/grep/devel.html
Look for the string "POSIX and --ignore-case". The regex code takes the
"uc(input_wchar) == uc(pattern_wchar)" approach, whereas the DFA code
takes an approach not mentioned in that web page, namely "input_wchar ==
pattern_wchar || input_wchar == lc(pattern_wchar) || input_wchar ==
uc(pattern_wchar)".
I'm inclined to fix the DFA code so that it behaves like the regex code
even if the regex code isn't that nice, because arguably the regex code
does conform to POSIX and anyway its behavior is longstanding and I
doubt whether people care all that much about the details so long as the
two are consistent.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16919
; Package
grep
.
(Tue, 04 Mar 2014 19:38:03 GMT)
Full text and
rfc822 format available.
Message #20 received at 16919 <at> debbugs.gnu.org (full text, mbox):
On Mon, Mar 3, 2014 at 11:01 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
...
> I'm inclined to fix the DFA code so that it behaves like the regex code even
> if the regex code isn't that nice, because arguably the regex code does
> conform to POSIX and anyway its behavior is longstanding and I doubt whether
> people care all that much about the details so long as the two are
> consistent.
That is my preference, too. Thanks for digging into this.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16919
; Package
grep
.
(Wed, 05 Mar 2014 14:12:02 GMT)
Full text and
rfc822 format available.
Message #23 received at 16919 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Hi Paul,
Thanks for a lot of investigation. I have understood that we cannot
generally tell whether DFA's or regex's behavior is right.
I have tested the behavior of sereral regex engines. What's interesting
is that most of results differ from others. And nobody will understand
which is right.
--
GNU grep (DFA):
$ env LANG=en_US.utf8 ./test.sh "src/grep -i" 2>/dev/null | nl -ba
1 c7 87 | c7 89
2 c7 87 | c7 88 | c7 89
3 c7 87 | c7 89
4 49 | 69
5 49 | 69
6 69 | c4 b0
7 49 | c4 b1
GNU grep (regex):
$ env LANG=en_US.utf8 ./test.sh "src/grep -i" '\(\)\1' 2>/dev/null | nl -ba
1 c7 87 | c7 88 | c7 89
2 c7 87 | c7 88 | c7 89
3 c7 87 | c7 88 | c7 89
4 49 | 69 | c4 b1
5 49 | 69 | c4 b1
6 c4 b0
7 49 | 69 | c4 b1
pcregrep:
$ env LANG=en_US.utf8 ./test.sh "pcregrep -iu" 2>/dev/null | nl -ba
1 c7 87 | c7 88 | c7 89
2 c7 87 | c7 88 | c7 89
3 c7 87 | c7 88 | c7 89
4 49 | 69
5 49 | 69
6 c4 b0
7 c4 b1
Solaris grep (xpg4):
$ env LANG=ja_JP.UTF-8 ./test.sh "/usr/xpg4/bin/grep -i" 2>/dev/null | nl -ba
1 c7 87 | c7 89
2 c7 88
3 c7 87 | c7 89
4 49 | 69
5 49 | 69
6 error
7 error
HP-UX grep:
$ env LANG=en_US.utf8 ./test.sh "/bin/grep -i" 2>/dev/null | nl -ba
1 c7 87 | c7 88 | c7 89
2 c7 87 | c7 88 | c7 89
3 c7 87 | c7 88 | c7 89
4 49 | 69
5 49 | 69
6 c4 b0
7 c4 b1
--
Norihiro
[test.sh (application/octet-stream, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16919
; Package
grep
.
(Wed, 05 Mar 2014 15:12:02 GMT)
Full text and
rfc822 format available.
Message #26 received at 16919 <at> debbugs.gnu.org (full text, mbox):
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> And nobody will understand which is right.
However, I still believe that upper or lower case of a character should
also match title case, because I think that title case is extension of
cases (such as upper or lower), and furthermore they also matches title
case (though it may be by chance) in regex.
Norihiro
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16919
; Package
grep
.
(Wed, 05 Mar 2014 18:51:03 GMT)
Full text and
rfc822 format available.
Message #29 received at 16919 <at> debbugs.gnu.org (full text, mbox):
On 03/05/2014 07:11 AM, Norihiro Tanaka wrote:
> I still believe that upper or lower case of a character should
> also match title case
The (soon-to-be-fixed) gnulib regex code agrees with you, assuming that
towupper (X) agrees for all three values of X, because it uses (towupper
(input) == towupper (pattern)). However, the most-plausible reading of
POSIX does not agree with you, as it would require (input == pattern ||
towlower (input) == pattern || towupper (input) == pattern), which means
a titlecase pattern will match only itself.
It seems pretty clear to me that the most-plausible reading of POSIX is
buggy, for this reason. No wonder so many implementations fail to
conform to it.
I thought of a different way where gnulib/glibc regex does not conform
to POSIX, and here there doesn't seem to be any ambiguity about it. In
the POSIX locale when ignoring case, the pattern '[Z-a]' matches the
data 'Z', 'z', 'A', 'a', and the nonalphabetic characters like '^' that
collate between 'Z' and 'a'. But the glibc regex code rejects that
pattern entirely. Conversely, in the same situation the glibc regex
code says '[A-z]' matches only alphabetic characters, whereas POSIX says
it should also match the nonalphabetic characters like '^' that collate
between 'Z' and 'a'. It appears that nobody cares, as this
incompatibility has been present for years and I don't recall anyone
complaining. Though it is weird that this means "grep PAT" can match
some lines that "grep -i PAT" doesn't.
Here POSIX is not merely ambiguous, it's clearly disagreeing with common
practice. It's not clear whether the bug is in POSIX or in the
implementation.
Reply sent
to
Paul Eggert <eggert <at> cs.ucla.edu>
:
You have taken responsibility.
(Sat, 08 Mar 2014 02:34:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
bug acknowledged by developer.
(Sat, 08 Mar 2014 02:34:02 GMT)
Full text and
rfc822 format available.
Message #34 received at 16919-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Jim Meyering wrote:
> That is my preference, too.
OK, thanks, I installed the attached patch, which should do just that,
and I'm marking this bug as done. I'll CC: this to Aharon since it
might affect Gawk.
The code still needs cleanup but so far I'm just trying to fix bugs.
[0001-grep-fix-case-fold-mismatches-between-DFA-and-regex.patch (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16919
; Package
grep
.
(Sat, 08 Mar 2014 03:24:02 GMT)
Full text and
rfc822 format available.
Message #37 received at 16919-done <at> debbugs.gnu.org (full text, mbox):
On Fri, Mar 7, 2014 at 6:33 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Jim Meyering wrote:
>>
>> That is my preference, too.
>
>
> OK, thanks, I installed the attached patch, which should do just that, and
> I'm marking this bug as done.
Impressive. After a first pass, I'd say it's perfect.
Thanks a lot for all that fine work!
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16919
; Package
grep
.
(Sat, 08 Mar 2014 03:44:02 GMT)
Full text and
rfc822 format available.
Message #40 received at 16919-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
I noticed that fgrep disagreed with the regex code as well, so while I'm
in the neighborhood I applied the attached patch.
[0001-fgrep-fix-case-fold-incompatibility-with-plain-grep.patch (text/plain, attachment)]
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Sat, 05 Apr 2014 11:24:03 GMT)
Full text and
rfc822 format available.
This bug report was last modified 10 years and 29 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.