GNU bug report logs -
#17376
[PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA
Previous Next
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Wed, 30 Apr 2014 15:03:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 17376 in the body.
You can then email your comments to 17376 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#17376
; Package
grep
.
(Wed, 30 Apr 2014 15:03:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Wed, 30 Apr 2014 15:03:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Thare is different behaviour for a invalid sequence between KWset and DFA.
encode() { echo "$1" | tr ABC '\357\274\241'; }
encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
encode sABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
encode sABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
We expect that all of them are same results, but only 4th returns 1 row.
This patch fixes it, changes all into 1 row returned.
Norihiro
[patch.txt (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#17376
; Package
grep
.
(Wed, 30 Apr 2014 19:05:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 17376 <at> debbugs.gnu.org (full text, mbox):
On 04/30/2014 08:02 AM, Norihiro Tanaka wrote:
> Thare is different behaviour for a invalid sequence between KWset and DFA.
>
> encode() { echo "$1" | tr ABC '\357\274\241'; }
> encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
> encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
> encode sABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
> encode sABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
>
> We expect that all of them are same results, but only 4th returns 1 row.
Sorry, but I am not observing this behavior. With grep 2.18, none of
the commands output anything. The same is true for the git master.
If the pattern or data have encoding errors, POSIX says grep can do
whatever it likes. As I understand it, in grep 2.18 and the git master,
an encoding-error byte in a pattern matches only the same encoding-error
byte in the data. Does this bug report's patch change behavior, so that
an encoding-error byte in a pattern can match part of a valid
multibyte-character in the data? If so, it's not clear to me why the
proposed behavior change is helpful -- as a user, I'm not sure I'd want
such a match to work. If not, then could you please explain a bit more
what's going on?
More generally, I don't think users care about encoding-error bytes in
patterns. If it helps simplify the code and/or improves performance,
I'd favor changing 'grep' so that it simply rejects patterns containing
encoding errors, and exits with status 2. We should probably wait until
after the next release before doing anything that drastic, though.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#17376
; Package
grep
.
(Wed, 30 Apr 2014 22:17:01 GMT)
Full text and
rfc822 format available.
Message #11 received at 17376 <at> debbugs.gnu.org (full text, mbox):
Sorry, tow test cases are wrong. It's as follows surely.
encode() { echo "$1" | tr ABC '\357\274\241'; }
encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
encode aABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
^
encode aABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
^
We will expect none of the commands output anything, but we get 1 row in
4th. We need to fix last line in searchutils.c (is_mb_middle) to make
it correctly.
return 0 < match_len && match_len < mbrlen (p, end - p, &cur_state);
We must check whether it's valid at not the top but a part of last of
matched pattern. Now, although checked here: `a$(encode A)', correctly
should be checked here: `a$(encode A)'. ^
^^^^^^^^^^^
However, it may cause slowdown in some typical cases which doesn't include
any invalid sequences, and many users won't hope it.
Further more, I seem that DFA doesn't treat invalid sequence correctly.
I checked it with debugging on. No longer tokens are broken in 1st case.
encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
dfaanalyze:
0:c3 1:af 2:CAT 3:71 4:OR 5:END 6:CAT
I expect below, becuase `encode ABC' is `ef bc a1'.
dfaanalyze:
0:ef 1:71 2:OR 3:END 4:CAT
However, It will be also difficult to fix it correctly. Therefore,
I propose the simple fix in the patch.
Thanks,
Norihiro
Information forwarded
to
bug-grep <at> gnu.org
:
bug#17376
; Package
grep
.
(Thu, 01 May 2014 04:12:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 17376 <at> debbugs.gnu.org (full text, mbox):
Thanks, I'm starting to see now. Unfortunately when I apply the
proposed patch in http://bugs.gnu.org/17376#5 to the current grep master
10ce910928c757d1adb790a9c9e5bad7de289bd4, 'make check' fails for the
tests invalid-multibyte-infloop and fgrep-infloop. Do you observe these
two test failures on your platform? If not, I can try to look into the
problem some more here.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#17376
; Package
grep
.
(Thu, 01 May 2014 11:50:02 GMT)
Full text and
rfc822 format available.
Message #17 received at 17376 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Paul Eggert wrote:
> Unfortunately when I apply the proposed patch in http://bugs.gnu.org/17376#5
> to the current grep master 10ce910928c757d1adb790a9c9e5bad7de289bd4,
> 'make check' fails for the tests invalid-multibyte-infloop and fgrep-infloop.
Sorry, They're skipped because my platform is too old and timeout program
doesn't exist. I fixed them.
[patch.txt (text/plain, attachment)]
Reply sent
to
Paul Eggert <eggert <at> cs.ucla.edu>
:
You have taken responsibility.
(Mon, 05 May 2014 02:42:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
bug acknowledged by developer.
(Mon, 05 May 2014 02:42:03 GMT)
Full text and
rfc822 format available.
Message #22 received at 17376-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Thanks for that fix. I found a few glitches, mostly having to do with
storing WEOF in a wchar_t, which is not portable. There were also some
opportunities for simplification and clarification. I installed the
patch, and followed it up with the attached three patches. The second
patch is the only one that fixes any bugs.
[0001-tests-improve-coverage-for-prefix-of-multibyte.patch (text/plain, attachment)]
[0002-grep-simplify-and-fix-problems-with-KWset-DFA-agreem.patch (text/plain, attachment)]
[0003-dfa-minor-simplification.patch (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#17376
; Package
grep
.
(Tue, 06 May 2014 03:27:02 GMT)
Full text and
rfc822 format available.
Message #25 received at 17376 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
While thinking about Bug#17376 I noticed some related bugs, which appear
to have been in 'grep' since at least grep 2.0. For example:
$ encode() { echo "$1" | tr ABC '\357\274\241'; }
$ encode ABCABC >exp3
$ encode _____________________ABCABC___ >exp4
$ bca=$(encode BCA)
$ grep "$bca" exp3
$ grep -F "$bca" exp3
$ grep "\\(\\)\\1$bca" exp3
AA
Here the regexp code disagrees with KWset and with the DFA, which is a
bug: KWset and DFA should affect only performance, not behavior.
$ grep "$bca" exp4
_____________________AA___
$ grep -F "$bca" exp4
_____________________AA___
$ grep "\\(\\)\\1$bca" exp4
_____________________AA___
Here they agree, but only because there's a bug in is_mb_middle!
Fixing that will cause them to disagree again.
I installed the attached patch to fix the bugs I found, and to adjust
the test cases accordingly.
[0001-grep-fix-encoding-error-incompatibilities-among-rege.patch (text/plain, attachment)]
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Tue, 03 Jun 2014 11:24:04 GMT)
Full text and
rfc822 format available.
This bug report was last modified 9 years and 327 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.