GNU bug report logs - #17376
[PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Wed, 30 Apr 2014 15:03:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 17376 in the body.
You can then email your comments to 17376 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-grep <at> gnu.org:
bug#17376; Package grep. (Wed, 30 Apr 2014 15:03:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 30 Apr 2014 15:03:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: bug-grep <at> gnu.org
Subject: [PATCH] grep: fix the different behaviour for a invalid sequence
 between KWset and DFA
Date: Thu, 01 May 2014 00:02:19 +0900

[Message part 1 (text/plain, inline)]

Thare is different behaviour for a invalid sequence between KWset and DFA.

  encode() { echo "$1" | tr ABC '\357\274\241'; }
  encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
  encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
  encode sABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
  encode sABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"

We expect that all of them are same results, but only 4th returns 1 row.
This patch fixes it, changes all into 1 row returned.

Norihiro

[patch.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17376; Package grep. (Wed, 30 Apr 2014 19:05:02 GMT) Full text and rfc822 format available.

Message #8 received at 17376 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 17376 <at> debbugs.gnu.org
Subject: Re: bug#17376: [PATCH] grep: fix the different behaviour for a invalid
 sequence between KWset and DFA
Date: Wed, 30 Apr 2014 12:04:50 -0700

On 04/30/2014 08:02 AM, Norihiro Tanaka wrote:
> Thare is different behaviour for a invalid sequence between KWset and DFA.
>
>    encode() { echo "$1" | tr ABC '\357\274\241'; }
>    encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
>    encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
>    encode sABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
>    encode sABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
>
> We expect that all of them are same results, but only 4th returns 1 row.

Sorry, but I am not observing this behavior.  With grep 2.18, none of 
the commands output anything.  The same is true for the git master.

If the pattern or data have encoding errors, POSIX says grep can do 
whatever it likes.  As I understand it, in grep 2.18 and the git master, 
an encoding-error byte in a pattern matches only the same encoding-error 
byte in the data.  Does this bug report's patch change behavior, so that 
an encoding-error byte in a pattern can match part of a valid 
multibyte-character in the data?  If so, it's not clear to me why the 
proposed behavior change is helpful -- as a user, I'm not sure I'd want 
such a match to work.  If not, then could you please explain a bit more 
what's going on?

More generally, I don't think users care about encoding-error bytes in 
patterns.  If it helps simplify the code and/or improves performance, 
I'd favor changing 'grep' so that it simply rejects patterns containing 
encoding errors, and exits with status 2.  We should probably wait until 
after the next release before doing anything that drastic, though.

Information forwarded to bug-grep <at> gnu.org:
bug#17376; Package grep. (Wed, 30 Apr 2014 22:17:01 GMT) Full text and rfc822 format available.

Message #11 received at 17376 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17376 <at> debbugs.gnu.org
Subject: bug#17376: [PATCH] grep: fix the different behaviour for a invalid
 sequence between KWset and DFA
Date: Thu, 01 May 2014 07:16:31 +0900

Sorry, tow test cases are wrong.  It's as follows surely.

  encode() { echo "$1" | tr ABC '\357\274\241'; }
  encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
  encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
  encode aABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
         ^
  encode aABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
         ^

We will expect none of the commands output anything, but we get 1 row in
4th.  We need to fix last line in searchutils.c (is_mb_middle) to make
it correctly.

  return 0 < match_len && match_len < mbrlen (p, end - p, &cur_state);

We must check whether it's valid at not the top but a part of last of
matched pattern.  Now, although checked here: `a$(encode A)', correctly
should be checked here: `a$(encode A)'.        ^
                          ^^^^^^^^^^^
However, it may cause slowdown in some typical cases which doesn't include
any invalid sequences, and many users won't hope it.

Further more, I seem that DFA doesn't treat invalid sequence correctly.
I checked it with debugging on.  No longer tokens are broken in 1st case.

  encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"

  dfaanalyze:
   0:c3 1:af 2:CAT 3:71 4:OR 5:END 6:CAT

I expect below, becuase `encode ABC' is `ef bc a1'.

  dfaanalyze:
    0:ef 1:71 2:OR 3:END 4:CAT

However, It will be also difficult to fix it correctly.  Therefore,
I propose the simple fix in the patch.

Thanks,
Norihiro

Information forwarded to bug-grep <at> gnu.org:
bug#17376; Package grep. (Thu, 01 May 2014 04:12:02 GMT) Full text and rfc822 format available.

Message #14 received at 17376 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 17376 <at> debbugs.gnu.org
Subject: Re: bug#17376: [PATCH] grep: fix the different behaviour for a invalid
 sequence between KWset and DFA
Date: Wed, 30 Apr 2014 21:11:27 -0700

Thanks, I'm starting to see now.  Unfortunately when I apply the 
proposed patch in http://bugs.gnu.org/17376#5 to the current grep master 
10ce910928c757d1adb790a9c9e5bad7de289bd4, 'make check' fails for the 
tests invalid-multibyte-infloop and fgrep-infloop.  Do you observe these 
two test failures on your platform?  If not, I can try to look into the 
problem some more here.

Information forwarded to bug-grep <at> gnu.org:
bug#17376; Package grep. (Thu, 01 May 2014 11:50:02 GMT) Full text and rfc822 format available.

Message #17 received at 17376 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 17376 <at> debbugs.gnu.org
Subject: bug#17376: [PATCH] grep: fix the different behaviour for a invalid
 sequence between KWset and DFA
Date: Thu, 01 May 2014 20:49:05 +0900

[Message part 1 (text/plain, inline)]

Paul Eggert wrote:
> Unfortunately when I apply the proposed patch in http://bugs.gnu.org/17376#5
> to the current grep master 10ce910928c757d1adb790a9c9e5bad7de289bd4,
> 'make check' fails for the tests invalid-multibyte-infloop and fgrep-infloop.

Sorry, They're skipped because my platform is too old and timeout program
doesn't exist.  I fixed them.

[patch.txt (text/plain, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 05 May 2014 02:42:02 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Mon, 05 May 2014 02:42:03 GMT) Full text and rfc822 format available.

Message #22 received at 17376-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 17376-done <at> debbugs.gnu.org
Subject: Re: bug#17376: [PATCH] grep: fix the different behaviour for a invalid
 sequence between KWset and DFA
Date: Sun, 04 May 2014 19:41:29 -0700

[Message part 1 (text/plain, inline)]

Thanks for that fix.  I found a few glitches, mostly having to do with 
storing WEOF in a wchar_t, which is not portable.  There were also some 
opportunities for simplification and clarification.  I installed the 
patch, and followed it up with the attached three patches.  The second 
patch is the only one that fixes any bugs.

[0001-tests-improve-coverage-for-prefix-of-multibyte.patch (text/plain, attachment)]

[0002-grep-simplify-and-fix-problems-with-KWset-DFA-agreem.patch (text/plain, attachment)]

[0003-dfa-minor-simplification.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17376; Package grep. (Tue, 06 May 2014 03:27:02 GMT) Full text and rfc822 format available.

Message #25 received at 17376 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 17376 <at> debbugs.gnu.org
Subject: Re: bug#17376: [PATCH] grep: fix the different behaviour for a invalid
 sequence between KWset and DFA
Date: Mon, 05 May 2014 20:26:37 -0700

[Message part 1 (text/plain, inline)]

While thinking about Bug#17376 I noticed some related bugs, which appear 
to have been in 'grep' since at least grep 2.0.  For example:

$ encode() { echo "$1" | tr ABC '\357\274\241'; }
$ encode ABCABC >exp3
$ encode _____________________ABCABC___ >exp4
$ bca=$(encode BCA)
$ grep "$bca" exp3
$ grep -F "$bca" exp3
$ grep "\\(\\)\\1$bca" exp3
ＡＡ

Here the regexp code disagrees with KWset and with the DFA, which is a 
bug: KWset and DFA should affect only performance, not behavior.

$ grep "$bca" exp4
_____________________ＡＡ___
$ grep -F "$bca" exp4
_____________________ＡＡ___
$ grep "\\(\\)\\1$bca" exp4
_____________________ＡＡ___

Here they agree, but only because there's a bug in is_mb_middle!
Fixing that will cause them to disagree again.

I installed the attached patch to fix the bugs I found, and to adjust 
the test cases accordingly.

[0001-grep-fix-encoding-error-incompatibilities-among-rege.patch (text/plain, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 03 Jun 2014 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 9 years and 327 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #17376 [PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA

GNU bug report logs - #17376
[PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA