GNU bug report logs - #19306
[PATCH 1/2] dfa: avoid execution for a pattern including an unsupported expression

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Mon, 8 Dec 2014 15:26:01 UTC

Severity: normal

Tags: patch

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 19306 in the body.
You can then email your comments to 19306 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#19306; Package grep. (Mon, 08 Dec 2014 15:26:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 08 Dec 2014 15:26:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-grep <at> gnu.org>
Subject: [PATCH 1/2] dfa: avoid execution for a pattern including an
 unsupported expression
Date: Tue, 09 Dec 2014 00:24:38 +0900
[Message part 1 (text/plain, inline)]
If a pattern includes an unsupported expression which is labeled as
BACKREF, a text will almost fail to match the pattern.

First patch makes changes to return immediately and avoid pidding search
if a pattern includes an unsupported expression.

Second patch removes word constraint support which is as labeled BEGWORD,
ENDWORD, LIMWORD and NOTLIMWORD for multibyte-locales including UTF-8
from DFA, as we use regex instead of DFA for a pattern including word
constraint so that it does not correctly treat word constraint.
[0001-dfa-avoid-execution-for-a-pattern-including-an-unsup.patch (text/plain, attachment)]
[0002-dfa-remove-a-support-of-a-word-delimiter-expression-.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#19306; Package grep. (Sun, 19 Jul 2015 05:17:01 GMT) Full text and rfc822 format available.

Message #8 received at 19306 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 19306 <at> debbugs.gnu.org
Subject: Re: [PATCH 1/2] dfa: avoid execution for a pattern including an
 unsupported expression
Date: Sat, 18 Jul 2015 22:15:33 -0700
Hello,
Thank you for the patches in this report:

  http://bugs.gnu.org/19306

Please excuse my delay in getting back to you on this.
Would you revise each of those to include a test case
that demonstrates the problem/fix?




Information forwarded to bug-grep <at> gnu.org:
bug#19306; Package grep. (Sun, 19 Jul 2015 07:44:02 GMT) Full text and rfc822 format available.

Message #11 received at 19306 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 19306 <at> debbugs.gnu.org
Subject: Re: [PATCH 1/2] dfa: avoid execution for a pattern including an
 unsupported expression
Date: Sun, 19 Jul 2015 16:42:46 +0900
[Message part 1 (text/plain, inline)]
On Sat, 18 Jul 2015 22:15:33 -0700
Jim Meyering <jim <at> meyering.net> wrote:

> Hello,
> Thank you for the patches in this report:
> 
>   http://bugs.gnu.org/19306
> 
> Please excuse my delay in getting back to you on this.
> Would you revise each of those to include a test case
> that demonstrates the problem/fix?

Thanks for your reviewing of this report.

This is not bug fix.  It avoids that BACKREF is found in the process of
DFAEXEC and passed to regex in multibyte locale.  In other words, if a
pattern includes BACKREF, grep does not try to use DFA from the
beginning.

I confirmed about 10% speed-up for a test case in attachment.

  Before patching: real 7.29  user 7.26  sys 0.02
  After patching : real 6.57  user 6.55  sys 0.01

KWset and DFA superset succeeds for all rows in the test case, and DFA
for multibyte succeeds, too.  However, all rows are rejected in regex.

After patching, grep does not try DFA for multibyte, as pattern includes
BACKREF.

In addtion, I believe that DFA is simplified by removal of handling for
BACKREF from dfaanalyze(), dfassbuild() and dfaexec().
[testcase.sh (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#19306; Package grep. (Sun, 19 Jul 2015 11:25:02 GMT) Full text and rfc822 format available.

Message #14 received at 19306 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: noritnk <at> kcn.ne.jp, jim <at> meyering.net
Cc: 19306 <at> debbugs.gnu.org
Subject: Re: bug#19306: [PATCH 1/2] dfa: avoid execution for a pattern
 including an unsupported expression
Date: Sun, 19 Jul 2015 05:24:01 -0600
Hi.

This looks like a nice patch that gawk would beneifit from.

I have a minor suggestion, which is to make dfabackref into a
static function.

Thanks,

Arnold
----------------------
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> On Sat, 18 Jul 2015 22:15:33 -0700
> Jim Meyering <jim <at> meyering.net> wrote:
>
> > Hello,
> > Thank you for the patches in this report:
> > 
> >   http://bugs.gnu.org/19306
> > 
> > Please excuse my delay in getting back to you on this.
> > Would you revise each of those to include a test case
> > that demonstrates the problem/fix?
>
> Thanks for your reviewing of this report.
>
> This is not bug fix.  It avoids that BACKREF is found in the process of
> DFAEXEC and passed to regex in multibyte locale.  In other words, if a
> pattern includes BACKREF, grep does not try to use DFA from the
> beginning.
>
> I confirmed about 10% speed-up for a test case in attachment.
>
>   Before patching: real 7.29  user 7.26  sys 0.02
>   After patching : real 6.57  user 6.55  sys 0.01
>
> KWset and DFA superset succeeds for all rows in the test case, and DFA
> for multibyte succeeds, too.  However, all rows are rejected in regex.
>
> After patching, grep does not try DFA for multibyte, as pattern includes
> BACKREF.
>
> In addtion, I believe that DFA is simplified by removal of handling for
> BACKREF from dfaanalyze(), dfassbuild() and dfaexec().




Information forwarded to bug-grep <at> gnu.org:
bug#19306; Package grep. (Sun, 19 Jul 2015 14:56:01 GMT) Full text and rfc822 format available.

Message #17 received at 19306 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 19306 <at> debbugs.gnu.org
Cc: arnold <at> skeeve.com, jim <at> meyering.net
Subject: Re: bug#19306: [PATCH 1/2] dfa: avoid execution for a pattern
 including an unsupported expression
Date: Sun, 19 Jul 2015 23:55:15 +0900
[Message part 1 (text/plain, inline)]
On Sun, 19 Jul 2015 05:24:01 -0600
arnold <at> skeeve.com wrote:

> I have a minor suggestion, which is to make dfabackref into a
> static function.

dfabackref() is not called from external, so it should be static, as you
say.  I fixed it.

Thanks.
[0001-dfa-avoid-execution-for-a-pattern-including-an-unsup.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#19306; Package grep. (Mon, 20 Jul 2015 03:16:02 GMT) Full text and rfc822 format available.

Message #20 received at 19306 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 19306 <at> debbugs.gnu.org
Subject: Re: [PATCH 1/2] dfa: avoid execution for a pattern including an
 unsupported expression
Date: Sun, 19 Jul 2015 20:14:52 -0700
On Sun, Jul 19, 2015 at 12:42 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> On Sat, 18 Jul 2015 22:15:33 -0700
> Jim Meyering <jim <at> meyering.net> wrote:
>
>> Hello,
>> Thank you for the patches in this report:
>>
>>   http://bugs.gnu.org/19306
>>
>> Please excuse my delay in getting back to you on this.
>> Would you revise each of those to include a test case
>> that demonstrates the problem/fix?
>
> Thanks for your reviewing of this report.
>
> This is not bug fix.  It avoids that BACKREF is found in the process of
> DFAEXEC and passed to regex in multibyte locale.  In other words, if a
> pattern includes BACKREF, grep does not try to use DFA from the
> beginning.
>
> I confirmed about 10% speed-up for a test case in attachment.
>
>   Before patching: real 7.29  user 7.26  sys 0.02
>   After patching : real 6.57  user 6.55  sys 0.01
>
> KWset and DFA superset succeeds for all rows in the test case, and DFA
> for multibyte succeeds, too.  However, all rows are rejected in regex.
>
> After patching, grep does not try DFA for multibyte, as pattern includes
> BACKREF.
>
> In addtion, I believe that DFA is simplified by removal of handling for
> BACKREF from dfaanalyze(), dfassbuild() and dfaexec().

Thank you for the additional information and the test script.
I like most of this patch, but not the fact that it causes the
word-delim-multibyte test to fail. I do see that also applying your
following patch makes that test pass once again.  However, it does so
at the cost of forcing a new class of regexps (any that contain a use
of \b, \<  or \>) from DFA into the slower regex matcher.

That feels like too large a performance penalty, in general.

Can you quantify it?




Information forwarded to bug-grep <at> gnu.org:
bug#19306; Package grep. (Mon, 20 Jul 2015 15:15:02 GMT) Full text and rfc822 format available.

Message #23 received at 19306 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 19306 <at> debbugs.gnu.org
Subject: Re: [PATCH 1/2] dfa: avoid execution for a pattern including an
 unsupported expression
Date: Tue, 21 Jul 2015 00:14:25 +0900
On Sun, 19 Jul 2015 20:14:52 -0700
Jim Meyering <jim <at> meyering.net> wrote:

> Thank you for the additional information and the test script.
> I like most of this patch, but not the fact that it causes the
> word-delim-multibyte test to fail. I do see that also applying your
> following patch makes that test pass once again.  However, it does so
> at the cost of forcing a new class of regexps (any that contain a use
> of \b, \<  or \>) from DFA into the slower regex matcher.

I think DFA forces regex for BEGWORD, LIMWORD, ENDWORD, instead of
whether patching or not.  Could you remark code in dfassbuild() without
patching?  It seem that DFA rejects their words from before.

        case BEGWORD:
        case ENDWORD:
        case LIMWORD:
        case NOTLIMWORD:
          if (d->multibyte)
            {
              /* These constraints aren't supported in a multibyte locale.
                 Ignore them in the superset DFA, and treat them as
                 backreferences in the main DFA.  */
              sup->tokens[j++] = EMPTY;
              d->tokens[i] = BACKREF;  <<<<
              break;
            }

DFA does not handle word context in multibyte correctly.  Perhaps, if we
fix it, DFA will take a performance penalty.





Information forwarded to bug-grep <at> gnu.org:
bug#19306; Package grep. (Sun, 26 Jul 2015 16:11:01 GMT) Full text and rfc822 format available.

Message #26 received at 19306 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 19306 <at> debbugs.gnu.org
Subject: Re: [PATCH 1/2] dfa: avoid execution for a pattern including an
 unsupported expression
Date: Sun, 26 Jul 2015 09:10:05 -0700
[Message part 1 (text/plain, inline)]
On Mon, Jul 20, 2015 at 8:14 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>
> On Sun, 19 Jul 2015 20:14:52 -0700
> Jim Meyering <jim <at> meyering.net> wrote:
>
>> Thank you for the additional information and the test script.
>> I like most of this patch, but not the fact that it causes the
>> word-delim-multibyte test to fail. I do see that also applying your
>> following patch makes that test pass once again.  However, it does so
>> at the cost of forcing a new class of regexps (any that contain a use
>> of \b, \<  or \>) from DFA into the slower regex matcher.
>
> I think DFA forces regex for BEGWORD, LIMWORD, ENDWORD, instead of
> whether patching or not.  Could you remark code in dfassbuild() without
> patching?  It seem that DFA rejects their words from before.
>
>         case BEGWORD:
>         case ENDWORD:
>         case LIMWORD:
>         case NOTLIMWORD:
>           if (d->multibyte)
>             {
>               /* These constraints aren't supported in a multibyte locale.
>                  Ignore them in the superset DFA, and treat them as
>                  backreferences in the main DFA.  */
>               sup->tokens[j++] = EMPTY;
>               d->tokens[i] = BACKREF;  <<<<
>               break;
>             }
>
> DFA does not handle word context in multibyte correctly.  Perhaps, if we
> fix it, DFA will take a performance penalty.

You're right. That covers it.

This patch is good also for eliminating some technical debt.
Since your change eliminates the code below (effectively making
the same change from conjunct to disjunct), there is no reason
to make the following correction:

               /* Falling back to the glibc matcher in this case gives   \
                  better performance (up to 25% better on [a-z], for     \
                  example) and enables support for collating symbols and \
                  equivalence classes.  */                               \
-              if (d->states[s].has_mbcset && backref)                   \
+              if (d->states[s].has_mbcset || backref)                   \
                 {                                                       \
                   *backref = 1;                                         \
                   goto done;                                            \
                 }                                                       \

At first I was surprised to see that using "&&" there provoked
no test failure, but then saw that we test "backref" shortly thereafter.
While technically, using "&&" there is a bug, it seems the effect was
negligible.

I have made some changes, renaming functions, e.g., dfabackref -> dfa_supported,
since even before this change "backref" meant more than the presence
of a backreference.

Note that while your commit log entry described new functions, I have
modified the commit
log to say merely "new function" for each. Instead, I document the new
functions in the code,
where that documentation will be more useful, and maintained.

Please let me know if there is anything you'd like to change:
[0001-dfa-avoid-execution-for-a-pattern-including-an-unsup.patch (text/x-patch, attachment)]
[0002-dfa-remove-word-delimiter-support-for-multibyte-loca.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#19306; Package grep. (Sun, 26 Jul 2015 22:30:03 GMT) Full text and rfc822 format available.

Message #29 received at 19306 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 19306 <at> debbugs.gnu.org
Subject: Re: [PATCH 1/2] dfa: avoid execution for a pattern including an
 unsupported expression
Date: Mon, 27 Jul 2015 07:29:42 +0900
On Sun, 26 Jul 2015 09:10:05 -0700
Jim Meyering <jim <at> meyering.net> wrote:

> You're right. That covers it.
> 
> This patch is good also for eliminating some technical debt.
> Since your change eliminates the code below (effectively making
> the same change from conjunct to disjunct), there is no reason
> to make the following correction:
> 
>                /* Falling back to the glibc matcher in this case gives   \
>                   better performance (up to 25% better on [a-z], for     \
>                   example) and enables support for collating symbols and \
>                   equivalence classes.  */                               \
> -              if (d->states[s].has_mbcset && backref)                   \
> +              if (d->states[s].has_mbcset || backref)                   \
>                  {                                                       \
>                    *backref = 1;                                         \
>                    goto done;                                            \
>                  }                                                       \
> 
> At first I was surprised to see that using "&&" there provoked
> no test failure, but then saw that we test "backref" shortly thereafter.
> While technically, using "&&" there is a bug, it seems the effect was
> negligible.
> 
> I have made some changes, renaming functions, e.g., dfabackref -> dfa_supported,
> since even before this change "backref" meant more than the presence
> of a backreference.
> 
> Note that while your commit log entry described new functions, I have
> modified the commit
> log to say merely "new function" for each. Instead, I document the new
> functions in the code,
> where that documentation will be more useful, and maintained.
> 
> Please let me know if there is anything you'd like to change:

Thanks for reviewing.  

On Sun, 26 Jul 2015 09:10:05 -0700
Jim Meyering <jim <at> meyering.net> wrote:

> This patch is good also for eliminating some technical debt.
> Since your change eliminates the code below (effectively making
> the same change from conjunct to disjunct), there is no reason
> to make the following correction:
>
>                /* Falling back to the glibc matcher in this case gives   \
>                   better performance (up to 25% better on [a-z], for     \
>                   example) and enables support for collating symbols and \
>                   equivalence classes.  */                               \
> -              if (d->states[s].has_mbcset && backref)                   \
> +              if (d->states[s].has_mbcset || backref)                   \
>                  {                                                       \
>                    *backref = 1;                                         \
>                    goto done;                                            \
>                  }                                                       \
> 
> At first I was surprised to see that using "&&" there provoked
> no test failure, but then saw that we test "backref" shortly thereafter.
> While technically, using "&&" there is a bug, it seems the effect was
> negligible.

The change is wrong, and previous code is right.  If BACKREF is null,
may be core dumped at *BACKREF = 1.

> I have made some changes, renaming functions, e.g., dfabackref -> dfa_supported,
> since even before this change "backref" meant more than the presence
> of a backreference.

I agree.

By the way, now BACKREF is used in meaning of non-support already.  So
In the near future, Maybe it should be changed into UNSUPPORTED etc.

> Note that while your commit log entry described new functions, I have
> modified the commit
> log to say merely "new function" for each. Instead, I document the new
> functions in the code,
> where that documentation will be more useful, and maintained.
> 
> Please let me know if there is anything you'd like to change:

Thanks for ajustment of commit log.  I have no request to change.  I
agree to the changes.





Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Sun, 26 Jul 2015 23:42:02 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Sun, 26 Jul 2015 23:42:02 GMT) Full text and rfc822 format available.

Message #34 received at 19306-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 19306-done <at> debbugs.gnu.org
Subject: Re: [PATCH 1/2] dfa: avoid execution for a pattern including an
 unsupported expression
Date: Sun, 26 Jul 2015 16:41:34 -0700
pushed.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 24 Aug 2015 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 247 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.