GNU bug report logs - #22239
fgrep -i slow in 2.21

Package: grep;

Reported by: Ondřej Cífka <ondra <at> cifka.com>

Date: Fri, 25 Dec 2015 22:46:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 22239 in the body.
You can then email your comments to 22239 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Fri, 25 Dec 2015 22:46:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Ondřej Cífka <ondra <at> cifka.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 25 Dec 2015 22:46:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ondřej Cífka <ondra <at> cifka.com>
To: bug-grep <at> gnu.org
Subject: fgrep -i slow in 2.21
Date: Fri, 25 Dec 2015 23:44:42 +0100

When running "grep -Fi -f list.txt" where list.txt has thousands of
lines, it takes orders of magnitude longer to process the input file
than without the -i. This was not the case in grep 2.16, where fgrep
-i was about as fast as fgrep.

--
Ondřej Cífka

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Mon, 11 Apr 2016 05:03:01 GMT) Full text and rfc822 format available.

Message #8 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Ondřej Cífka <ondra <at> cifka.com>
Cc: 22239 <at> debbugs.gnu.org
Subject: Re: fgrep -i slow in 2.21
Date: Sun, 10 Apr 2016 22:02:18 -0700

This is following up on:

http://bugs.gnu.org/22239

Could you please supply some specific test cases to illustrate the problem? I'm 
not seeing it, but this could be because I'm not using your locale, or am using 
machine-generated patterns. I am using master grep in the C locale on Ubuntu 
15.10, with a pattern file generated like this:

seq -f 'pattern %g' 100000 >list.txt

and running commands like this in the 'grep' source directory:

src/grep -F -i -f list.txt $(git ls-files | grep -v gnulib)

(or the same command without -i).

Added tag(s) moreinfo. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 11 Apr 2016 05:03:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Mon, 11 Apr 2016 07:15:01 GMT) Full text and rfc822 format available.

Message #13 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Ondřej Cífka <ondra <at> cifka.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22239 <at> debbugs.gnu.org
Subject: Re: fgrep -i slow in 2.21
Date: Mon, 11 Apr 2016 09:14:40 +0200

You're probably right about the locale. I'm using cs_CZ.UTF-8. With
LC_ALL=C, both variants run faster and the difference is
insignificant.

With cs_CZ.UTF-8, on my machine, your test case takes 2.322s with -i
and 0.464s without -i.

I tested on my Aspell dictionary dump, where the difference is more noticeable:

aspell dump master | head -n 100000 >list.txt

grep 2.21 with -i: 7.336s
grep 2.21 without -i: 0.312s
grep 2.16 with -i: 0.372s
grep 2.16 without -i: 0.431s

With LC_ALL=C, both versions are about as fast.

On Mon, Apr 11, 2016 at 7:02 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> This is following up on:
>
> http://bugs.gnu.org/22239
>
> Could you please supply some specific test cases to illustrate the problem?
> I'm not seeing it, but this could be because I'm not using your locale, or
> am using machine-generated patterns. I am using master grep in the C locale
> on Ubuntu 15.10, with a pattern file generated like this:
>
> seq -f 'pattern %g' 100000 >list.txt
>
> and running commands like this in the 'grep' source directory:
>
> src/grep -F -i -f list.txt $(git ls-files | grep -v gnulib)
>
> (or the same command without -i).

Removed tag(s) moreinfo. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 11 Apr 2016 07:35:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Wed, 21 Dec 2016 06:46:01 GMT) Full text and rfc822 format available.

Message #18 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 22357-done <at> debbugs.gnu.org, 21763-done <at> debbugs.gnu.org
Cc: Norihiro Tanaka <noritnk <at> kcn.ne.jp>,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 Jim Meyering <jim <at> meyering.net>, "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>,
 JQK <jquan <at> redhat.com>, arnold <at> skeeve.com, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Paul Jackson <pj <at> usa.net>,
 Bruno Haible <bruno <at> clisp.org>,
 Ondřej Cífka <ondra <at> cifka.com>,
 Bruce Dubbs <bruce.dubbs <at> gmail.com>
Subject: Re: bug#22357: grep -f not only huge memory usage, but also huge time
 cost
Date: Tue, 20 Dec 2016 21:17:01 -0800

[Message part 1 (text/plain, inline)]

I installed the attached patches into grep master. These fix the performance 
regressions noted at the start of Bug#22357. I see that the related performance 
problems noted in Bug#21763 seem to be fixed too, I expect because of Norihiro 
Tanaka's recent changes, so I'll boldly close both bug reports.

To some extent the attached patches restore the old behavior for grep -F, when 
grep is given two or more patterns. The patch doesn't change the underlying 
algorithms; it merely uses a different heuristic to decide whether to use the -F 
matcher. Although I wouldn't be surprised if the attached patches hurt 
performance in some cases, I didn't uncover any such cases in my performance 
testing, which I admit mostly consisted of running the examples in the 
abovementioned bug reports.

I'll leave Bug#22239 open, as I get the following performance figures 
(user+system CPU time) for the Bug#22239 benchmark, where list.txt is created by 
"aspell dump master | head -n 100000 >list.txt", and the grep commands all use 
the operands "-F -f list.txt /etc/passwd" in the en_US.utf8 locale on Fedora 24 
x86-64.

  no -i       -i       grep version
   0.25      0.33      2.16
   0.26     10.95      2.21
   0.11      2.90*     current master (including attached patches)

In the C locale, the current grep master is always significantly faster than 
grep 2.16 or 2.21 on the benchmark, so the only significant problem is the 
number marked "*". I ran the benchmarks on an AMD Phenom II X4 910e.

[0001-grep-simplify-line-counting-in-patterns.patch (text/x-diff, attachment)]

[0002-grep-simplify-matcher-configuration.patch (text/x-diff, attachment)]

[0003-grep-fix-performance-with-multiple-patterns.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Wed, 21 Dec 2016 23:24:02 GMT) Full text and rfc822 format available.

Message #21 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 21763-done <at> debbugs.gnu.org, Bruce Dubbs <bruce.dubbs <at> gmail.com>,
 22357-done <at> debbugs.gnu.org, sur-behoffski <sur_behoffski <at> grouse.com.au>,
 "L.A. Walsh" <law <at> tlinx.org>, Jim Meyering <jim <at> meyering.net>, "Bennett,
 Steve" <S.Bennett <at> lancaster.ac.uk>, JQK <jquan <at> redhat.com>, arnold <at> skeeve.com,
 22239 <at> debbugs.gnu.org, Trevor Cordes <gnu <at> tecnopolis.ca>,
 Paul Jackson <pj <at> usa.net>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#22357: grep -f not only huge memory usage,
 but also huge time cost
Date: Thu, 22 Dec 2016 08:04:42 +0900

On Tue, 20 Dec 2016 21:17:01 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> I installed the attached patches into grep master. These fix the
> performance regressions noted at the start of Bug#22357. I see that
> the related performance problems noted in Bug#21763 seem to be fixed
> too, I expect because of Norihiro Tanaka's recent changes, so I'll
> boldly close both bug reports.
> 
> To some extent the attached patches restore the old behavior for grep
> -F, when grep is given two or more patterns. The patch doesn't change
> the underlying algorithms; it merely uses a different heuristic to
> decide whether to use the -F matcher. Although I wouldn't be
> surprised if the attached patches hurt performance in some cases, I
> didn't uncover any such cases in my performance testing, which I
> admit mostly consisted of running the examples in the abovementioned
> bug reports.
> 
> I'll leave Bug#22239 open, as I get the following performance figures
> (user+system CPU time) for the Bug#22239 benchmark, where list.txt is
> created by "aspell dump master | head -n 100000 >list.txt", and the
> grep commands all use the operands "-F -f list.txt /etc/passwd" in
> the en_US.utf8 locale on Fedora 24 x86-64.
> 
>    no -i       -i       grep version
>     0.25      0.33      2.16
>     0.26     10.95      2.21
>     0.11      2.90*     current master (including attached patches)
> 
> In the C locale, the current grep master is always significantly
> faster than grep 2.16 or 2.21 on the benchmark, so the only
> significant problem is the number marked "*". I ran the benchmarks on
> an AMD Phenom II X4 910e.

Thanks.

BTW, are you aware of extreme slowdown in the following cases after
third patch?

  yes $(printf %040d 0) | head -10000000 >inp
  printf '0\n1\n' >pat
  env LC_ALL=C src/grep -w -f pat inp

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Thu, 22 Dec 2016 17:11:02 GMT) Full text and rfc822 format available.

Message #24 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 21763-done <at> debbugs.gnu.org,
 Bruce Dubbs <bruce.dubbs <at> gmail.com>, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 "Bennett, Steve" <S.Bennett <at> lancaster.ac.uk>, JQK <jquan <at> redhat.com>,
 Aharon Robbins <arnold <at> skeeve.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Paul Jackson <pj <at> usa.net>,
 Bruno Haible <bruno <at> clisp.org>, Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#22357: grep -f not only huge memory usage,
 but also huge time cost
Date: Thu, 22 Dec 2016 03:14:16 -0800

On Wed, Dec 21, 2016 at 3:04 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>   yes $(printf %040d 0) | head -10000000 >inp
>   printf '0\n1\n' >pat
>   env LC_ALL=C src/grep -w -f pat inp

Thanks for mentioning that.
Here is some actual data (Fedora 25 on an i7-4770S):

for i in $(env ls -1dv /p/p/grep-*/bin/grep) src/grep; do env LC_ALL=C
time -f "%e $i" $i -w -f pat inp; done
1.05 /p/p/grep-2.3/bin/grep
0.91 /p/p/grep-2.4.1/bin/grep
0.91 /p/p/grep-2.4.2/bin/grep
0.91 /p/p/grep-2.4/bin/grep
0.94 /p/p/grep-2.5.1/bin/grep
0.94 /p/p/grep-2.5.3/bin/grep
0.94 /p/p/grep-2.5.4/bin/grep
0.95 /p/p/grep-2.5/bin/grep
0.90 /p/p/grep-2.6.1/bin/grep
1.03 /p/p/grep-2.6.2/bin/grep
0.90 /p/p/grep-2.6.3/bin/grep
0.90 /p/p/grep-2.6/bin/grep
0.90 /p/p/grep-2.7/bin/grep
0.90 /p/p/grep-2.8/bin/grep
0.90 /p/p/grep-2.9/bin/grep
0.90 /p/p/grep-2.10/bin/grep
0.89 /p/p/grep-2.11/bin/grep
0.90 /p/p/grep-2.12/bin/grep
0.90 /p/p/grep-2.13/bin/grep
0.89 /p/p/grep-2.14/bin/grep
0.90 /p/p/grep-2.15/bin/grep
0.90 /p/p/grep-2.16/bin/grep
0.89 /p/p/grep-2.17/bin/grep
0.90 /p/p/grep-2.18/bin/grep
0.90 /p/p/grep-2.19/bin/grep
0.90 /p/p/grep-2.20/bin/grep
0.86 /p/p/grep-2.21/bin/grep
0.90 /p/p/grep-2.22/bin/grep
0.91 /p/p/grep-2.23/bin/grep
0.91 /p/p/grep-2.24/bin/grep
0.91 /p/p/grep-2.25/bin/grep
0.25 /p/p/grep-2.26/bin/grep
13.07 /p/p/grep-2.27/bin/grep
37.49 src/grep

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Sat, 24 Dec 2016 01:39:02 GMT) Full text and rfc822 format available.

Message #27 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ondřej Cífka <ondra <at> cifka.com>
Subject: Re: bug#22239: bug#22357: grep -f not only huge memory usage, but
 also huge time cost
Date: Fri, 23 Dec 2016 17:38:42 -0800

[Message part 1 (text/plain, inline)]

Norihiro Tanaka wrote:
> are you aware of extreme slowdown in the following cases after third patch?
>
>   yes $(printf %040d 0) | head -10000000 >inp
>   printf '0\n1\n' >pat
>   env LC_ALL=C src/grep -w -f pat inp

No. Thanks, I hadn't considered that possibility. I looked into the slowdown and 
installed the attached patches, which cause 'grep' to run about as fast on this 
test case as grep 2.25 (though not as fast as grep 2.26). The main fix is in 
patch 5. On my platform:

  -------grep version------
   v2.25  v2.26  v2.27 master     locale      command
    1.21   0.69  24.95   1.22     C           grep -w -f pat inp
  207.36 203.15 202.03   1.22     en_US.utf8  grep -w -f pat inp
    1.21   0.69  25.95   0.85     C           grep -w -f pat inp -F
   66.33  68.07  67.21   1.22     en_US.utf8  grep -w -f pat inp -F

All numbers are user+system CPU seconds on Fedora 24 x86-64 (AMD Phenom II X4 
910e). "master" means after the attached patches are installed.

Perhaps we can fiddle with the heuristics a bit so that v2.26 is not 
significantly faster than the master in the C locale.

[0001-maint-rewrite-to-avoid-some-macros.patch (text/x-diff, attachment)]

[0002-grep-remove-C-label.patch (text/x-diff, attachment)]

[0003-grep-simplify-Fexecute.patch (text/x-diff, attachment)]

[0004-grep-specialize-word-finding-functions.patch (text/x-diff, attachment)]

[0005-grep-speed-up-wf-in-C-locale.patch (text/x-diff, attachment)]

[0006-grep-standardize-on-localeinfo.multibyte.patch (text/x-diff, attachment)]

[0007-grep-improve-word-checking-with-UTF-8.patch (text/x-diff, attachment)]

[0008-grep-fix-comment-in-searchutils.c.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Mon, 26 Dec 2016 12:08:02 GMT) Full text and rfc822 format available.

Message #30 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#22239: bug#22357: grep -f not only huge memory usage,
 but also huge time cost
Date: Mon, 26 Dec 2016 21:06:55 +0900

On Fri, 23 Dec 2016 17:38:42 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> No. Thanks, I hadn't considered that possibility. I looked into the
> slowdown and installed the attached patches, which cause 'grep' to
> run about as fast on this test case as grep 2.25 (though not as fast
> as grep 2.26). The main fix is in patch 5. On my platform:

Hmm, how about the following test cases, although it is extreame?

$ cat pat
0
00 0
00 00 0
00 00 00 0
00 00 00 00 0
00 00 00 00 00 0
00 00 00 00 00 00 0
00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 00 00 0
$ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' |
  head -1000000 >inp
$ time -p env LC_ALL=C src/grep -w -f pat inp

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Mon, 26 Dec 2016 13:02:01 GMT) Full text and rfc822 format available.

Message #33 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 21763-done <at> debbugs.gnu.org,
 22357-done <at> debbugs.gnu.org, sur-behoffski <sur_behoffski <at> grouse.com.au>,
 "L.A. Walsh" <law <at> tlinx.org>, JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#22239: bug#22357: grep -f not only huge memory usage, but
 also huge time cost
Date: Mon, 26 Dec 2016 14:01:16 +0100

On Mon, Dec 26, 2016 at 1:06 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>
> On Fri, 23 Dec 2016 17:38:42 -0800
> Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
>> No. Thanks, I hadn't considered that possibility. I looked into the
>> slowdown and installed the attached patches, which cause 'grep' to
>> run about as fast on this test case as grep 2.25 (though not as fast
>> as grep 2.26). The main fix is in patch 5. On my platform:
>
> Hmm, how about the following test cases, although it is extreame?
>
> $ cat pat
> 0
> 00 0
> 00 00 0
> 00 00 00 0
> 00 00 00 00 0
> 00 00 00 00 00 0
> 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 00 00 0
> $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' |
>   head -1000000 >inp
> $ time -p env LC_ALL=C src/grep -w -f pat inp

That took 7 seconds for me.

Here is one that is O(N^2) in the length of the literal search string:
(the search strings are sequences of '0's, with lengths from 10k.. 320k):

$ n=10000; for i in $(seq 6); do env time -f "$n %e" src/grep -f
<(printf %0${n}d 0) <<<1; ((n *= 2)); done
10000 0.44
20000 1.44
40000 5.41
80000 21.27
160000 97.27
320000 426.72

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Mon, 26 Dec 2016 20:08:02 GMT) Full text and rfc822 format available.

Message #36 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#21763: bug#22239: bug#22357: grep -f not only huge memory
 usage, but also huge time cost
Date: Mon, 26 Dec 2016 12:07:49 -0800

[Message part 1 (text/plain, inline)]

Norihiro Tanaka wrote:
> Hmm, how about the following test cases, although it is extreame?

I don't think we need to worry about performance for the case when -w is given, 
and a pattern matches data that contains non-word characters. In practice, such 
cases are rare. I expect that most users would be surprised that -w can match 
non-word characters, and that users wouldn't object to -w rejecting such matches 
(if this wouldn't hurt performance significantly).

While looking into this I did find a very small performance tweak for the test 
case, and installed the attached.

[0001-grep-minor-performance-tweak-for-pure-functions.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Wed, 28 Dec 2016 00:22:01 GMT) Full text and rfc822 format available.

Message #39 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#21763: bug#22239: bug#22357: grep -f not only huge memory
 usage, but also huge time cost
Date: Wed, 28 Dec 2016 09:21:02 +0900

[Message part 1 (text/plain, inline)]

On Mon, 26 Dec 2016 12:07:49 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Norihiro Tanaka wrote:
> > Hmm, how about the following test cases, although it is extreame?
> 
> I don't think we need to worry about performance for the case when -w
> is given, and a pattern matches data that contains non-word
> characters. In practice, such cases are rare. I expect that most
> users would be surprised that -w can match non-word characters, and
> that users wouldn't object to -w rejecting such matches (if this
> wouldn't hurt performance significantly).
> 
> While looking into this I did find a very small performance tweak for
> the test case, and installed the attached.

Thanks.

BTW, with multiple patterns in current master, former uses fgrep matcher,
and later users grep matcher.  I think that it is not reasonable.

  env LC_ALL=C grep -w -f pat inp

  env LC_ALL=C grep -F -w -f pat inp

So I wrote the patch to use fgrep matcher for both.

[0001-grep-imorove-performance-with-multiple-patterns.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Wed, 28 Dec 2016 06:38:01 GMT) Full text and rfc822 format available.

Message #42 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#21763: bug#22239: bug#22357: grep -f not only huge memory
 usage, but also huge time cost
Date: Tue, 27 Dec 2016 22:37:25 -0800

Norihiro Tanaka wrote:
> So I wrote the patch to use fgrep matcher for both.

Thanks, I installed that after tweaking the commit message and omitting 
unnecessary parens.

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Wed, 28 Dec 2016 14:34:01 GMT) Full text and rfc822 format available.

Message #45 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 21763-done <at> debbugs.gnu.org, 22357-done <at> debbugs.gnu.org,
 sur-behoffski <sur_behoffski <at> grouse.com.au>, "L.A. Walsh" <law <at> tlinx.org>,
 JQK <jquan <at> redhat.com>, 22239 <at> debbugs.gnu.org,
 Trevor Cordes <gnu <at> tecnopolis.ca>, Bruno Haible <bruno <at> clisp.org>,
 Ond?ej Cifka <ondra <at> cifka.com>
Subject: Re: bug#21763: bug#22239: bug#22357: grep -f not only huge memory
 usage, but also huge time cost
Date: Wed, 28 Dec 2016 23:33:34 +0900

On Tue, 27 Dec 2016 22:37:25 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Norihiro Tanaka wrote:
> > So I wrote the patch to use fgrep matcher for both.
> 
> Thanks, I installed that after tweaking the commit message and omitting unnecessary parens.

Thanks, I confirmed it.

Information forwarded to bug-grep <at> gnu.org:
bug#22239; Package grep. (Sun, 08 Jan 2017 06:05:02 GMT) Full text and rfc822 format available.

Message #48 received at 22239 <at> debbugs.gnu.org (full text, mbox):

From: Aaron Zhou Qian <aaronzhouqian <at> ucla.edu>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 22239 <at> debbugs.gnu.org
Subject: Re: New Project
Date: Sat, 7 Jan 2017 22:04:00 -0800

[Message part 1 (text/plain, inline)]

On Sat, Jan 7, 2017 at 9:29 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> Could you remind us about the latest status of your proposal compared to
> Zev Weiss's? Does <http://bugs.gnu.org/24689> contain the latest thing
> you have? Zev Weiss's latest version is at <https://github.com/zevweiss/g
> rep>. Comparing the two was the thing Jim Meyering asked for at <
> http://bugs.gnu.org/22239#8>, and you can follow up by sending email to
> 22239 <at> debbugs.gnu.org.

Yes that github link is the latest version. I haven't made any changes to
that since last year September.

Basically the main thread traverses the file tree and assign the file to be
searched to each thread. There is also a dynamic buffer so that the output
is identical to the original grep program.

I tested the program on a server. On a directory containing 4 files, grep -r
on that directory is 4 times faster. On a directory containing 8 files, grep -r
is 6 times faster. On a directory containing 12 files, grep -r is 8.5 times
faster.

I think using multithreading is essentially different from not using
multithreading, and we also don't use multithreading all the time for grep.
When we're not using multithreading, i.e. when we pass in other options for
grep, more functions  would call those functions whose function signatures
we changed. This is hard to keep track of, because the program is fairly
complicated.

If we had overloading in C++ I would overload those functions. But since we
don't, I made it very clear in the code which functions are the
counterparts of the original versions. I did this to contain any potential
problems so that if there are any problems with multithreading it would not
affect the sequential program, whereas if we interleave the two scenarios
we might lose track of what's going on. At least this is what I initially
thought.

I saw that there were some recent commits by Zev together with Jim, for
example:

in

commit 9365ed6536d4fabf42ec17fef1bbe5d78884f950

* src/grep.c (compile_fp_t): Now returns an opaque pointer (the
compiled pattern).
(execute_fp_t): Now passed the pointer returned by a compile_fp_t.
All call sites updated accordingly.
(compiled_pattern): New static variable.
* src/dfasearch.c (GEAcompile): Return a void pointer (dummy NULL).
(EGexecute): Receive a void pointer argument (unused).
* src/kwsearch.c (Fcompile): Return a void pointer (dummy NULL).
(Fexecute): Receive a void pointer argument (unused).
* src/pcresearch.c (Pcompile): Return a void pointer (dummy NULL).
(Pexecute): Receive a void pointer argument (unused).
* src/search.h: Update compile/execute function prototypes.

So we have different approaches. They are trying to add extra pointer
arguments for the multithreading case. The pointer argument would be
NULL in the case multithreading is not in effect.  Whereas my approach
is to replicate the functions so the counterparts of the original
functions are used in the multithreading scenario. This was done in an
attempt to reduce the complexity of each of the functions and make the
program less monolithic. I leave you guys to decide.

[Message part 2 (text/html, inline)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Tue, 17 Jan 2017 16:19:02 GMT) Full text and rfc822 format available.

Notification sent to Ondřej Cífka <ondra <at> cifka.com>:
bug acknowledged by developer. (Tue, 17 Jan 2017 16:19:02 GMT) Full text and rfc822 format available.

Message #53 received at 22239-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Ondřej Cífka <ondra <at> cifka.com>
Cc: 22239-done <at> debbugs.gnu.org
Subject: Re: bug#22239: fgrep -i slow in 2.21
Date: Tue, 17 Jan 2017 08:18:30 -0800

[Message part 1 (text/plain, inline)]

On 04/11/2016 12:14 AM, Ondřej Cífka wrote:
> You're probably right about the locale. I'm using cs_CZ.UTF-8. With
> LC_ALL=C, both variants run faster and the difference is
> insignificant.
>
> With cs_CZ.UTF-8, on my machine, your test case takes 2.322s with -i
> and 0.464s without -i.
>
> I tested on my Aspell dictionary dump, where the difference is more noticeable:
>
> aspell dump master | head -n 100000 >list.txt
>
> grep 2.21 with -i: 7.336s
> grep 2.21 without -i: 0.312s
> grep 2.16 with -i: 0.372s
> grep 2.16 without -i: 0.431s
>
> With LC_ALL=C, both versions are about as fast.

I got some free time to look into this, and installed the attached set
of patches; the 2nd one is the key one. In the en_EN.utf8 locale on my
platform (Fedora 25 x86-64), I get the following user times for 'grep
-Ff list.txt list.txt' where list.txt was generated as you describe:

   0.444 grep 2.16
   0.522 grep 2.16 -i
   0.443 grep 2.21
  13.048 grep 2.21 -i
   0.096 grep current
   0.101 grep current -i

Since this patch causes grep to use Aho-Corasick more often, I expect it
to hurt performance in some cases involving multiple patterns, but we
can look into that as they turn up. In the meantime since the original
bug seems to be fixed I am taking the liberty of closing the bug report.

[0001-build-update-gnulib-submodule-to-latest.txt (text/plain, attachment)]

[0002-Improve-i-performance-in-typical-UTF-8-searches.txt (text/plain, attachment)]

[0003-src-kwset.c-Fix-comment-typo.txt (text/plain, attachment)]

[0004-NEWS-Fix-typo.txt (text/plain, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 15 Feb 2017 12:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 7 years and 65 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #22239 fgrep -i slow in 2.21

GNU bug report logs - #22239
fgrep -i slow in 2.21