GNU bug report logs -
#44754
Extreme performance degradation in GNU grep 3.4 / 3.6
Previous Next
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 44754 in the body.
You can then email your comments to 44754 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#44754
; Package
grep
.
(Fri, 20 Nov 2020 05:32:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
bug-grep <at> gnu.org, f.heckenbach <at> fh-soft.de
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Fri, 20 Nov 2020 05:32:01 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi,
I have a use case where I run grep with a large number of search
patterns on a large text file. It works well with grep-3.3, but with
grep-3.4 it quickly burned through GBs of memory and almost locked
up my system due to swapping.
To avoid attaching those large files, I could mostly reproduce the
effects like this:
ulimit -d 5000000 # avoid system lockup due to excessive swapping
export LC_ALL=C # make sure no Unicode case conversions are needed
% time ./grep-3.3 -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo
real 0m0.054s
user 0m0.048s
sys 0m0.012s
% time ./grep-3.4 -Fwif <(seq 30000 | tr 0-9 A-J) <<<foo
./grep-3.4: Memory exhausted
Aborted
real 0m1.291s
user 0m0.696s
sys 0m0.599s
% time ./grep-3.6 -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo
real 0m13.162s
user 0m12.955s
sys 0m0.211s
grep-3.3 behaves well, even with much larger number of patterns.
Time seems to grow linearly, and memory usage is constant.
grep-3.4 behaves the worst of these 3 versions. Even with just 30000
patterns it exceeds the ulimit of 5 GB.
grep-3.6 behaves a bit better than 3.4, but still bad. Time seems to
be quadratic in the number of patterns, and though memory usage in
this case seems to be almost constant, in my actual use case it also
runs out of memory where grep-3.3 works well with just a few 100 MB
used.
Without "-i", grep-3.4 seems to run as fast as grep-3.3, but
grep-3.6 is almost as slow as with "-i".
So there might actually be two different issues here, one that
affects 3.4 with "-i" and one that affects 3.6 with or without "-i".
Regards,
Frank
Information forwarded
to
bug-grep <at> gnu.org
:
bug#44754
; Package
grep
.
(Thu, 26 Nov 2020 01:13:02 GMT)
Full text and
rfc822 format available.
Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Thu, Nov 19, 2020 at 7:32 PM Frank Heckenbach
<f.heckenbach <at> fh-soft.de> wrote:
> I have a use case where I run grep with a large number of search
> patterns on a large text file. It works well with grep-3.3, but with
> grep-3.4 it quickly burned through GBs of memory and almost locked
> up my system due to swapping.
>
> To avoid attaching those large files, I could mostly reproduce the
> effects like this:
>
> ulimit -d 5000000 # avoid system lockup due to excessive swapping
> export LC_ALL=C # make sure no Unicode case conversions are needed
>
> % time ./grep-3.3 -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo
>
> real 0m0.054s
> user 0m0.048s
> sys 0m0.012s
>
> % time ./grep-3.4 -Fwif <(seq 30000 | tr 0-9 A-J) <<<foo
> ./grep-3.4: Memory exhausted
> Aborted
>
> real 0m1.291s
> user 0m0.696s
> sys 0m0.599s
>
> % time ./grep-3.6 -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo
>
> real 0m13.162s
> user 0m12.955s
> sys 0m0.211s
>
> grep-3.3 behaves well, even with much larger number of patterns.
> Time seems to grow linearly, and memory usage is constant.
>
> grep-3.4 behaves the worst of these 3 versions. Even with just 30000
> patterns it exceeds the ulimit of 5 GB.
>
> grep-3.6 behaves a bit better than 3.4, but still bad. Time seems to
> be quadratic in the number of patterns, and though memory usage in
> this case seems to be almost constant, in my actual use case it also
> runs out of memory where grep-3.3 works well with just a few 100 MB
> used.
>
> Without "-i", grep-3.4 seems to run as fast as grep-3.3, but
> grep-3.6 is almost as slow as with "-i".
>
> So there might actually be two different issues here, one that
> affects 3.4 with "-i" and one that affects 3.6 with or without "-i".
Thank you for the fine bug report.
The grep-3.6 bug you've exposed is due to the fact that your input
triggers excessive hash collisions when using the code modeled after
gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
take O(N^2) time for N patterns. In the attached, I've switched grep
to use the djb2 hash function, and that resolves the problem. I'll
also add a NEWS entry and a test before pushing this.
[0001-grep-avoid-performance-regression-with-many-patterns.patch (application/octet-stream, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#44754
; Package
grep
.
(Thu, 26 Nov 2020 01:13:02 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#44754
; Package
grep
.
(Thu, 26 Nov 2020 19:04:02 GMT)
Full text and
rfc822 format available.
Message #14 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Wed, Nov 25, 2020 at 3:12 PM Jim Meyering <jim <at> meyering.net> wrote:
> Thank you for the fine bug report.
> The grep-3.6 bug you've exposed is due to the fact that your input
> triggers excessive hash collisions when using the code modeled after
> gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
> take O(N^2) time for N patterns. In the attached, I've switched grep
> to use the djb2 hash function, and that resolves the problem. I'll
> also add a NEWS entry and a test before pushing this.
Timings suggest that grep-3.6's preprocessing came closer to O(N^3).
Here's an example that would take 2-3 days with grep-3.6 and only
seconds with this fix:
: | grep -Ff <(seq 6400000 | tr 0-9 A-J)
Here's a complete patch.
I'll push it later today.
[0001-grep-avoid-performance-regression-with-many-patterns.patch (application/octet-stream, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#44754
; Package
grep
.
(Thu, 26 Nov 2020 19:05:02 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#44754
; Package
grep
.
(Fri, 27 Nov 2020 07:42:02 GMT)
Full text and
rfc822 format available.
Message #20 received at submit <at> debbugs.gnu.org (full text, mbox):
On Thu, Nov 26, 2020 at 9:03 AM Jim Meyering <jim <at> meyering.net> wrote:
>
> On Wed, Nov 25, 2020 at 3:12 PM Jim Meyering <jim <at> meyering.net> wrote:
> > Thank you for the fine bug report.
> > The grep-3.6 bug you've exposed is due to the fact that your input
> > triggers excessive hash collisions when using the code modeled after
> > gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
> > take O(N^2) time for N patterns. In the attached, I've switched grep
> > to use the djb2 hash function, and that resolves the problem. I'll
> > also add a NEWS entry and a test before pushing this.
>
> Timings suggest that grep-3.6's preprocessing came closer to O(N^3).
> Here's an example that would take 2-3 days with grep-3.6 and only
> seconds with this fix:
>
> : | grep -Ff <(seq 6400000 | tr 0-9 A-J)
>
> Here's a complete patch.
> I'll push it later today.
Pushed along with two gnulib-related changes.
Reply sent
to
Jim Meyering <jim <at> meyering.net>
:
You have taken responsibility.
(Fri, 27 Nov 2020 07:42:03 GMT)
Full text and
rfc822 format available.
Notification sent
to
bug-grep <at> gnu.org, f.heckenbach <at> fh-soft.de
:
bug acknowledged by developer.
(Fri, 27 Nov 2020 07:42:03 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#44754
; Package
grep
.
(Thu, 03 Dec 2020 08:27:02 GMT)
Full text and
rfc822 format available.
Message #28 received at 44754 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Thu, 26 Nov 2020 21:41:20 -1000
Jim Meyering <jim <at> meyering.net> wrote:
> On Thu, Nov 26, 2020 at 9:03 AM Jim Meyering <jim <at> meyering.net> wrote:
> >
> > On Wed, Nov 25, 2020 at 3:12 PM Jim Meyering <jim <at> meyering.net> wrote:
> > > Thank you for the fine bug report.
> > > The grep-3.6 bug you've exposed is due to the fact that your input
> > > triggers excessive hash collisions when using the code modeled after
> > > gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
> > > take O(N^2) time for N patterns. In the attached, I've switched grep
> > > to use the djb2 hash function, and that resolves the problem. I'll
> > > also add a NEWS entry and a test before pushing this.
> >
> > Timings suggest that grep-3.6's preprocessing came closer to O(N^3).
> > Here's an example that would take 2-3 days with grep-3.6 and only
> > seconds with this fix:
> >
> > : | grep -Ff <(seq 6400000 | tr 0-9 A-J)
> >
> > Here's a complete patch.
> > I'll push it later today.
>
> Pushed along with two gnulib-related changes.
The fix has improved some performance. However, it's still quite slow
compared to version 3.3, and that can be remedied.
It converts to grep only if the potential match does not match the word
frequently.
[0001-grep-improvement-of-grep-convertion-for-fgrep.patch (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#44754
; Package
grep
.
(Sat, 05 Dec 2020 18:07:02 GMT)
Full text and
rfc822 format available.
Message #31 received at 44754 <at> debbugs.gnu.org (full text, mbox):
On Thu, Dec 3, 2020 at 12:26 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> On Thu, 26 Nov 2020 21:41:20 -1000
> Jim Meyering <jim <at> meyering.net> wrote:
>
> > On Thu, Nov 26, 2020 at 9:03 AM Jim Meyering <jim <at> meyering.net> wrote:
> > >
> > > On Wed, Nov 25, 2020 at 3:12 PM Jim Meyering <jim <at> meyering.net> wrote:
> > > > Thank you for the fine bug report.
> > > > The grep-3.6 bug you've exposed is due to the fact that your input
> > > > triggers excessive hash collisions when using the code modeled after
> > > > gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
> > > > take O(N^2) time for N patterns. In the attached, I've switched grep
> > > > to use the djb2 hash function, and that resolves the problem. I'll
> > > > also add a NEWS entry and a test before pushing this.
> > >
> > > Timings suggest that grep-3.6's preprocessing came closer to O(N^3).
> > > Here's an example that would take 2-3 days with grep-3.6 and only
> > > seconds with this fix:
> > >
> > > : | grep -Ff <(seq 6400000 | tr 0-9 A-J)
> > >
> > > Here's a complete patch.
> > > I'll push it later today.
> >
> > Pushed along with two gnulib-related changes.
>
> The fix has improved some performance. However, it's still quite slow
> compared to version 3.3, and that can be remedied.
>
> It converts to grep only if the potential match does not match the word
> frequently.
Thank you for that patch. Can you say a little more about the domain
of the problem?
I.e., is it specific to invocations with "-w"?
Can you provide an example that exhibits the performance improvement,
with timings?
Information forwarded
to
bug-grep <at> gnu.org
:
bug#44754
; Package
grep
.
(Sun, 06 Dec 2020 06:07:02 GMT)
Full text and
rfc822 format available.
Message #34 received at 44754 <at> debbugs.gnu.org (full text, mbox):
Jim Meyering <jim <at> meyering.net> wrote:
> On Thu, Dec 3, 2020 at 12:26 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > On Thu, 26 Nov 2020 21:41:20 -1000
> > Jim Meyering <jim <at> meyering.net> wrote:
> >
> > > On Thu, Nov 26, 2020 at 9:03 AM Jim Meyering <jim <at> meyering.net> wrote:
> > > >
> > > > On Wed, Nov 25, 2020 at 3:12 PM Jim Meyering <jim <at> meyering.net> wrote:
> > > > > Thank you for the fine bug report.
> > > > > The grep-3.6 bug you've exposed is due to the fact that your input
> > > > > triggers excessive hash collisions when using the code modeled after
> > > > > gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
> > > > > take O(N^2) time for N patterns. In the attached, I've switched grep
> > > > > to use the djb2 hash function, and that resolves the problem. I'll
> > > > > also add a NEWS entry and a test before pushing this.
> > > >
> > > > Timings suggest that grep-3.6's preprocessing came closer to O(N^3).
> > > > Here's an example that would take 2-3 days with grep-3.6 and only
> > > > seconds with this fix:
> > > >
> > > > : | grep -Ff <(seq 6400000 | tr 0-9 A-J)
> > > >
> > > > Here's a complete patch.
> > > > I'll push it later today.
> > >
> > > Pushed along with two gnulib-related changes.
> >
> > The fix has improved some performance. However, it's still quite slow
> > compared to version 3.3, and that can be remedied.
> >
> > It converts to grep only if the potential match does not match the word
> > frequently.
>
> Thank you for that patch. Can you say a little more about the domain
> of the problem?
> I.e., is it specific to invocations with "-w"?
> Can you provide an example that exhibits the performance improvement,
> with timings?
I can confirm that it seems to solve my problem (my actual use
case):
grep-3.3:
real 0m0.861s
user 0m0.784s
sys 0m0.076s
grep-3.6 with this patch:
real 0m1.052s
user 0m0.943s
sys 0m0.108s
grep-3.6 with both your patch and this one:
real 0m1.097s
user 0m1.037s
sys 0m0.060s
Apparently there were at least two issues here. While your patch
fixes the simplified case I reported and had no effect on my actual
use case, Norihiro Tanaka's patch fixes the latter (and has no big
effect on the former).
So it looks like both patches are indeed needed.
Thanks,
Frank
Information forwarded
to
bug-grep <at> gnu.org
:
bug#44754
; Package
grep
.
(Mon, 07 Dec 2020 07:39:01 GMT)
Full text and
rfc822 format available.
Message #37 received at 44754 <at> debbugs.gnu.org (full text, mbox):
On Sat, 5 Dec 2020 10:06:27 -0800
Jim Meyering <jim <at> meyering.net> wrote:
> Thank you for that patch. Can you say a little more about the domain
> of the problem?
> I.e., is it specific to invocations with "-w"?
> Can you provide an example that exhibits the performance improvement,
> with timings?
The test case shown by Frank speeds-up about 10x. That is a little
slower than grep 3.3.
Before:
$ env LC_ALL=C time -p src/grep -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo
real 4.91
user 4.41
sys 0.30
After:
$ env LC_ALL=C time -p src/grep -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo
real 0.37
user 0.16
sys 0.01
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Mon, 04 Jan 2021 12:24:09 GMT)
Full text and
rfc822 format available.
This bug report was last modified 3 years and 123 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.