GNU bug report logs - #24689
Multithreaded grep Available

Previous Next

Package: grep;

Reported by: Aaron Zhou Qian <aaronzhouqian <at> ucla.edu>

Date: Fri, 14 Oct 2016 15:36:02 UTC

Severity: normal

To reply to this bug, email your comments to 24689 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#24689; Package grep. (Fri, 14 Oct 2016 15:36:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Aaron Zhou Qian <aaronzhouqian <at> ucla.edu>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 14 Oct 2016 15:36:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Aaron Zhou Qian <aaronzhouqian <at> ucla.edu>
To: bug-grep <at> gnu.org
Subject: Multithreaded grep Available
Date: Fri, 14 Oct 2016 08:14:42 -0700
[Message part 1 (text/plain, inline)]
Hi,

I've multithreaded the grep at the file granularity when used with -r or -R
on directories. The four files that I changed i.e. grep.c dosbuff.c
dfasearch.c search.h under src/ is in this repository:

https://github.com/AaronZhouQian/grep

Please take a look at the README there.


The default number of threads is the number of cores online in the system.

To specify a custom number of threads use -p or --parallel followed by the
number of threads.

Currently multithreading does not support context i.e. -A -B -C


I'd like to invite everyone to test this patch.


Thank you,

Aaron
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#24689; Package grep. (Sat, 15 Oct 2016 05:05:02 GMT) Full text and rfc822 format available.

Message #8 received at 24689 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Aaron Zhou Qian <aaronzhouqian <at> ucla.edu>, Zev Weiss <zev <at> bewilderbeest.net>
Cc: 24689 <at> debbugs.gnu.org
Subject: Re: bug#24689: Multithreaded grep Available
Date: Fri, 14 Oct 2016 22:03:52 -0700
On Fri, Oct 14, 2016 at 8:14 AM, Aaron Zhou Qian <aaronzhouqian <at> ucla.edu> wrote:
> Hi,
>
> I've multithreaded the grep at the file granularity when used with -r or -R
> on directories. The four files that I changed i.e. grep.c dosbuff.c
> dfasearch.c search.h under src/ is in this repository:
>
> https://github.com/AaronZhouQian/grep
>
> Please take a look at the README there.
>
>
> The default number of threads is the number of cores online in the system.
>
> To specify a custom number of threads use -p or --parallel followed by the
> number of threads.
>
> Currently multithreading does not support context i.e. -A -B -C
>
>
> I'd like to invite everyone to test this patch.

Thank you for working on that.
Before even looking at your patch, I have to ask whether you have
filed copyright assignment papers with the FSF. If not, I can get you
started.

Secondly, did you know that Zev Weiss has submitted some patches to
make grep multithreaded? He also helped to make dfa.c (now in gnulib)
multithread safe recently. It's been a while, and I don't know the
status of his patch set. At a minimum, please tell us how your
approach compares to his. https://github.com/zevweiss/grep




Information forwarded to bug-grep <at> gnu.org:
bug#24689; Package grep. (Tue, 10 Jan 2017 00:47:02 GMT) Full text and rfc822 format available.

Message #11 received at 24689 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Aaron Zhou Qian <aaronzhouqian <at> ucla.edu>
Cc: 24689 <at> debbugs.gnu.org
Subject: Fwd: bug#22239: New Project
Date: Mon, 9 Jan 2017 16:46:30 -0800
Oops, sorry, I gave you the wrong bug number. I am forwarding this to
Bug#24689, which is what I should have told you in the first place.


-------- Forwarded Message --------
Subject: 	bug#22239: New Project
Resent-Date: 	Sun, 08 Jan 2017 06:05:02 +0000
Resent-From: 	Aaron Zhou Qian <aaronzhouqian <at> ucla.edu>
Resent-CC: 	bug-grep <at> gnu.org
Date: 	Sat, 7 Jan 2017 22:04:00 -0800
From: 	Aaron Zhou Qian <aaronzhouqian <at> ucla.edu>
To: 	Paul Eggert <eggert <at> cs.ucla.edu>, 22239 <at> debbugs.gnu.org



On Sat, Jan 7, 2017 at 9:29 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> Could you remind us about the latest status of your proposal compared to
> Zev Weiss's? Does <http://bugs.gnu.org/24689> contain the latest thing
> you have? Zev Weiss's latest version is at <https://github.com/zevweiss/g
> rep>. Comparing the two was the thing Jim Meyering asked for at <
> http://bugs.gnu.org/22239#8>, and you can follow up by sending email to
> 22239 <at> debbugs.gnu.org.




Yes that github link is the latest version. I haven't made any changes to
that since last year September.

Basically the main thread traverses the file tree and assign the file to be
searched to each thread. There is also a dynamic buffer so that the output
is identical to the original grep program.

I tested the program on a server. On a directory containing 4 files, grep -r
on that directory is 4 times faster. On a directory containing 8 files, grep -r
is 6 times faster. On a directory containing 12 files, grep -r is 8.5 times
faster.

I think using multithreading is essentially different from not using
multithreading, and we also don't use multithreading all the time for grep.
When we're not using multithreading, i.e. when we pass in other options for
grep, more functions  would call those functions whose function signatures
we changed. This is hard to keep track of, because the program is fairly
complicated.

If we had overloading in C++ I would overload those functions. But since we
don't, I made it very clear in the code which functions are the
counterparts of the original versions. I did this to contain any potential
problems so that if there are any problems with multithreading it would not
affect the sequential program, whereas if we interleave the two scenarios
we might lose track of what's going on. At least this is what I initially
thought.

I saw that there were some recent commits by Zev together with Jim, for
example:

in

commit 9365ed6536d4fabf42ec17fef1bbe5d78884f950

* src/grep.c (compile_fp_t): Now returns an opaque pointer (the
compiled pattern).
(execute_fp_t): Now passed the pointer returned by a compile_fp_t.
All call sites updated accordingly.
(compiled_pattern): New static variable.
* src/dfasearch.c (GEAcompile): Return a void pointer (dummy NULL).
(EGexecute): Receive a void pointer argument (unused).
* src/kwsearch.c (Fcompile): Return a void pointer (dummy NULL).
(Fexecute): Receive a void pointer argument (unused).
* src/pcresearch.c (Pcompile): Return a void pointer (dummy NULL).
(Pexecute): Receive a void pointer argument (unused).
* src/search.h: Update compile/execute function prototypes.

So we have different approaches. They are trying to add extra pointer
arguments for the multithreading case. The pointer argument would be
NULL in the case multithreading is not in effect.  Whereas my approach
is to replicate the functions so the counterparts of the original
functions are used in the multithreading scenario. This was done in an
attempt to reduce the complexity of each of the functions and make the
program less monolithic. I leave you guys to decide.





This bug report was last modified 7 years and 114 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.