GNU bug report logs - #37754
wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
To reply to this bug, email your comments to 37754 AT debbugs.gnu.org.
Report forwarded to bug-grep <at> gnu.org: bug#37754; Package grep. (Tue, 15 Oct 2019 01:49:01 GMT)
Acknowledgement sent to "Trent W. Buck" <trentbuck <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 15 Oct 2019 01:49:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Package: grep
Version: 3.3-1
Severity: wishlist
This bug was originally reported as
https://bugs.debian.org/940464
Trent W. Buck wrote:
> (Surely someone has already asked for this, but I can't see where.
> I may have already reported this myself, and forgotten.
> If so, sorry!)
>
> Right now if you do
>
> grep -eX -eY -eZ
>
> You'll get lines that match *any of* X, Y, or Z.
> Quite often I want to search for lines that match *all of* X, Y, and Z — but in any order.
> For example,
>
> # all 2TB 2.5-inch SATA products
> grep -Fwi -eSATA -e2TB -e2.5in products.csv
>
> Below is a short discussion of the workarounds I know about.
>
> Is "grep --and" something that has already been discussed and rejected?
> I looked through debbugs.gnu.org and the source tarball, but
> I couldn't find anything about this.
>
>
> PS: grep -v --and would intuitively mean "not all",
> i.e. "grep -v --and -eX -eY" would omit only lines matching *both* X and Y,
> and return everything else (lines matching just X, just Y, or neither).
>
> PS: I can't decide if "--and" or "--intersection" is a better name.
> I put both in the bug subject so people searching for either will find this ticket.
> I think "--all" is probably too confusing.
>
>
>
> Workaround #1
> =============
> I can work around this by listing every possible order, but 1) this
> scales poorly with the number of patterns; and 2) it can't be used
> with -F. For example,
>
> grep --and -eX -eY -eZ input*.txt # becomes
>
> grep -e 'Z.*Y.*X' \
> -e 'Z.*X.*Y' \
> -e 'Y.*Z.*X' \
> -e 'Y.*X.*Z' \
> -e 'X.*Z.*Y' \
> -e 'X.*Y.*Z' \
> input*.txt
>
>
> Workaround #2
> =============
> I can pipe greps together. This is what I currently do.
> This is more convenient and feels faster than workaround #1, but
> I suspect the inter-process overhead is significant.
>
> If grep implemented this internally, it could zero-copy.
> Being able to "grep -rnH --and" &c would also be convenient.
>
> For example,
>
> grep --and -F -eX -eY -eZ input*.txt # becomes
>
> cat input*.txt |
> grep -F -eX |
> grep -F -eY |
> grep -F -eZ
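Workaround #2 generalizes to any number of patterns. The following is a minimal sketch, assuming a POSIX shell; grep_all is a hypothetical helper name, not an existing grep option and not the reporter's own script:

```shell
# grep_all: hypothetical helper (not part of grep).
# Reads stdin and prints only the lines that contain every
# fixed-string pattern given as an argument, in any order.
grep_all() {
    if [ "$#" -eq 0 ]; then
        cat                 # no patterns left: pass lines through
        return
    fi
    # Filter on the first pattern, then recurse on the rest.
    grep -F -e "$1" | { shift; grep_all "$@"; }
}

# Example: keep lines containing SATA, 2TB and 2.5in.
printf '%s\n' 'SATA 2TB 2.5in' 'SATA 4TB 2.5in' '2.5in 2TB SATA' |
    grep_all SATA 2TB 2.5in
```

Each stage still runs as a separate process, so this is only a convenience wrapper around the pipeline, not the zero-copy behavior the report asks for.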
Information forwarded to bug-grep <at> gnu.org: bug#37754; Package grep. (Wed, 16 Oct 2019 12:27:02 GMT)
Message #8 received at 37754 <at> debbugs.gnu.org (full text, mbox):
On Tue, 15 Oct 2019 12:48:17 +1100
"Trent W. Buck" <trentbuck <at> gmail.com> wrote:
> [...]
I do not know whether this has been discussed or rejected, but if the
feature were added to grep, it would probably be implemented as an
internal conversion like workaround #1. However, that scales poorly, as
you say, and it would be slower than workaround #2 in many cases.
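To make the scaling concern concrete: listing every order of n patterns needs n! alternatives. A small sketch in POSIX shell arithmetic (count_orderings is an illustrative name, not anything in grep):

```shell
# count_orderings N: print N!, the number of alternatives that
# workaround #1 needs for N patterns (illustrative helper only).
count_orderings() {
    n=$1
    result=1
    while [ "$n" -gt 1 ]; do
        result=$((result * n))
        n=$((n - 1))
    done
    printf '%s\n' "$result"
}

for k in 2 3 4 5 6; do
    printf '%s patterns -> %s alternatives\n' "$k" "$(count_orderings "$k")"
done
```

Three patterns give the six alternatives shown in workaround #1; six patterns would already need 720.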
Information forwarded to bug-grep <at> gnu.org: bug#37754; Package grep. (Wed, 16 Oct 2019 18:58:01 GMT)
Message #11 received at 37754 <at> debbugs.gnu.org (full text, mbox):
Wouldn't it be more useful to have an intersection operator in regular
expressions? That is, the pattern 'A\&B' would match anything that is
matched by both A and B. If A and B have parenthesized subexpressions,
both sets of parentheses would match and would count.
Assuming concatenation has higher precedence than \&, the requested
behavior could be achieved via:
grep '.*X.*\&.*Y.*\&.*Z.*'
This approach would allow intersection to be nested inside other
operations. Also, it would clarify how other features work. For example,
grep -o has clear semantics with this approach, whereas the semantics of
grep -o are not so clear with the proposed --and option.
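The \& operator proposed here does not exist in grep today. For comparison only, a similar any-order match can be approximated with lookahead assertions where grep was built with PCRE support (the -P option of GNU grep). This is a sketch of a different technique, not the proposed operator, and match_all_xyz is a hypothetical name:

```shell
# match_all_xyz: hypothetical helper; needs a grep with -P (PCRE) support.
# Each (?=.*P) lookahead asserts that P occurs somewhere on the line
# without consuming input, so the three conditions are order-independent.
match_all_xyz() {
    grep -P '^(?=.*X)(?=.*Y)(?=.*Z)'
}
```

For example, printf '%s\n' 'Z Y X' 'X Y' | match_all_xyz keeps only the first line. The lookaheads themselves match the empty string at the start of the line, so this approach does not settle the grep -o question either.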
Information forwarded to bug-grep <at> gnu.org: bug#37754; Package grep. (Thu, 17 Oct 2019 00:21:01 GMT)
Message #14 received at 37754 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> Wouldn't it be more useful to have an intersection operator in regular
> expressions? That is, the pattern 'A\&B' would match anything that is
> matched by both A and B. If A and B have parenthesized subexpressions, both
> sets of parentheses would match and would count.
Not for me personally, because I almost always want to use it with -Fwi :-)
(-F is a lot faster - about as fast as LC_COLLATE=C - and it also means
I don't have to think about escaping special characters.)
> [...]
>
> This approach would allow intersection to be nested inside other operations.
> Also, it would clarify how other features work. For example, grep -o has
> clear semantics with this approach, whereas the semantics of grep -o are not
> so clear with the proposed --and option.
I hadn't thought about -o, and I agree that is not very obvious.
Given an input file like
30$ Gamdias EROS (M2) USB Multi-Color Lighting Gaming Headset
30$ Gamdias POSEIDON E1 Gaming Combo 3-in-1 K/B+3200dpi Optical Mouse+Stereo Headset
30$ GeIL (GP34GB1600C11SC) 4GB DDR3 1600 Desktop RAM
30$ GeIL Pristine (GP44GB2400C17SC) 4GB Single DDR4 2400 Desktop RAM
30$ GeIL SO-DIMM 4GB (GGS34GB1600C11SC) 1.35V (Low Voltage) 4GB DDR3 1600 Notebook Ram
Where currently "grep -Fw -e 4GB -e DDR4 -o" prints
4GB
4GB
DDR4
4GB
4GB
I would expect "grep -Fw -e 4GB -e DDR4 --and -o" to print the same thing as
grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o
i.e.
4GB
DDR4
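The comparison above can be reproduced with the sample lines from this message. A sketch, assuming GNU grep behavior for -o and -w; sample is a hypothetical helper that only prints the five lines:

```shell
# sample: hypothetical helper that prints the five lines quoted above.
sample() {
    cat <<'EOF'
30$ Gamdias EROS (M2) USB Multi-Color Lighting Gaming Headset
30$ Gamdias POSEIDON E1 Gaming Combo 3-in-1 K/B+3200dpi Optical Mouse+Stereo Headset
30$ GeIL (GP34GB1600C11SC) 4GB DDR3 1600 Desktop RAM
30$ GeIL Pristine (GP44GB2400C17SC) 4GB Single DDR4 2400 Desktop RAM
30$ GeIL SO-DIMM 4GB (GGS34GB1600C11SC) 1.35V (Low Voltage) 4GB DDR3 1600 Notebook Ram
EOF
}

# Union (current behavior): every whole-word 4GB or DDR4 on any line.
sample | grep -Fw -o -e 4GB -e DDR4

# Intersection via a pipeline: keep lines containing both words,
# then print the matched parts of those lines only.
sample | grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -o -e 4GB -e DDR4
```

The first command prints the five matches listed above; the second prints only 4GB and DDR4, both from the single DDR4 line.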
Information forwarded to bug-grep <at> gnu.org: bug#37754; Package grep. (Thu, 17 Oct 2019 08:28:02 GMT)
Message #17 received at 37754 <at> debbugs.gnu.org (full text, mbox):
On 10/16/19 5:19 PM, Trent W. Buck wrote:
> I would expect "grep -Fw -e 4GB -e DDR4 --and" to print the same thing as
>
> grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o
You're right, it's not obvious. :-)
It may be better to just pipe greps together, as you do now. That's simple and
fast enough for this relatively-uncommon case, and it's portable to all greps.
Information forwarded to bug-grep <at> gnu.org: bug#37754; Package grep. (Fri, 18 Oct 2019 11:50:02 GMT)
Message #20 received at 37754 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> On 10/16/19 5:19 PM, Trent W. Buck wrote:
> > I would expect "grep -Fw -e 4GB -e DDR4 --and" to print the same thing as
> >
> > grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o
>
> You're right, it's not obvious. :-)
>
> It may be better to just pipe greps together, as you do now. That's simple
> and fast enough for this relatively-uncommon case, and it's portable to all
> greps.
I admit that most of the time, I want "grep --and" for a small dataset
(<1MB computer_parts.txt), so it's merely a convenience.
Sometimes I grep audit logs (~1TB uncompressed), which takes anywhere
from 15 minutes to 3 days, depending on how I tweak my grep calls.
In that case, each grep in the pipeline has to pay the costs to
de-serialize input from the previous grep, and re-serialize output to
the next grep. If the first grep matches (say) 200GB of the 1TB,
that can be a lot of overhead (I assume).
I was basically hoping that if it was all in a single grep process,
the de/serialization steps could be skipped completely.
I think the buzzword for that is "zero-copy"?
I've noticed plain "grep" is about 30% slower than either "grep -F" or
"LC_COLLATE=C grep", because (I think) those avoid the cost of decoding
from UTF-8 to Unicode and back. So I was basically expecting a
similar saving from --and.
I'm only speaking as an end user - I haven't dug through the grep
source, so those expectations might be unrealistic, and implementing
it might be painful/impossible. I figured I should at least ask :-)
If your expert opinion is that it's a pain to implement (and
maintain!) and there's not enough demand, then I'm OK with that.
This is NOT something that's burning me every day.
Regardless, I appreciate you taking the time to discuss it. :-)
PS: Regarding portability, I'm personally not worried because when I
need a GNUism badly enough (e.g. du --threshold), I can usually get
permission to install the relevant GNU software, even if it's only
into %APPDATA% or $HOME.
PS: I noticed on bugs.gnu.org something about grep being
single-threaded, which might mean "grep --and" would end up being
SLOWER than the existing pipelines, since the kernel can distribute
a pipeline's elements across multiple CPUs/cores.
Information forwarded to bug-grep <at> gnu.org: bug#37754; Package grep. (Fri, 18 Oct 2019 17:52:03 GMT)
Message #23 received at 37754 <at> debbugs.gnu.org (full text, mbox):
On 10/18/19 4:49 AM, Trent W. Buck wrote:
> In that case, each grep in the pipeline has to pay the costs to
> de-serialize input from the previous grep
Sure, but grep is designed to be a simple tool and we need to draw the
line somewhere. For something more complicated there are already sed and
awk (if you want to write to POSIX) or Perl or Python or whatever.
I mildly prefer the A\&B notation because it could be used
everywhere, not just in grep. (But of course someone would have to
implement it. :-)
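As one illustration of the awk route mentioned above, here is a minimal sketch of fixed-string, any-order matching in POSIX awk. The function name awk_all is hypothetical, and the sketch reads standard input only:

```shell
# awk_all: hypothetical helper. Prints the lines of stdin that contain
# every argument as a fixed substring, in any order, in one process.
awk_all() {
    awk '
        BEGIN {
            # Copy the patterns out of ARGV, then set ARGC to 1 so awk
            # does not treat them as input file names and reads stdin.
            n = ARGC - 1
            for (i = 1; i <= n; i++) pat[i] = ARGV[i]
            ARGC = 1
        }
        {
            # Skip the line as soon as one pattern is missing.
            for (i = 1; i <= n; i++)
                if (index($0, pat[i]) == 0) next
            print
        }
    ' "$@"
}
```

For example, printf '%s\n' 'a b c' 'a c' | awk_all a b prints only "a b c". Whether a single awk process is actually faster than piped greps on large inputs would need measuring.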
Information forwarded to bug-grep <at> gnu.org: bug#37754; Package grep. (Fri, 18 Oct 2019 22:37:01 GMT)
Message #26 received at submit <at> debbugs.gnu.org (full text, mbox):
I'm currently working on rewriting and packaging up a tool that
I use to handle such high volume [and/or/not] filters on long lists
of file pathnames and of log file entries. It's a tool I've had in
my private toolbox for decades. I call it "ftest". It has a rich set
of "test" like flags for testing stat(2) attributes of files, but is
optimized for working in pipelines (as a filter, hence the "f").
Trent - do you need regular expression matching, or is glob matching
easily sufficient, or would even just fixed string matching be useful?
For [and/or/not] logical combinations of full regular expressions,
I'll probably continue to use awk, as Paul Eggert suggested, though
that might be because I've long been an awk user, since teaching
an awk class to other engineers inside Bell Labs, some 40 years ago.
Perhaps sometime, months into the future, I'll follow up with an
update pointing to my "ftest" command on github.
--
Paul Jackson
pj <at> usa.net
Information forwarded to bug-grep <at> gnu.org: bug#37754; Package grep. (Sat, 19 Oct 2019 06:58:01 GMT)
Message #29 received at 37754 <at> debbugs.gnu.org (full text, mbox):
On Tue, 15 Oct 2019 12:48:17 +1100
"Trent W. Buck" <trentbuck <at> gmail.com> wrote:
> Workaround #1
> =============
> I can work around this by listing every possible order, but 1) this
> scales poorly with the number of patterns; and 2) it can't be used
> with -F. For example,
>
> grep --and -eX -eY -eZ input*.txt # becomes
>
> grep -eZ.*Y.*X \
> -eZ.*X.*Y \
> -eY.*Z.*X \
> -eY.*X.*Z \
> -eX.*Z.*Y \
> -eX.*Y.*Z \
> input*.txt
I have noticed that the above two do not necessarily produce the same results.
grep --and -e123 -e234 input*.txt
grep -e '123.*234' -e '234.*123' input*.txt
"1234" matches the first, but it does not match the second, because the two patterns overlap in the input.
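The difference comes from overlapping matches: in "1234" the substrings 123 and 234 share the characters "23", so neither ordering pattern can match even though each pattern matches on its own. A small sketch using existing grep options (chained and ordered are illustrative names):

```shell
# chained: each pattern is checked independently (workaround #2).
chained() {
    grep -e 123 | grep -e 234
}

# ordered: every order is listed explicitly (workaround #1).
ordered() {
    grep -e '123.*234' -e '234.*123'
}

printf '1234\n' | chained                     # prints 1234
printf '1234\n' | ordered || echo 'no match'  # prints no match
```

An internal --and built on the ordering approach would therefore need extra care to agree with the pipeline on overlapping patterns.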
Information forwarded to bug-grep <at> gnu.org: bug#37754; Package grep. (Wed, 18 Jan 2023 01:59:01 GMT)
Message #32 received at 37754 <at> debbugs.gnu.org (full text, mbox):
On Fri 18 Oct 2019 22:49:23 +1100, Trent W. Buck wrote:
> Paul Eggert wrote:
> > On 10/16/19 5:19 PM, Trent W. Buck wrote:
> > > I would expect "grep -Fw -e 4GB -e DDR4 --and" to print the same thing as
> > >
> > > grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o
> >
> > You're right, it's not obvious. :-)
> >
> > It may be better to just pipe greps together, as you do now. That's simple
> > and fast enough for this relatively-uncommon case, and it's portable to all
> > greps.
>
> I admit that most of the time, I want "grep --and" for a small dataset
> (<1MB computer_parts.txt), so it's merely a convenience.
I noticed I forgot to attach a helper script I've been using for decades.
Here it is.
[foldr.sh (application/x-sh, attachment)]
This bug report was last modified 1 year and 121 days ago.