GNU bug report logs - #37754
wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)

Package: grep;

Reported by: "Trent W. Buck" <trentbuck <at> gmail.com>

Date: Tue, 15 Oct 2019 01:49:01 UTC

Severity: wishlist

Found in version 3.3-1

To reply to this bug, email your comments to 37754 AT debbugs.gnu.org.


Report forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (Tue, 15 Oct 2019 01:49:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Trent W. Buck" <trentbuck <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 15 Oct 2019 01:49:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Trent W. Buck" <trentbuck <at> gmail.com>
To: submit <at> debbugs.gnu.org
Subject: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: Tue, 15 Oct 2019 12:48:17 +1100
Package: grep
Version: 3.3-1
Severity: wishlist

This bug was originally reported as
https://bugs.debian.org/940464

Trent W. Buck wrote:
> (Surely someone has already asked for this, but I can't see where.
> I may have already reported this myself, and forgotten.
> If so, sorry!)
>
> Right now if you do
>
>     grep -eX -eY -eZ
>
> You'll get lines that match *any of* X, Y, or Z.
> Quite often I want to search for lines that match *all of* X, Y, and Z — but in any order.
> For example,
>
>     # all 2TB 2.5-inch SATA products
>     grep -Fwi -eSATA -e2TB -e2.5in products.csv
>
> Below is a short discussion of the workarounds I know about.
>
> Is "grep --and" something that has already been discussed and rejected?
> I looked through debbugs.gnu.org and the source tarball, but
> I couldn't find anything about this.
>
>
> PS: grep -v --and would intuitively mean "not all",
> i.e. "grep -v --and -eX -eY" would return lines matching X *or* Y, but
> omit lines matching *both* X and Y.
>
> PS: I can't decide if "--and" or "--intersection" is a better name.
> I put both in the bug subject so people searching for either will find this ticket.
> I think "--all" is probably too confusing.
>
>
>
> Workaround #1
> =============
> I can work around this by listing every possible order, but 1) this
> scales poorly with the number of patterns; and 2) it can't be used
> with -F.  For example,
>
>     grep --and -eX -eY -eZ input*.txt   # becomes
>
>     grep -eZ.*Y.*X \
>          -eZ.*X.*Y \
>          -eY.*Z.*X \
>          -eY.*X.*Z \
>          -eX.*Z.*Y \
>          -eX.*Y.*Z \
>          input*.txt
>
>
> Workaround #2
> =============
> I can pipe greps together.  This is what I currently do.
> This is more convenient and feels faster than workaround #1, but
> I suspect the inter-process overhead is significant.
>
> If grep implemented this internally, it could zero-copy.
> Being able to "grep -rnH --and" &c would also be convenient.
>
> For example,
>
>     grep --and -F -eX -eY -eZ input*.txt   # becomes
>
>     cat input*.txt |
>     grep -F -eX |
>     grep -F -eY |
>     grep -F -eZ
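
The pipeline in workaround #2 can be wrapped in a small shell function so that the patterns are passed as arguments. This is only a sketch: "grep_all" is a hypothetical name, not a grep feature, and it assumes fixed-string (-F) matching on standard input.

```shell
# grep_all: hypothetical helper, not part of grep.
# Reads standard input and prints the lines that contain every
# fixed string given as an argument, by chaining one grep per string.
grep_all() {
    if [ "$#" -eq 0 ]; then
        cat                          # no patterns left: pass everything through
    else
        pattern=$1
        shift
        grep -F -e "$pattern" | grep_all "$@"
    fi
}

# Only the first line contains X, Y and Z.
printf '%s\n' 'Z then Y then X' 'X and Y only' 'nothing' | grep_all X Y Z
```

For several files, "cat input*.txt | grep_all X Y Z" mirrors the pipeline above. Like that pipeline, it loses the per-file names that "grep -H" would print.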




Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (Wed, 16 Oct 2019 12:27:02 GMT) Full text and rfc822 format available.

Message #8 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: "Trent W. Buck" <trentbuck <at> gmail.com>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: Wed, 16 Oct 2019 21:26:21 +0900
On Tue, 15 Oct 2019 12:48:17 +1100
"Trent W. Buck" <trentbuck <at> gmail.com> wrote:

> [...]

Although I do not know whether this has been discussed and/or rejected,
adding the feature to grep would mean implementing an internal conversion
like workaround #1.  However, that scales poorly, as you say, and it would
be slower than workaround #2 in many cases.





Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (Wed, 16 Oct 2019 18:58:01 GMT) Full text and rfc822 format available.

Message #11 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, "Trent W. Buck" <trentbuck <at> gmail.com>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: Wed, 16 Oct 2019 11:57:31 -0700
Wouldn't it be more useful to have an intersection operator in regular 
expressions? That is, the pattern 'A\&B' would match anything that is 
matched by both A and B. If A and B have parenthesized subexpressions, 
both sets of parentheses would match and would count.

Assuming concatenation has higher precedence than \&, the requested 
behavior could be achieved via:

  grep '.*X.*\&.*Y.*\&.*Z.*'

This approach would allow intersection to be nested inside other 
operations. Also, it would clarify how other features work. For example, 
grep -o has clear semantics with this approach, whereas the semantics of 
grep -o are not so clear with the proposed --and option.




Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (Thu, 17 Oct 2019 00:21:01 GMT) Full text and rfc822 format available.

Message #14 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: "Trent W. Buck" <trentbuck <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: Thu, 17 Oct 2019 11:19:54 +1100
Paul Eggert wrote:
> Wouldn't it be more useful to have an intersection operator in regular
> expressions? That is, the pattern 'A\&B' would match anything that is
> matched by both A and B. If A and B have parenthesized subexpressions, both
> sets of parentheses would match and would count.

Not for me personally, because I almost always want to use it with -Fwi :-)

(-F is a lot faster - about as fast as LC_COLLATE=C - and it also means
I don't have to think about escaping special characters.)

> [...]
>
> This approach would allow intersection to be nested inside other operations.
> Also, it would clarify how other features work. For example, grep -o has
> clear semantics with this approach, whereas the semantics of grep -o are not
> so clear with the proposed --and option.

I hadn't thought about -o, and I agree that is not very obvious.

Given an input file like

   30$	Gamdias EROS (M2) USB Multi-Color Lighting Gaming Headset
   30$	Gamdias POSEIDON E1 Gaming Combo 3-in-1 K/B+3200dpi Optical Mouse+Stereo Headset
   30$	GeIL (GP34GB1600C11SC) 4GB DDR3 1600 Desktop RAM
   30$	GeIL Pristine (GP44GB2400C17SC) 4GB Single DDR4 2400 Desktop RAM
   30$	GeIL SO-DIMM 4GB (GGS34GB1600C11SC) 1.35V (Low Voltage) 4GB DDR3 1600 Notebook Ram

Where currently "grep -Fw -e 4GB -e DDR4 -o" prints

    4GB
    4GB
    DDR4
    4GB
    4GB

I would expect "grep -Fw -e 4GB -e DDR4 --and" to print the same thing as

    grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o

i.e.

    4GB
    DDR4
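
The expectation above can be reproduced with today's grep. The sketch below stores the sample lines from this message in a shell variable (the names "sample" and "result" are only illustrative, and the tab after each price is written as a space) and runs the three-stage pipeline.

```shell
# Sample price list copied from the message above.
sample='30$ Gamdias EROS (M2) USB Multi-Color Lighting Gaming Headset
30$ Gamdias POSEIDON E1 Gaming Combo 3-in-1 K/B+3200dpi Optical Mouse+Stereo Headset
30$ GeIL (GP34GB1600C11SC) 4GB DDR3 1600 Desktop RAM
30$ GeIL Pristine (GP44GB2400C17SC) 4GB Single DDR4 2400 Desktop RAM
30$ GeIL SO-DIMM 4GB (GGS34GB1600C11SC) 1.35V (Low Voltage) 4GB DDR3 1600 Notebook Ram'

# Stages 1 and 2 keep only lines containing both words;
# stage 3 prints just the matched words from the surviving lines.
result=$(printf '%s\n' "$sample" |
    grep -Fw -e 4GB |
    grep -Fw -e DDR4 |
    grep -Fw -o -e 4GB -e DDR4)

printf '%s\n' "$result"
```

Only the DDR4 line survives the first two stages, so the final "-o" stage reports matches from that line alone.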




Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (Thu, 17 Oct 2019 08:28:02 GMT) Full text and rfc822 format available.

Message #17 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "Trent W. Buck" <trentbuck <at> gmail.com>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: Thu, 17 Oct 2019 01:27:35 -0700
On 10/16/19 5:19 PM, Trent W. Buck wrote:
> I would expect "grep -Fw -e 4GB -e DDR4 --and" to print the same thing as
> 
>      grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o

You're right, it's not obvious. :-)

It may be better to just pipe greps together, as you do now. That's simple and 
fast enough for this relatively-uncommon case, and it's portable to all greps.




Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (Fri, 18 Oct 2019 11:50:02 GMT) Full text and rfc822 format available.

Message #20 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: "Trent W. Buck" <trentbuck <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: Fri, 18 Oct 2019 22:49:23 +1100
Paul Eggert wrote:
> On 10/16/19 5:19 PM, Trent W. Buck wrote:
> > I would expect "grep -Fw -e 4GB -e DDR4 --and" to print the same thing as
> >
> >      grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o
>
> You're right, it's not obvious. :-)
>
> It may be better to just pipe greps together, as you do now. That's simple
> and fast enough for this relatively-uncommon case, and it's portable to all
> greps.

I admit that most of the time, I want "grep --and" for a small dataset
(<1MB computer_parts.txt), so it's merely a convenience.

Sometimes I grep audit logs (~1TB uncompressed), which takes anywhere
from 15 minutes to 3 days, depending on how I tweak my grep calls.

In that case, each grep in the pipeline has to pay the costs to
de-serialize input from the previous grep, and re-serialize output to
the next grep.  If the first grep matches (say) 200GB of the 1TB,
that can be a lot of overhead (I assume).

I was basically hoping that if it was all in a single grep process,
the de/serialization steps could be skipped completely.
I think the buzzword for that is "zero-copy"?

I've noticed "grep" is about 30% slower than either "grep -F" or
"LC_COLLATE=C grep", because (I think) those two avoid the costs of
decoding from UTF-8 to Unicode and back.  So I was basically expecting a
similar saving from --and.

I'm only speaking as an end user - I haven't dug through the grep
source, so those expectations might be unrealistic, and implementing
it might be painful/impossible.  I figured I should at least ask :-)

If your expert opinion is that it's a pain to implement (and
maintain!) and there's not enough demand, then I'm OK with that.
This is NOT something that's burning me every day.

Regardless, I appreciate you taking the time to discuss it. :-)


PS: Regarding portability, I'm personally not worried because when I
need a GNUism badly enough (e.g. du --threshold), I can usually get
permission to install the relevant GNU software, even if it's only
into %APPDATA% or $HOME.

PS: I noticed on bugs.gnu.org something about grep being
single-threaded, which might mean "grep --and" would end up being
SLOWER than the existing pipelines, since the kernel can distribute
a pipeline's elements across multiple CPUs/cores.




Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (Fri, 18 Oct 2019 17:52:03 GMT) Full text and rfc822 format available.

Message #23 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "Trent W. Buck" <trentbuck <at> gmail.com>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: Fri, 18 Oct 2019 10:51:29 -0700
On 10/18/19 4:49 AM, Trent W. Buck wrote:
> In that case, each grep in the pipeline has to pay the costs to
> de-serialize input from the previous grep

Sure, but grep is designed to be a simple tool and we need to draw the 
line somewhere. For something more complicated there are already sed and 
awk (if you want to write to POSIX) or Perl or Python or whatever.

I mildly prefer the A\&B notation because it could be used 
everywhere, not just in grep. (But of course someone would have to 
implement it. :-)
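
As an illustration of the awk route mentioned above: for regular expressions, "awk '/X/ && /Y/ && /Z/' input*.txt" prints lines matching all three. For fixed strings, a single awk process can do the same with index(). The sketch below is one possible way to write it. "all_of" is a hypothetical helper name, and unlike "grep -Fwi" it is case-sensitive and does not check word boundaries.

```shell
# all_of: hypothetical helper (not a grep or awk built-in).
# Prints the lines of standard input that contain every argument as a
# fixed substring, using a single awk process instead of a grep pipeline.
all_of() {
    awk '
        BEGIN {
            # Copy the search strings out of ARGV, then set ARGC to 1 so
            # awk treats none of them as input files and reads stdin.
            for (i = 1; i < ARGC; i++) want[i] = ARGV[i]
            n = ARGC - 1
            ARGC = 1
        }
        {
            for (i = 1; i <= n; i++)
                if (index($0, want[i]) == 0) next   # a string is missing
            print
        }
    ' "$@"
}

# Only the first line contains all three strings.
printf '%s\n' 'SATA SSD 2TB 2.5in' 'SATA HDD 2TB 3.5in' | all_of SATA 2TB 2.5in
```

The strings are passed through ARGV rather than "-v" so that awk does not interpret backslash escapes in them.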




Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (Fri, 18 Oct 2019 22:37:01 GMT) Full text and rfc822 format available.

Message #26 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Paul Jackson" <pj <at> usa.net>
To: bug-grep <at> gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: Fri, 18 Oct 2019 17:35:38 -0500
I'm currently working on rewriting and packaging up a tool that
I use to handle such high volume [and/or/not] filters on long lists
of file pathnames and of log file entries.  It's a tool I've had in
my private toolbox for decades.  I call it "ftest".   It has a rich set
of "test" like flags for testing stat(2) attributes of files, but is
optimized for working in pipelines (as a filter, hence the "f").

Trent - do you need regular expression matching, or is glob matching
easily sufficient, or would even just fixed string matching be useful?

For [and/or/not] logical combinations of full regular expressions, 
I'll probably continue to use awk, as Paul Eggert suggested, though
that might be because I've long been an awk user, since teaching
an awk class to other engineers inside Bell Labs, some 40 years ago.

Perhaps sometime, months into the future, I'll follow up with an
update pointing to my "ftest" command on github.

-- 
                Paul Jackson
                pj <at> usa.net




Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (Sat, 19 Oct 2019 06:58:01 GMT) Full text and rfc822 format available.

Message #29 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: "Trent W. Buck" <trentbuck <at> gmail.com>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: Sat, 19 Oct 2019 15:57:39 +0900
On Tue, 15 Oct 2019 12:48:17 +1100
"Trent W. Buck" <trentbuck <at> gmail.com> wrote:

> Workaround #1
> =============
> I can work around this by listing every possible order, but 1) this
> scales poorly with the number of patterns; and 2) it can't be used
> with -F.  For example,
>
>     grep --and -eX -eY -eZ input*.txt   # becomes
>
>     grep -eZ.*Y.*X \
>          -eZ.*X.*Y \
>          -eY.*Z.*X \
>          -eY.*X.*Z \
>          -eX.*Z.*Y \
>          -eX.*Y.*Z \
>          input*.txt

I have noticed that the following two commands do not necessarily produce the same results.

    grep --and -e123 -e234 input*.txt

    grep -e '123.*234' -e '234.*123' input*.txt

The line "1234" matches the first but not the second, because 123 and 234 overlap in it.
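
The difference can be checked with existing grep by using a chain of greps to stand in for the proposed --and. This is a small sketch, and the variable names "chained" and "permuted" are only illustrative.

```shell
# Chained greps: the line only has to contain 123 and 234 somewhere,
# so the overlapping occurrence in "1234" is accepted.
chained=$(printf '1234\n' | grep -e 123 | grep -e 234 || true)

# Permutation rewrite (workaround #1): 123 and 234 must appear one after
# the other without sharing characters, so "1234" is rejected.
permuted=$(printf '1234\n' | grep -e '123.*234' -e '234.*123' || true)

printf 'chained:  [%s]\n' "$chained"
printf 'permuted: [%s]\n' "$permuted"
```

The first variable ends up holding "1234" and the second is empty, which is the mismatch described above.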





Information forwarded to bug-grep <at> gnu.org:
bug#37754; Package grep. (Wed, 18 Jan 2023 01:59:01 GMT) Full text and rfc822 format available.

Message #32 received at 37754 <at> debbugs.gnu.org (full text, mbox):

From: "Trent W. Buck" <trentbuck <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 37754 <at> debbugs.gnu.org
Subject: Re: bug#37754: wish for grep --and -eX -eY -eZ (X∩Y∩Z intersection, not X∪Y∪Z union)
Date: Wed, 18 Jan 2023 12:57:53 +1100
On Fri 18 Oct 2019 22:49:23 +1100, Trent W. Buck wrote:
> Paul Eggert wrote:
> > On 10/16/19 5:19 PM, Trent W. Buck wrote:
> > > I would expect "grep -Fw -e 4GB -e DDR4 --and" to print the same thing as
> > >
> > >      grep -Fw 4GB | grep -Fw DDR4 | grep -Fw -e 4GB -e DDR4 -o
> >
> > You're right, it's not obvious. :-)
> >
> > It may be better to just pipe greps together, as you do now. That's simple
> > and fast enough for this relatively-uncommon case, and it's portable to all
> > greps.
> 
> I admit that most of the time, I want "grep --and" for a small dataset
> (<1MB computer_parts.txt), so it's merely a convenience.

I noticed I forgot to attach a helper script I've been using for decades.
Here it is.
[foldr.sh (application/x-sh, attachment)]

This bug report was last modified 1 year and 121 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.