GNU bug report logs - #16871
problems about matching newline (with -z)

Package: grep

Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>

Date: Tue, 25 Feb 2014 07:33:01 UTC

Severity: wishlist

To reply to this bug, email your comments to 16871 AT debbugs.gnu.org.

Report forwarded to bug-grep <at> gnu.org:
bug#16871; Package grep. (Tue, 25 Feb 2014 07:33:02 GMT)

Acknowledgement sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 25 Feb 2014 07:33:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: problems about matching newline (with -z)
Date: Tue, 25 Feb 2014 07:32:18 +0000

The doc has a confusing statement:

> 15. How can I match across lines?
>
>    Standard grep cannot do this, as it is fundamentally line-based.
>    Therefore, merely using the '[:space:]' character class does not
>    match newlines in the way you might expect.  However, if your grep
>    is compiled with Perl patterns enabled, the Perl 's' modifier
>    (which makes '.' match newlines) can be used:
>
>         printf 'foo\nbar\n' | grep -P '(?s)foo.*?bar'
>
>    With the GNU 'grep' option '-z' (*note File and Directory
>    Selection::), the input is terminated by null bytes.  Thus, you can
>    match newlines in the input, but the output will be the whole file,
>    so this is really only useful to determine if the pattern is
>    present:
>
>         printf 'foo\nbar\n' | grep -z -q 'foo[[:space:]]\+bar'
>
>    Failing either of those options, you need to transform the input
>    before giving it to 'grep', or turn to 'awk', 'sed', 'perl', or
>    many other utilities that are designed to operate across lines.

printf 'foo\nbar\n' | grep -P '(?s)foo.*?bar'

This will never match, because grep is line-based even with -P. -P
doesn't help here; it only makes things harder, as you need that (?s).

printf 'foo\nbar\n\0' | grep -z 'foo.*bar'

would match.
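
The difference between the two modes can be checked directly. The following is a minimal sketch, assuming a grep that supports -z (GNU grep does); the variable names line_mode and null_mode are only illustrative. It uses the '[[:space:]]\+' pattern from the documentation excerpt above rather than -P, so it does not depend on PCRE support.

```shell
# Default mode: records are lines, so "foo" and "bar" are never part of
# the same record and the pattern cannot span them.
if printf 'foo\nbar\n' | grep -q 'foo[[:space:]]\+bar'; then
  line_mode=match
else
  line_mode=nomatch
fi

# -z mode: records end at NUL bytes, so the newline between "foo" and
# "bar" is ordinary data inside a single record.
if printf 'foo\nbar\n' | grep -z -q 'foo[[:space:]]\+bar'; then
  null_mode=match
else
  null_mode=nomatch
fi

echo "line mode: $line_mode"
echo "-z mode:   $null_mode"
```

Only the exit status is used here, which sidesteps the point made in the documentation that -z prints the whole record when it matches.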

Same confusion in tests/pcre:

> #! /bin/sh
> # Ensure that with -P, \s*$ matches a newline.
> #
> # Copyright (C) 2001, 2006, 2009-2014 Free Software Foundation, Inc.
> #
> # Copying and distribution of this file, with or without modification,
> # are permitted in any medium without royalty provided the copyright
> # notice and this notice are preserved.
> 
> . "${srcdir=.}/init.sh"; path_prepend_ ../src
> require_pcre_
> 
> fail=0
> 
> # See CVS revision 1.32 of "src/search.c".
> echo | grep -P '\s*$' || fail=1
> 
> Exit $fail

'\s*$' doesn't match a newline, but an empty string.

You need echo | grep -zP '\s' to match the newline.
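
The same point can be shown without -P. This is a rough sketch, assuming a grep with -z support, and it swaps in the POSIX class '[[:space:]]' for PCRE's '\s' so that it runs on builds without PCRE; the variable names are made up for the example.

```shell
# Line mode: the newline written by echo is the record terminator, so the
# record that grep sees is the empty string.
empty_line=$(echo | grep -c '^$' || true)           # empty record matches ^$
space_line=$(echo | grep -c '[[:space:]]' || true)  # no whitespace left to match

# -z mode: the newline is data inside the record, so a whitespace class
# can actually match it.
space_null=$(echo | grep -z -c '[[:space:]]' || true)

echo "line mode, ^\$:          $empty_line"
echo "line mode, [[:space:]]: $space_line"
echo "-z mode,   [[:space:]]: $space_null"
```

The '|| true' is only there because grep -c exits non-zero when the count is 0.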

Also:

We can match a newline with grep -zP 'a\nb' (or '\x0a' or '\012'
or '[\n]'...) but not easily without -P. Same for NUL
characters.
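
One reason this is awkward without -P: a literal newline inside the pattern argument is read as a separator between patterns, not as a character to match, and -z only changes how the input is split. A small sketch of that behaviour, assuming a grep with -z support (the variable names pattern and matched are illustrative):

```shell
# The pattern string contains a literal newline. grep treats it as a
# list of two patterns, "a" and "b", rather than "a, newline, b".
pattern=$(printf 'a\nb')

# Three NUL-terminated records: "a", "b" and "c". Count the records
# selected by the pattern list.
matched=$(printf 'a\0b\0c\0' | grep -z -c "$pattern" || true)

echo "records matched: $matched"
```

Records "a" and "b" are each selected by one of the two patterns, even though neither contains a newline.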

Without -P, the only way I could think of was with
[^\0-\011\013-\377], but that would only work for single-byte
locales, and you can't pass a nul character on the command line,
so it would have to be with -f but:

$ printf 'a\nb\0' | LC_ALL=C grep -zf <(LC_ALL=C printf 'a[^\0-\011\013-\377]b')
zsh: done                printf 'a\nb\0' |
zsh: segmentation fault  LC_ALL=C grep -zf <(LC_ALL=C printf 'a[^\0-\011\013-\377]b')

Having said that:

grep -z $'a[^\01-\011\013-\0377]b'

would work (in single-byte locales), because NUL cannot appear in
the input records: it is the delimiter.

and grep -a $'[^\01-\0377]' can match nul (in single-byte
locales).

But it would be handy to be able to do the same as with -P.

-- 
Stephane




Information forwarded to bug-grep <at> gnu.org:
bug#16871; Package grep. (Tue, 25 Feb 2014 11:34:01 GMT)

Message #8 received at 16871 <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: 16871 <at> debbugs.gnu.org
Subject: Re: bug#16871: Acknowledgement (problems about matching newline
 (with -z))
Date: Tue, 25 Feb 2014 11:32:43 +0000

Also:

$ printf 'a\nb\0' | grep -z 'a$'
$ printf 'a\nb\0' | grep -zP 'a$'
a
b
$ printf 'a\nb\0' | grep -zxP a
a
b

Why use PCRE_MULTILINE here?




Information forwarded to bug-grep <at> gnu.org:
bug#16871; Package grep. (Fri, 25 Apr 2014 04:28:02 GMT)

Message #11 received at 16871 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>, 
 16871 <at> debbugs.gnu.org
Subject: Re: bug#16871: problems about matching newline (with -z)
Date: Thu, 24 Apr 2014 21:27:38 -0700

Stephane Chazelas wrote:
> The doc has a confusing statement ... Same confusion in tests/pcre:

Thanks, I installed the attached patch to fix those.

> We can match a newline with grep -zP 'a\nb' (or '\x0a' or '\012'
> or '[\n]'...) but not easily without -P. Same for NUL
> characters.

Yes, that's a downside of the POSIX notation, and it'd be nice to extend 
POSIX to allow easy matching for newlines and/or null bytes.  I'll mark 
this bug report as a wishlist bug.

[0001-misc-fix-doc-and-test-bugs-re-grep-z.patch (text/plain, attachment)]

Severity set to 'wishlist' from 'normal'. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Fri, 25 Apr 2014 04:29:01 GMT)

Information forwarded to bug-grep <at> gnu.org:
bug#16871; Package grep. (Fri, 18 Nov 2016 17:41:01 GMT)

Message #16 received at 16871 <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: 16871 <at> debbugs.gnu.org
Subject: doc/test confusions with grep -P
Date: Fri, 18 Nov 2016 17:40:44 +0000

For the record, the doc/test confusion was fixed by commit
b73296ace186451b096b075461634c153d1fa525
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=b73296ace186451b096b075461634c153d1fa525

See also https://debbugs.gnu.org/cgi/bugreport.cgi?bug=22655#47
and below about PCRE_MULTILINE.




This bug report was last modified 7 years and 185 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.