GNU bug report logs - #18266
grep -P and invalid exits with error


Package: grep;

Reported by: Santiago <santiago <at> debian.org>

Date: Thu, 14 Aug 2014 15:43:02 UTC

Severity: wishlist

Merged with 18455

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Report forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Thu, 14 Aug 2014 15:43:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Santiago <santiago <at> debian.org>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Thu, 14 Aug 2014 15:43:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Santiago <santiago <at> debian.org>
To: bug-grep <at> gnu.org
Cc: 758105 <at> bugs.debian.org
Subject: grep -P and invalid exits with error 
Date: Thu, 14 Aug 2014 17:42:57 +0200
Hi,

Please, revert ca7868cc27db3d9deafaa2e0ac5a2bb0aa8ef373

That commit (re)introduced a regression bug (See http://debbugs.gnu.org/15758).
pcresearch again checks whether the input is valid UTF-8. The problem is that
binary files are not valid UTF-8, so grep -P, in Unicode locales, exits
with an error:

LANG=en_US.UTF-8 grep -P -r x /usr/bin/
grep: invalid UTF-8 byte sequence in input



printf 'j\x82\nj\n'|LC_ALL=en_US.UTF-8 grep -P j|cat -A; echo $?
grep: invalid UTF-8 byte sequence in input
0

should be:
printf 'j\x82\nj\n'|LC_ALL=en_US.UTF-8 src/grep -P j|cat -A; echo $?
jM-^B$
j$
0

Tested on Debian and Archlinux with pcre 8.35.

Thanks,

Santiago
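
The sample input above fails the validity check because 0x82 is a UTF-8 continuation byte, which is only valid after a lead byte. A minimal Python sketch of locating that byte follows. It uses Python's own UTF-8 decoder as a stand-in for the libpcre check, which is an assumption: the two validators may disagree on edge cases. The helper name `first_invalid_offset` is hypothetical.

```python
def first_invalid_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first offending byte.
        return err.start
    return None


sample = b"j\x82\nj\n"  # same bytes as: printf 'j\x82\nj\n'
print(first_invalid_offset(sample))   # offset of the stray 0x82 byte
print(first_invalid_offset(b"j\n"))   # a clean line has no invalid byte
```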





Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Thu, 14 Aug 2014 16:17:01 GMT) Full text and rfc822 format available.

Message #8 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Santiago <santiago <at> debian.org>, 18266 <at> debbugs.gnu.org
Cc: 758105 <at> bugs.debian.org
Subject: Re: bug#18266: grep -P and invalid exits with error
Date: Thu, 14 Aug 2014 09:15:58 -0700
Santiago wrote:
> Please, revert ca7868cc27db3d9deafaa2e0ac5a2bb0aa8ef373

That commit was necessary to avoid undefined behavior in libpcre.  We 
can't simply undo the commit (unless you want to reintroduce security 
holes into grep :-).  The current behavior is the best we can do, unless 
someone fixes libpcre (which doesn't appear to be likely), or unless 
someone takes the time to write code in grep to work around the problem.

One way forward is suggested in <http://bugs.gnu.org/17245#43>.  No 
doubt there are others.  Can you suggest a volunteer to take this on?




Severity set to 'wishlist' from 'normal' Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Thu, 14 Aug 2014 16:23:02 GMT) Full text and rfc822 format available.


Message #13 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 758105 <at> bugs.debian.org
Cc: Santiago <santiago <at> debian.org>, 18266 <at> debbugs.gnu.org
Subject: Re: Bug#758105: bug#18266: grep -P and invalid exits with error
Date: Thu, 14 Aug 2014 19:44:42 +0200
On 2014-08-14 09:15:58 -0700, Paul Eggert wrote:
> That commit was necessary to avoid undefined behavior in libpcre.  We can't
> simply undo the commit (unless you want to reintroduce security holes into
> grep :-).  The current behavior is the best we can do, unless someone fixes
> libpcre (which doesn't appear to be likely), or unless someone takes the
> time to write code in grep to work around the problem.
> 
> One way forward is suggested in <http://bugs.gnu.org/17245#43>.  No doubt
> there are others.  Can you suggest a volunteer to take this on?

Discarding input lines with invalid UTF-8 sequences is not OK.
IMHO, it would be better to replace invalid UTF-8 sequences by
zero bytes before passing them to libpcre. Is it allowed to do
that in Pexecute()?

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)





Message #16 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>, 758105 <at> bugs.debian.org
Cc: Santiago <santiago <at> debian.org>, 18266 <at> debbugs.gnu.org
Subject: Re: Bug#758105: bug#18266: grep -P and invalid exits with error
Date: Thu, 14 Aug 2014 11:19:28 -0700
Vincent Lefevre wrote:

> it would be better to replace invalid UTF-8 sequences by
> zero bytes before passing them to libpcre. Is it allowed to do
> that in Pexecute()?

Sorry, I don't know.  I was hoping that the volunteer (whoever it is) 
could figure all this stuff out.

grep should work correctly even if the input contains NUL bytes, so 
perhaps it would be better to replace an invalid byte by the UTF-8 
sequence for U+FFFD REPLACEMENT CHARACTER, as that's one standard way to 
deal with this problem.  Or perhaps the volunteer will have a better idea.





Message #19 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>,
 758105 <at> bugs.debian.org
Subject: Re: Bug#758105: bug#18266: grep -P and invalid exits with error
Date: Thu, 14 Aug 2014 22:11:48 +0200
On 2014-08-14 11:19:28 -0700, Paul Eggert wrote:
> grep should work correctly even if the input contains NUL bytes, so perhaps
> it would be better to replace an invalid byte by the UTF-8 sequence for
> U+FFFD REPLACEMENT CHARACTER, as that's one standard way to deal with this
> problem.  Or perhaps the volunteer will have a better idea.

The problem with this solution is that it would change the length
of the text, while replacing invalid bytes by zero bytes could be
done in place (if allowed), with very little change of the code,
I think.
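
The length argument can be made concrete with a small sketch. It is illustrative Python, not grep code, and the sample line and offset are made up: overwriting the invalid byte with a NUL keeps the buffer length and all later offsets, whereas substituting U+FFFD puts three bytes (EF BF BD) where one used to be.

```python
line = bytearray(b"j\x82k")  # made-up sample line with one invalid byte
bad = 1                      # offset of the invalid byte

# In-place replacement with a NUL byte: length and later offsets are unchanged.
in_place = bytearray(line)
in_place[bad] = 0

# Replacement with U+FFFD: its UTF-8 encoding is three bytes long.
fffd = "\ufffd".encode("utf-8")
expanded = bytes(line[:bad]) + fffd + bytes(line[bad + 1:])

print(len(line), len(in_place), len(expanded))
```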






Message #22 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>,
 758105 <at> bugs.debian.org
Subject: Re: Bug#758105: bug#18266: grep -P and invalid exits with error
Date: Thu, 14 Aug 2014 13:13:45 -0700
Vincent Lefevre wrote:
> The problem with this solution is that it would change the length
> of the text, while replacing invalid bytes by zero bytes could be
> done in place (if allowed), with very little change of the code,
> I think.

True.  Though it might be more user-friendly to use '?' as the 
replacement byte.





Message #25 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>,
 758105 <at> bugs.debian.org
Subject: Re: Bug#758105: bug#18266: grep -P and invalid exits with error
Date: Thu, 14 Aug 2014 23:03:50 +0200
On 2014-08-14 13:13:45 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >The problem with this solution is that it would change the length
> >of the text, while replacing invalid bytes by zero bytes could be
> >done in place (if allowed), with very little change of the code,
> >I think.
> 
> True. Though it might be more user-friendly to use '?' as the
> replacement byte.

On output, yes (though in most cases, non-printable characters are
probably seen as garbage and don't really matter); and when the lines
are not printed, this doesn't matter.

On input, using null bytes may be better if one wants to be able to
match real replacement characters without false positives. Matching
null bytes is not common, AFAIK.
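
The false-positive concern can be sketched as follows. This is illustrative Python under stated assumptions: `replace_invalid` is a hypothetical helper, Python's decoder stands in for the UTF-8 check, and the sample line is made up. A pattern looking for a literal '?' matches a line whose invalid byte was rewritten to '?', but not one rewritten to a NUL byte.

```python
import re


def replace_invalid(data: bytes, filler: bytes) -> bytes:
    """Replace each undecodable byte with a one-byte filler (hypothetical helper)."""
    out = bytearray()
    rest = data
    while rest:
        try:
            rest.decode("utf-8")
            out += rest          # the remainder is valid, keep it as-is
            break
        except UnicodeDecodeError as err:
            # Keep the valid prefix, overwrite the offending byte(s).
            out += rest[:err.start] + filler * (err.end - err.start)
            rest = rest[err.end:]
    return bytes(out)


line = b"a\x82b"
literal_qmark = re.compile(rb"a\?b")  # searches for a real '?' between a and b

print(bool(literal_qmark.search(replace_invalid(line, b"?"))))   # matches: a false positive
print(bool(literal_qmark.search(replace_invalid(line, b"\0"))))  # does not match
```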






Message #28 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with
 error
Date: Thu, 14 Aug 2014 14:33:36 -0700
Vincent Lefevre wrote:
> On input, using null bytes may be better if one wants to be able to
> match real replacement characters without false positives.

Maybe, though this is no place to get fancy.  It's simple to tell users 
"an invalid byte acts like '?'".  Simple is good.

Anyway, this is a matter for the implementing volunteer to decide, 
whoever that happens to be.





Message #31 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Santiago <santiago <at> debian.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 758105 <at> bugs.debian.org
Cc: 18266 <at> debbugs.gnu.org, Vincent Lefevre <vincent <at> vinc17.net>
Subject: Re: Bug#758105: bug#18266: Bug#758105: bug#18266: grep -P and
 invalid exits with error
Date: Sat, 16 Aug 2014 16:01:27 +0200
On 14/08/14 at 14:33, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >On input, using null bytes may be better if one wants to be able to
> >match real replacement characters without false positives.
> 
> Maybe, though this is no place to get fancy.  It's simple to tell users "an
> invalid byte acts like '?'".  Simple is good.
> 
> Anyway, this is a matter for the implementing volunteer to decide, whoever
> that happens to be.
> 

Workaround attached. It's too slow against binary files, but I haven't
found a simpler solution.

What do you think?

Santiago
[grep-pcresearch-clean-utf8-1.patch (text/x-diff, attachment)]

Message #34 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Santiago <santiago <at> debian.org>
Cc: 18266 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>,
 758105 <at> bugs.debian.org
Subject: Re: Bug#758105: bug#18266: Bug#758105: bug#18266: grep -P and
 invalid exits with error
Date: Sat, 16 Aug 2014 18:26:21 +0200
On 2014-08-16 16:01:27 +0200, Santiago wrote:
> Workaround attached. It's too slow against binary files, but I haven't
> found a simpler solution.

To avoid the slowness, I think that it would be better to detect
(directly, not via PCRE) invalid UTF-8 sequences and replace them
by null bytes *in-place*.

It might slow down the general case, though. However I'm not sure,
because if the UTF8 validity check (via the replacement of invalid
sequences) is done in grep, it doesn't need to be done in PCRE.






Message #37 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Santiago <santiago <at> debian.org>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>,
 758105 <at> bugs.debian.org
Subject: Re: Bug#758105: bug#18266: Bug#758105: bug#18266: grep -P and
 invalid exits with error
Date: Sat, 16 Aug 2014 19:56:37 +0200
On 16/08/14 at 18:26, Vincent Lefevre wrote:
> On 2014-08-16 16:01:27 +0200, Santiago wrote:
> > Workaround attached. It's too slow against binary files, but I haven't
> > found a simpler solution.
> 
> To avoid the slowness, I think that it would be better to detect
> (directly, not via PCRE) invalid UTF-8 sequences and replace them
> by null bytes *in-place*.
> 
> It might slow down the general case, though. However I'm not sure,
> because if the UTF8 validity check (via the replacement of invalid
> sequences) is done in grep, it doesn't need to be done in PCRE.
> 

I think that would require similar work to replace the "invalid" content
from binary files.

Another solution would be not to check whether binary files are valid
(by passing PCRE_NO_UTF8_CHECK to pcre_exec), but I don't know whether that
would avoid security holes, and I don't know how to do it either.

Regards,

Santiago





Message #40 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Santiago <santiago <at> debian.org>, Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: Bug#758105: bug#18266: Bug#758105: bug#18266: grep -P and invalid
 exits with error
Date: Sat, 16 Aug 2014 11:36:28 -0700
Santiago wrote:
> Another solution would be not to check whether binary files are valid
> (passing PCRE_NO_UTF8_CHECK to pcre_exec), but I don't know if that'd
> avoid security holes

It wouldn't.  (We already tried it.)





Message #43 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Santiago <santiago <at> debian.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 758105 <at> bugs.debian.org
Cc: 18266 <at> debbugs.gnu.org, Vincent Lefevre <vincent <at> vinc17.net>
Subject: Re: grep -P and invalid exits with error
Date: Thu, 28 Aug 2014 22:47:54 -0700
On 16/08/14 at 11:36, Paul Eggert wrote:
> Santiago wrote:
> >Another solution would be not to check whether binary files are valid
> >(passing PCRE_NO_UTF8_CHECK to pcre_exec), but I don't know if that'd
> >avoid security holes
> 
> It wouldn't.  (We already tried it.)
> 

Another try. This patch is by far more efficient.
With the previous patch #1:

 % time grep -P faz /usr/bin/*                                            
 ...
 grep: /usr/bin/X11: Is a directory
 grep -P faz /usr/bin/*  519,78s user 0,32s system 99% cpu 8:41,19 total

 With this one:

  % time src/grep -P faz /usr/bin/*
  src/grep -P faz /usr/bin/*  7,36s user 0,33s system 99% cpu 7,695 total

Cheers,

Santiago
[grep-pcresearch-clean-utf8-2.patch (text/x-diff, attachment)]

Message #46 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Santiago <santiago <at> debian.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 758105 <at> bugs.debian.org
Cc: 18266 <at> debbugs.gnu.org, Vincent Lefevre <vincent <at> vinc17.net>
Subject: Re: bug#18266: grep -P and invalid exits with error
Date: Fri, 29 Aug 2014 06:58:17 -0600
On 08/28/2014 11:47 PM, Santiago wrote:
> On 16/08/14 at 11:36, Paul Eggert wrote:
>> Santiago wrote:
>>> Another solution would be not to check whether binary files are valid
>>> (passing PCRE_NO_UTF8_CHECK to pcre_exec), but I don't know if that'd
>>> avoid security holes
>>
>> It wouldn't.  (We already tried it.)
>>
> Another try. This patch is by far more efficient.

> * src/pcresearch.c (Pexecute): When pcre_exec returns an invalid
> UTF8 character error, copies line_buf to an auxiliar buffer,

s/auxiliar/auxiliary/

> removes invalid characters and evaluates against it.
> * tests/pcre-infloop: Exit status is 1 again.
> * tests/pcre-invalid-utf8-input: Check again if grep doesn't
> abort. Also cheks for match after a second invalid character

s/cheks/checks/


> +          /* Change invalid UTF-8 characters (according to pcre_exec) to '\0' */
> +          while (e == PCRE_ERROR_BADUTF8){

Space before {

> +            line_utf8_clean[sub[0]+invalid_pos] = '\0';

Spaces around +

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Message #49 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Santiago <santiago <at> debian.org>, 758105 <at> bugs.debian.org
Cc: 18266 <at> debbugs.gnu.org, Vincent Lefevre <vincent <at> vinc17.net>
Subject: Re: grep -P and invalid exits with error
Date: Fri, 29 Aug 2014 06:43:45 -0700
Thanks, but that patch seems to depend on libpcre internals, in that it 
"knows" that pcre_exec cannot possibly succeed without first checking 
its entire input buffer for invalid UTF-8 bytes.  Even if that's true 
now, it reflects a performance bug that might be fixed in a future 
libpcre version.

Also, I don't see why grep needs to copy the buffer when there's an 
encoding error.  Why not simply rerun the matcher on the initial prefix 
that doesn't have an encoding-error byte, and then (if that doesn't find 
a match), try matching the suffix after the encoding-error byte?  This 
approach would not only avoid the buffer copy, it would avoid knowledge 
of libpcre internals.
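
The prefix-then-suffix idea might look roughly like this. It is a Python sketch, not the grep implementation: `search_valid_parts` is a hypothetical name, Python's `re` module and UTF-8 decoder stand in for libpcre, and anchors such as '^' are deliberately not handled.

```python
import re


def search_valid_parts(pattern: str, line: bytes) -> bool:
    """Search each validly encoded stretch of `line`, skipping encoding errors."""
    regex = re.compile(pattern)
    rest = line
    while True:
        try:
            # No encoding error left: one ordinary search finishes the job.
            return regex.search(rest.decode("utf-8")) is not None
        except UnicodeDecodeError as err:
            # Everything before err.start decoded cleanly, so search it first.
            prefix = rest[:err.start].decode("utf-8")
            if regex.search(prefix):
                return True
            # Otherwise continue with the suffix after the bad byte(s).
            rest = rest[err.end:]


print(search_valid_parts("j", b"j\x82"))
print(search_valid_parts("x", b"j\x82"))
```

No buffer is copied or modified here, and a match can never span an encoding error, since the bad byte is never part of any searched stretch.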





Message #52 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>,
 758105 <at> bugs.debian.org
Subject: Re: grep -P and invalid exits with error
Date: Mon, 1 Sep 2014 10:18:22 +0200
On 2014-08-29 06:43:45 -0700, Paul Eggert wrote:
> Thanks, but that patch seems to depend on libpcre internals, in that it
> "knows" that pcre_exec cannot possibly succeed without first checking its
> entire input buffer for invalid UTF-8 bytes.  Even if that's true now, it
> reflects a performance bug that might be fixed in a future libpcre version.

If I understand correctly, I don't think that's an internal.
The pcreapi(3) man page says about PCRE_NO_UTF8_CHECK:

      [...] Note that this option can also be passed to pcre_exec()
      and pcre_dfa_exec(), to suppress the validity checking of
      subject strings only. If the same string is being matched
      many times, the option can be safely set for the second and
      subsequent matchings to improve performance.

The last sentence would imply that the UTF8 checking is done on the
whole input buffer before matching is done.

> Also, I don't see why grep needs to copy the buffer when there's an encoding
> error.  Why not simply rerun the matcher on the initial prefix that doesn't
> have an encoding-error byte, and then (if that doesn't find a match), try
> matching the suffix after the encoding-error byte?  This approach would not
> only avoid the buffer copy, it would avoid knowledge of libpcre internals.

If there are many invalid UTF8 bytes, this would be slow, IMHO (it
could be worth a try, though).

But is the copy of the buffer really needed? Couldn't the invalid
UTF8 sequences just be replaced by null bytes?

Note that in case of invalid UTF8 bytes, in some (many?) cases, the
cause is a binary file (possibly with some text in it), where lines
can be very long. So, wouldn't it mean that it can take significantly
more memory?






Message #55 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>,
 758105 <at> bugs.debian.org
Subject: Re: grep -P and invalid exits with error
Date: Mon, 01 Sep 2014 01:31:53 -0700
Vincent Lefevre wrote:

>        [...] Note that this option can also be passed to pcre_exec()
>        and pcre_dfa_exec(), to suppress the validity checking of
>        subject strings only. If the same string is being matched
>        many times, the option can be safely set for the second and
>        subsequent matchings to improve performance.
>
> The last sentence would imply that the UTF8 checking is done on the
> whole input buffer before matching is done.

That's pretty subtle, and perhaps too subtle.  A plausible 
interpretation of the phrase "same string is being matched" is that 
libpcre checks only the matched string, and that bytes after the match 
(which did not need to be examined to do the match) are not checked. 
Can you confirm with the libpcre authors that this plausible 
interpretation is incorrect, i.e., that the entire input string is 
checked, even the unmatched part?  If that's what is intended, the 
documentation should state so clearly, so at least there's a 
documentation bug there.

> If there are many invalid UTF8 bytes, this would be slow, IMHO

That's OK.  We don't need grep -P to be fast on invalid input.

> But is the copy of the buffer really needed? Couldn't the invalid
> UTF8 sequences just be replaced by null bytes?

I'd rather not, because that changes the semantics of matching.  The 
null byte is valid input data that might get matched.

> in case of invalid UTF8 bytes, in some (many?) cases, the
> cause is a binary file (possibly with some text in it), where lines
> can be very long. So, wouldn't it mean that it can take significantly
> more memory?

Sure.  But that's the same for -P as it is for plain grep.





Message #58 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Santiago <santiago <at> debian.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 758105 <at> bugs.debian.org, Vincent Lefevre <vincent <at> vinc17.net>,
 18266 <at> debbugs.gnu.org
Subject: Re: grep -P and invalid exits with error
Date: Tue, 9 Sep 2014 04:44:43 +0200
Patch updated.  Paul, thanks for the previous comments. As you
suggested, the attached patch doesn't copy the buffer and splits the
input when it finds an invalid character.

For the moment, I don't see a cleaner way to avoid the pcre internals.

Regards,

Santiago
[grep-pcresearch-clean-utf8-3.patch (text/x-diff, attachment)]

Message #61 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Santiago <santiago <at> debian.org>
Cc: 758105 <at> bugs.debian.org, Paul Eggert <eggert <at> cs.ucla.edu>,
 Vincent Lefevre <vincent <at> vinc17.net>, 18266 <at> debbugs.gnu.org
Subject: Re: bug#18266: grep -P and invalid exits with error
Date: Wed, 10 Sep 2014 00:40:53 +0900
I'm worried that re-running the matcher on invalid UTF-8 will slow down
searches over large numbers of binary files.







Message #64 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Santiago <santiago <at> debian.org>
Cc: 758105 <at> bugs.debian.org, Vincent Lefevre <vincent <at> vinc17.net>,
 18266 <at> debbugs.gnu.org
Subject: Re: bug#18266: grep -P and invalid exits with error
Date: Tue, 09 Sep 2014 12:59:27 -0700
Norihiro Tanaka wrote:
> I'm worried that re-running the matcher on invalid UTF-8 will slow down
> searches over large numbers of binary files.

Yes, that could be a problem, but even so it's better for grep to report 
matches than to give up and fail.  Perhaps someone could optimize this 
better later, but to be honest given how flaky libpcre is we're probably 
better off spending our scarce development resources elsewhere.

Santiago's latest patch still had some troubles, unfortunately.  It 
could mishandle '^' by having it match just past an encoding error.  It 
was less efficient than it could be, as it checked all valid bytes for 
UTF-8-edness twice.  If I understand PCRE correctly (which quite 
possibly I don't), it also appeared to mishandle matches that contain 
nested subexpressions.  But the worst part was that the code was too 
complicated (and this was true even before Santiago's patch was 
applied).  So I rewrote it and installed the attached patch instead. 
Please give it a try.
[0001-grep-P-now-treats-invalid-UTF-8-input-as-non-matchin.patch (text/plain, attachment)]
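
The '^' problem mentioned above can be reproduced in a sketch. This is illustrative Python and not the installed patch, whose contents are not shown here. The helper names are hypothetical, and the guard shown (prepending a sentinel character and starting the search at index 1, where Python's '^' does not match) is only one possible approach, chosen because Python's `re` has no direct equivalent of a "not at beginning of line" flag.

```python
import re


def valid_stretches(line: bytes):
    """Yield (is_first, text) for each validly encoded stretch of `line`."""
    rest, first = line, True
    while True:
        try:
            text, done = rest.decode("utf-8"), True
        except UnicodeDecodeError as err:
            text, done = rest[:err.start].decode("utf-8"), False
            rest = rest[err.end:]
        yield first, text
        if done:
            return
        first = False


def naive_search(pattern: str, line: bytes) -> bool:
    """Treat every stretch as a whole line: '^' wrongly matches after an error."""
    regex = re.compile(pattern)
    return any(regex.search(text) for _, text in valid_stretches(line))


def anchor_aware_search(pattern: str, line: bytes) -> bool:
    """Only the first stretch may match at the true start of the line."""
    regex = re.compile(pattern)
    for first, text in valid_stretches(line):
        if first:
            found = regex.search(text)
        else:
            # Sentinel at index 0; searching from index 1 keeps '^' from matching.
            found = regex.search("\ufffd" + text, 1)
        if found:
            return True
    return False


print(naive_search("^a", b"\x80ab"))         # reports a match just past the error
print(anchor_aware_search("^a", b"\x80ab"))  # does not
```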


Message #67 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 758105 <at> bugs.debian.org, Santiago <santiago <at> debian.org>,
 Vincent Lefevre <vincent <at> vinc17.net>, 18266 <at> debbugs.gnu.org
Subject: Re: bug#18266: grep -P and invalid exits with error
Date: Wed, 10 Sep 2014 08:39:10 +0900
I see that the new version produces no output for the following test,
which was used previously.

    printf '\x80ab\n' | env LC_ALL=en_US.utf8 src/grep -P '.?b'






Message #70 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 758105 <at> bugs.debian.org, Santiago <santiago <at> debian.org>,
 Vincent Lefevre <vincent <at> vinc17.net>, 18266 <at> debbugs.gnu.org
Subject: Re: bug#18266: grep -P and invalid exits with error
Date: Tue, 09 Sep 2014 17:00:51 -0700
Norihiro Tanaka wrote:
> I see that the new version produces no output for the following test,
> which was used previously.
>
>      printf '\x80ab\n' | env LC_ALL=en_US.utf8 src/grep -P '.?b'
>

Thanks for reporting that.  The test case works for me (Fedora 20 
x86-64, GCC 4.9.1):

$ printf '\x80ab\n' | env LC_ALL=en_US.utf8 src/grep -P '.?b' | od -c
0000000 200   a   b  \n
0000004

Fedora 20 is using pcre version 8.33-6.fc20; perhaps there's a PCRE 
version dependency here?  Can you use GDB to put a breakpoint on 
pcre_exec and see what values it's returning, and what it's storing into 
sub[0] and sub[1]?  Here's what I see (I compiled grep with '-g3 -O0'):

$ printf '\x80ab\n' >in
$ gdb src/grep
...
(gdb) b pcre_exec
...
(gdb) r -P '.?b' in
...
(gdb) fin
...
(gdb) n
...
(gdb) p e
$1 = -10
(gdb) c
...
(gdb) fin
...
(gdb) n
...
(gdb) p e
$2 = -1
(gdb) c
...
(gdb) fin
...
(gdb) n
...
(gdb) p e
$3 = 1
(gdb) p sub[0]
$4 = 0
(gdb) p sub[1]
$5 = 2
(gdb) p p
$6 = 0x62f001 "ab\n"
(gdb) p buf
$7 = 0x62f000 "\200ab\n"


That is, the first call to pcre_exec reports the encoding error, the 
second one (on the empty string) reports no match, and the third one (on 
"ab") finds the match.

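The sequence of three return values in the gdb session can be mimicked in a short sketch. This is Python, with `fake_pcre_exec` as a hypothetical stand-in for pcre_exec. The two constants mirror the values printed above (-1 and -10), and the return value 1 assumes a pattern without capture groups.

```python
import re

# Same numeric values as printed by gdb above.
PCRE_ERROR_NOMATCH = -1
PCRE_ERROR_BADUTF8 = -10


def fake_pcre_exec(regex, subject: bytes):
    """Hypothetical stand-in for pcre_exec: returns (code, span, error_range)."""
    try:
        text = subject.decode("utf-8")
    except UnicodeDecodeError as err:
        return PCRE_ERROR_BADUTF8, None, (err.start, err.end)
    m = regex.search(text)
    if m is None:
        return PCRE_ERROR_NOMATCH, None, None
    return 1, m.span(), None  # 1 assumes no capture groups in the pattern


def trace(pattern: str, line: bytes):
    """Return the list of return codes and the final match span, if any."""
    regex = re.compile(pattern)
    codes, rest = [], line
    while True:
        code, span, bad = fake_pcre_exec(regex, rest)
        codes.append(code)
        if code != PCRE_ERROR_BADUTF8:
            return codes, span
        # Retry on the valid prefix, then continue after the bad byte(s).
        code, span, _ = fake_pcre_exec(regex, rest[:bad[0]])
        codes.append(code)
        if code > 0:
            return codes, span
        rest = rest[bad[1]:]


print(trace(".?b", b"\x80ab"))
```

For this input the sketch makes three calls: one on the whole buffer, one on the empty prefix before the bad byte, and one on "ab", where the reported span is relative to that final stretch.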




Message #73 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 758105 <at> bugs.debian.org, Santiago <santiago <at> debian.org>,
 Vincent Lefevre <vincent <at> vinc17.net>, 18266 <at> debbugs.gnu.org
Subject: Re: bug#18266: grep -P and invalid exits with error
Date: Wed, 10 Sep 2014 00:08:18 -0700
Paul Eggert wrote:
> perhaps there's a PCRE version dependency here?

I found a PCRE-version-dependent problem that may be relevant, and 
installed the attached further patch to fix it.
[0001-grep-port-recent-fix-to-older-pcre-version.patch (text/plain, attachment)]


Message #76 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Santiago <santiago <at> debian.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 758105 <at> bugs.debian.org
Cc: 18266 <at> debbugs.gnu.org, Vincent Lefevre <vincent <at> vinc17.net>
Subject: Re: Bug#758105: bug#18266: grep -P and invalid exits with error
Date: Wed, 10 Sep 2014 13:22:36 +0200
On 10/09/14 at 00:08, Paul Eggert wrote:
> Paul Eggert wrote:
> >perhaps there's a PCRE version dependency here?
> 
> I found a PCRE-version-dependent problem that may be relevant, and installed
> the attached further patch to fix it.

Thanks! I'm including this fix in the current debian package.

Santiago (Ruano Rincón)

P.S. Vincent Lefevre actually reported this bug, not Santiago Vila.





Message #79 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 758105 <at> bugs.debian.org, Santiago <santiago <at> debian.org>,
 Vincent Lefevre <vincent <at> vinc17.net>, 18266 <at> debbugs.gnu.org
Subject: Re: bug#18266: grep -P and invalid exits with error
Date: Wed, 10 Sep 2014 23:20:28 +0900
Thanks.  I have confirmed that the new version gives the expected output,
as follows.

$ env LC_ALL=en_US.utf8 src/grep -P '.?b' in
ab







Message #82 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Santiago <santiago <at> debian.org>
Cc: 18266 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>,
 758105 <at> bugs.debian.org
Subject: Re: Bug#758105: bug#18266: grep -P and invalid exits with error
Date: Thu, 11 Sep 2014 10:15:10 +0200
On 2014-09-10 13:22:36 +0200, Santiago wrote:
> Thanks! I'm including this fix in the current debian package.

Unfortunately, it is very slow.
I've just reported a new Debian bug concerning the performance problem.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Thu, 11 Sep 2014 11:08:02 GMT) Full text and rfc822 format available.

Message #85 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>,
 758105 <at> bugs.debian.org
Subject: handling bytes not part of the charset, and other garbage (was: grep
 -P and invalid exits with error)
Date: Thu, 11 Sep 2014 13:07:00 +0200
On 2014-09-01 01:31:53 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >If there are many invalid UTF8 bytes, this would be slow, IMHO
> 
> That's OK.  We don't need grep -P to be fast on invalid input.

I see an excessive slowdown in practical cases.

> >But is the copy of the buffer really needed? Couldn't the invalid
> >UTF8 sequences just be replaced by null bytes?
> 
> I'd rather not, because that changes the semantics of matching.  The null
> byte is valid input data that might get matched.

It appears that the current behavior in UTF-8 is incorrect, even
without -P. For instance:

$ printf 'tr\xe8s\n' > text
$ grep 'tr.s' text
$ LC_ALL=C grep 'tr.s' text
tr<E8>s

There's no reason that '.' matches something that doesn't belong to
the charset in C locale, but doesn't match in a UTF-8 locale.

The pattern tr.s is used here to match the French word "très" in files
that could be encoded in ISO-8859-1 or UTF-8 locales. In the past,
before using UTF-8 locales, I was doing something like:

  grep -E 'tr..?s' text

to match both encodings, and this worked. (I could get false positives,
but in practice one is often not interested in every real grep match
anyway, so even with a known encoding one was already getting false
positives.) It's annoying that now, in a UTF-8 locale, one can no
longer match ISO-8859-1 text, and a pre-conversion would take too
much time.

Concerning binary files, I've never wanted to differentiate explicitly
null bytes and invalid UTF-8 sequences: IMHO, this is just garbage.
There are obviously no differences with patterns like 'some_word' or
'foo[0-9]*bar', but when I use a pattern like 'foo.bar' or 'foo.*bar',
I can see two valid reasons to handle these sequences in a similar
way with '.':

1. One may want to match "valid" (often in the sense "printable", in
the specified encoding) but unknown characters.

2. One may also want to match garbage (including null bytes, and also
bytes that do not have any meaning in the charset), with the drawback
that if the garbage contains a newline character, this won't work.
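
The byte-wise matching of "très" described above can be reproduced
without any UTF-8 locale. This is a minimal sketch, assuming GNU grep;
the -a option and the octal printf escapes are additions for
illustration and are not part of the original commands:

```shell
# "très" encoded as ISO-8859-1 (one byte, 0xE8) and as UTF-8 (two bytes).
latin1_count=$(printf 'tr\350s\n' | LC_ALL=C grep -a -c 'tr.s' || true)
utf8_count=$(printf 'tr\303\250s\n' | LC_ALL=C grep -a -c 'tr.s' || true)

# 'tr..?s' allows one or two bytes between "tr" and "s", so in the C
# locale it matches both encodings.
both_count=$(printf 'tr\350s\ntr\303\250s\n' | LC_ALL=C grep -a -E -c 'tr..?s' || true)

echo "latin1=$latin1_count utf8=$utf8_count both=$both_count"
```

In the C locale the single-dot pattern should count the ISO-8859-1 line
only, while the 'tr..?s' form should count both lines.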





Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Thu, 11 Sep 2014 16:23:04 GMT) Full text and rfc822 format available.

Message #88 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>,
 758105 <at> bugs.debian.org
Subject: Re: handling bytes not part of the charset, and other garbage
Date: Thu, 11 Sep 2014 09:22:49 -0700
Vincent Lefevre wrote:

> There's no reason that '.' matches something that doesn't belong to
> the charset in C locale, but doesn't match in a UTF-8 locale.

In the C locale on GNU/Linux, all byte values are members of the 
charset.  That is why it's OK for '.' to accept that byte in the C 
locale but reject it in a UTF-8 locale.

> It's annoying that now in UTF-8, one can no longer match ISO-8859-1 text

This has been true for quite some time in 'grep', at least with the 
standard matchers.  It may not have been true for -P but that relied on 
undefined behavior that could crash grep, and we can't have that.

It would make sense to add a notation to mean "match any character or 
invalid byte", as an extension.  That'd take some work, though.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Thu, 11 Sep 2014 17:08:02 GMT) Full text and rfc822 format available.

Notification sent to Santiago <santiago <at> debian.org>:
bug acknowledged by developer. (Thu, 11 Sep 2014 17:08:02 GMT) Full text and rfc822 format available.

Message #93 received at 18266-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>, Santiago <santiago <at> debian.org>
Cc: 18266-done <at> debbugs.gnu.org, 758105 <at> bugs.debian.org, 761157 <at> bugs.debian.org
Subject: Re: bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with
 error
Date: Thu, 11 Sep 2014 10:07:49 -0700
Vincent Lefevre wrote:

> I've just reported a new Debian bug concerning the performance problem.

It's not clear from http://bugs.debian.org/761157 that the performance 
problem occurs only with -P, but I assume that's what is meant.

Since this is a performance bug with PCRE, I suggest moving the Debian 
bug report to the Debian libpcre3 package.  Grep cannot go back to the 
old way, which could cause grep to crash, and the bug cannot be fixed in 
grep because libpcre3 does not provide a fast way to search arbitrary 
data that may include encoding errors.  It really is a problem that 
requires changes to libpcre3 to fix; grep cannot fix it.

In the meantime, in order to use 'grep' to search for strings in 
arbitrary data, I suggest omitting the '-P'.  Also, I suggest using the 
C locale.

As the GNU bug 18266 "grep -P and invalid exits with error" has been 
fixed, I'm closing that bug report.  Please feel free to open a separate 
GNU bug report for the performance issue.

PS.  While composing this email I noticed another bug in grep -P and 
encoding errors, which I fixed by installing the attached patch.
[0001-grep-fix-false-matches-with-P-.-and-invalid-UTF-8.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Thu, 11 Sep 2014 18:38:02 GMT) Full text and rfc822 format available.

Message #96 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: 18266 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>, 
 Santiago <santiago <at> debian.org>
Cc: 758105 <at> bugs.debian.org, 18266-done <at> debbugs.gnu.org,
 Vincent Lefevre <vincent <at> vinc17.net>, 761157 <at> bugs.debian.org
Subject: Re: bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with
 error
Date: Thu, 11 Sep 2014 11:37:11 -0700
On Thu, Sep 11, 2014 at 10:07 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Vincent Lefevre wrote:
>
>> I've just reported a new Debian bug concerning the performance problem.
>
>
> It's not clear from http://bugs.debian.org/761157 that the performance
> problem occurs only with -P, but I assume that's what is meant.
>
> Since this is a performance bug with PCRE, I suggest moving the Debian bug
> report to the Debian libpcre3 package.  Grep cannot go back to the old way,
> which could cause grep to crash, and the bug cannot be fixed in grep because
> libpcre3 does not provide a fast way to search arbitrary data that may
> include encoding errors.  It really is a problem that requires changes to
> libpcre3 to fix; grep cannot fix it.
>
> In the meantime, in order to use 'grep' to search for strings in arbitrary
> data, I suggest omitting the '-P'.  Also, I suggest using the C locale.
>
> As the GNU bug 18266 "grep -P and invalid exits with error" has been fixed,
> I'm closing that bug report.  Please feel free to open a separate GNU bug
> report for the performance issue.
>
> PS.  While composing this email I noticed another bug in grep -P and
> encoding errors, which I fixed by installing the attached patch.

Thanks for fixing yet another bug, Paul.
Would you mind adding a test to trigger that one?




Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Thu, 11 Sep 2014 18:38:03 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Thu, 11 Sep 2014 19:11:01 GMT) Full text and rfc822 format available.

Message #102 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, 18266 <at> debbugs.gnu.org, 
 Santiago <santiago <at> debian.org>
Cc: 758105 <at> bugs.debian.org, 18266-done <at> debbugs.gnu.org,
 Vincent Lefevre <vincent <at> vinc17.net>, 761157 <at> bugs.debian.org
Subject: Re: bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with
 error
Date: Thu, 11 Sep 2014 12:10:27 -0700
On 09/11/2014 11:37 AM, Jim Meyering wrote:
> Would you mind adding a test to trigger that one?

Ordinarily I would have done that already but this -P stuff is so buggy 
and slow that I got discouraged.  (If we keep having trouble with -P I 
may start lobbying to remove it....) Anyway, I gave it a shot with the 
attached further patch.
[0001-grep-fix-false-matches-with-P-.-and-invalid-UTF-8.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Thu, 11 Sep 2014 19:11:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 00:38:02 GMT) Full text and rfc822 format available.

Message #108 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>,
 758105 <at> bugs.debian.org
Subject: Re: handling bytes not part of the charset, and other garbage
Date: Fri, 12 Sep 2014 02:36:59 +0200
On 2014-09-11 09:22:49 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> 
> >There's no reason that '.' matches something that doesn't belong to
> >the charset in C locale, but doesn't match in a UTF-8 locale.
> 
> In the C locale on GNU/Linux, all byte values are members of the charset.

I don't see any valid reason for that (the C locale corresponds
to ANSI_X3.4-1968, which is 7-bit only, so that there is some
inconsistency), except that it could be seen as more practical.
But then, I would say that this should be the same for invalid
byte sequences in a UTF-8 locale.





Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 01:17:02 GMT) Full text and rfc822 format available.

Message #111 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Thu, 11 Sep 2014 18:16:29 -0700
Vincent Lefevre wrote:
> the C locale corresponds to ANSI_X3.4-1968,

No it doesn't, at least not on any current platform I'm aware of.  And 
POSIX does not require that.  POSIX even allows the C locale to be 
multibyte, e.g., UTF-8.

> I would say that this should be the same for invalid
> byte sequences in a UTF-8 locale.

One *could* design an encoding with that property, but it wouldn't be 
UTF-8; it would be something else.  I don't know of any C library that 
does that to UTF-8.  There are good arguments against doing it, e.g., 
one loses the property that one can concatenate character strings by 
concatenating their byte representations.

Anyway I'm afraid we may be going off the deep end here.  After all, 
grep can't impose its coding system design onto the operating system; 
it's more the other way around.




Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 01:42:02 GMT) Full text and rfc822 format available.

Message #114 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Fri, 12 Sep 2014 03:41:24 +0200
On 2014-09-11 18:16:29 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >the C locale corresponds to ANSI_X3.4-1968,
> 
> No it doesn't, at least not on any current platform I'm aware of.

It does on Debian:

ypig% LC_ALL=C locale charmap
ANSI_X3.4-1968

> >I would say that this should be the same for invalid
> >byte sequences in a UTF-8 locale.
> 
> One *could* design an encoding with that property, but it wouldn't be UTF-8;
> it would be something else.  I don't know of any C library that does that to
> UTF-8.  There are good arguments against doing it, e.g., one loses the
> property that one can concatenate character strings by concatenating their
> byte representations.

I'm talking only about grep here.

BTW, the current behavior breaks the sometimes used "grep ." solution
to match non-empty lines.
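
One way to select non-empty lines that does not depend on what the
locale counts as a character is to invert a match on the empty line
instead of relying on '.'. This is offered as a sketch, not as
something proposed in this thread. It assumes GNU grep, and -a is
added so that binary-looking input is still treated as text:

```shell
# Three lines: a lone 0x80 byte (invalid as UTF-8), an empty line, "abc".
# '^$' matches only the empty line, so -v keeps the other two whatever
# the locale considers a character.
nonempty=$(printf '\200\n\nabc\n' | grep -a -c -v '^$' || true)
echo "non-empty lines: $nonempty"
```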





Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 01:43:01 GMT) Full text and rfc822 format available.

Message #117 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, Santiago <santiago <at> debian.org>,
 758105 <at> bugs.debian.org, 761157 <at> bugs.debian.org
Subject: Re: bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with
 error
Date: Fri, 12 Sep 2014 03:42:47 +0200
On 2014-09-11 10:07:49 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >I've just reported a new Debian bug concerning the performance problem.
> 
> It's not clear from http://bugs.debian.org/761157 that the performance
> problem occurs only with -P, but I assume that's what is meant.

It's specific to -P:

2.18-2   0.9s with -P, 0.4s without -P
2.20-3  11.6s with -P, 0.4s without -P

> Since this is a performance bug with PCRE, I suggest moving the Debian bug
> report to the Debian libpcre3 package.  Grep cannot go back to the old way,
> which could cause grep to crash, and the bug cannot be fixed in grep because
> libpcre3 does not provide a fast way to search arbitrary data that may
> include encoding errors.  It really is a problem that requires changes to
> libpcre3 to fix; grep cannot fix it.

Fixing the performance problem in libpcre3 would indeed be better
(even with the old version of grep, libpcre3 was twice as slow as
grep, but this is less critical than a 13x slowdown).

However a workaround in grep could be simpler. I've just opened a
new bug and suggested several solutions:

  http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18454

> In the meantime, in order to use 'grep' to search for strings in arbitrary
> data, I suggest omitting the '-P'.

This is a bit annoying because I sometimes use specific PCRE features.
I could try to parse the arguments, detect where the pattern is used,
and avoid -P if the pattern doesn't use specific PCRE features (at
least for the most common forms). An additional advantage is that it
could be twice as fast in most cases (see above). This could also be
done in grep, as I suggested in my new bug report.

> Also, I suggest using the C locale.

This could be a solution, because in practice, I pipe the result
to "less -FRX", but only grep has to use the C locale, so that the
accented characters are correctly displayed by "less". However with
some (rare?) patterns, it won't work because an accented character
would no longer be seen as a single character.
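
A sketch of that arrangement, with the C locale applied to grep alone.
The function name cgrep is hypothetical, and -a is an added assumption
so that grep prints matches from binary-looking input:

```shell
# Run grep in the C locale without changing the locale of the rest of
# the pipeline (for example a pager started afterwards).
cgrep() {
  LC_ALL=C grep -a "$@"
}

# Both input lines contain "j"; the 0x82 byte is just another character here.
j_count=$(printf 'j\202\nj\n' | cgrep -c j || true)

# The drawback mentioned above: a UTF-8 "è" is two bytes in the C locale,
# so a single '.' no longer covers it.
e_count=$(printf 'tr\303\250s\n' | cgrep -c 'tr.s' || true)

echo "j=$j_count e=$e_count"
```

Interactive use would then look like "cgrep -r pattern dir | less -FRX",
where less still runs in the user's own locale.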





Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 03:27:02 GMT) Full text and rfc822 format available.

Message #120 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Thu, 11 Sep 2014 20:26:12 -0700
Vincent Lefevre wrote:

> ypig% LC_ALL=C locale charmap
> ANSI_X3.4-1968

That may be what the 'locale' command says, but bytes with the top bit 
on are considered to be valid single-byte characters.  There are no 
encoding errors.  So, in that sense it's not strict ASCII.

> the current behavior breaks the sometimes used "grep ." solution
> to match non-empty lines.

"grep ." matches lines containing one or more characters.  Encoding 
errors are not characters, at least, not as far as plain grep is concerned.

Perhaps PCRE is different, and if libpcre worked with encoding errors we 
could simply use its way of matching them.  But there doesn't seem to be 
a safe way to do that.




Forcibly Merged 18266 18455. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Fri, 12 Sep 2014 03:43:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 08:30:03 GMT) Full text and rfc822 format available.

Message #125 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Fri, 12 Sep 2014 10:29:16 +0200
On 2014-09-11 20:26:12 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> 
> >ypig% LC_ALL=C locale charmap
> >ANSI_X3.4-1968
> 
> That may be what the 'locale' command says, but bytes with the top bit on
> are considered to be valid single-byte characters.  There are no encoding
> errors.  So, in that sense it's not strict ASCII.

Glibc regards it as ASCII:

$ printf '\xe8' | LC_ALL=C iconv
iconv: illegal input sequence at position 0

> >the current behavior breaks the sometimes used "grep ." solution
> >to match non-empty lines.
> 
> "grep ." matches lines containing one or more characters.  Encoding errors
> are not characters, at least, not as far as plain grep is concerned.

I just mean that "grep ." is a method given by some people, that
was working before UTF-8.





Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 16:14:02 GMT) Full text and rfc822 format available.

Message #128 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 761157 <761157 <at> bugs.debian.org>, Santiago <santiago <at> debian.org>,
 18266 <18266 <at> debbugs.gnu.org>, 18266-done <18266-done <at> debbugs.gnu.org>,
 Vincent Lefevre <vincent <at> vinc17.net>, 758105 <758105 <at> bugs.debian.org>
Subject: Re: bug#18266: Bug#758105: bug#18266: grep -P and invalid exits with
 error
Date: Fri, 12 Sep 2014 09:13:22 -0700
On Thu, Sep 11, 2014 at 12:10 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 09/11/2014 11:37 AM, Jim Meyering wrote:
>>
>> Would you mind adding a test to trigger that one?
>
> Ordinarily I would have done that already but this -P stuff is so buggy and
> slow that I got discouraged.  (If we keep having trouble with -P I may start
> lobbying to remove it....) Anyway, I gave it a shot with the attached
> further patch.

Thank you. Looks perfect.

I too rely on grep's -P, sometimes using PCRE features
that are very hard to emulate using EREs.




Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 16:14:03 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 16:17:01 GMT) Full text and rfc822 format available.

Message #134 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Fri, 12 Sep 2014 09:16:45 -0700
Vincent Lefevre wrote:
> Glibc regards it as ASCII:

You're right.  Sorry, I was confused.  FreeBSD, Solaris, and AIX work 
the way that I thought, though.  Plus, in GNU regular expressions the 
pattern "." works the way that I thought with LC_ALL=C; my guess 
(without investigating this) is that this is because whoever wrote the 
regex code assumed the BSDish behavior.  Arguably this is a glitch in 
the GNU regex code, in that for consistency "." should not match 
encoding errors in unibyte locales.

Here's a pair of test cases to illustrate the glitch:

$ printf '\200\n' | LC_ALL=en_US.utf8 grep '.' | wc
      0       0       0
$ printf '\200\n' | LC_ALL=C grep '.' | wc
      1       0       2

> I just mean that "grep ." is a method given by some people, that
> was working before UTF-8.

And it still works, if by "." one means "match one character".

Unfortunately there is no POSIX regular expression that does what you're 
looking for (match either one character, or a single byte that is an 
encoding error).  This is because POSIX says the behavior is undefined 
on encoding errors.  The GNU syntax for regular expressions extends 
POSIX and does not dump core, but it still provides no way to write the 
pattern you're asking for, and the behavior is unspecified on encoding 
errors.  Perhaps this should be improved by fixing the abovementioned 
glitch and by providing a syntax extension for matching encoding errors, 
though we'd need a volunteer to do that.

The situation with libpcre is weirder: there's a pattern '\C' for 
matching a single byte even if it's an encoding error, but as far as I 
can tell there's no way to use regular expressions safely on arbitrary 
data containing encoding errors unless you're in unibyte mode (in which 
case '\C' provides no extra power).  I.e., \C appears to be useless in 
any program for which undefined behavior is unacceptable.




Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 21:30:03 GMT) Full text and rfc822 format available.

Message #137 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Fri, 12 Sep 2014 23:29:39 +0200
On 2014-09-12 09:16:45 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >I just mean that "grep ." is a method given by some people, that
> >was working before UTF-8.
> 
> And it still works, if by "." one means "match one character".

No, by "working", I mean that "grep ." was matching any non-empty
line. A non-empty line is anything that is not "\n", with valid
characters and/or invalid byte sequences.

> Unfortunately there is no POSIX regular expression that does what you're
> looking for (match either one character, or a single byte that is an
> encoding error).  This is because POSIX says the behavior is undefined on
> encoding errors.

But since the behavior is undefined, a grep implementation is free
to do anything it likes, such as make "." match invalid bytes. See
below for details.

> The GNU syntax for regular expressions extends POSIX and does not
> dump core, but it still provides no way to write the pattern you're
> asking for, and the behavior is unspecified on encoding errors.
> Perhaps this should be improved by fixing the abovementioned glitch
> and by providing a syntax extension for matching encoding errors,
> though we'd need a volunteer to do that.

I'm not sure that a syntax extension would really be useful. I think
that an option to control what happens on encoding errors would be
better and sufficient. For instance, a choice between the 4 following
behaviors:

1. If an encoding error is encountered, grep returns an error. Some
encoding errors may remain unnoticed, e.g. if -m is used and the
max count has been reached (you can see the behavior of such an error
as being similar to a file read error). The error may be signaled
immediately, even when there is a match before it.

2. An encoding error is never matched. I suppose that this is the
current behavior in UTF-8.

3. An encoding error is regarded as a special character different
from the other characters. In particular it will be matched by "."
and "[^...]". Whether a sequence of invalid bytes is regarded as a
single special character or several ones could be specified or not
(in practice, there could be 2 possibilities: either regard each
byte as a special character, or regard each longest valid prefix
as a special character). The properties of this special character
could be specified or not, concerning character classes (I would
say that the character doesn't fall in any class, possibly except
cntrl).

4. Like (3), but the character could be an existing one (such as \0).
The idea behind this behavior is that the user may not really care,
but wants grep to be fast. Now, unless \0 appears in the pattern
under some form, replacing the encoding error by a null character
would be equivalent to "(3) + the special character is in the cntrl
character class".

> The situation with libpcre is weirder: there's a pattern '\C' for
> matching a single byte even if it's an encoding error, but as far as
> I can tell there's no way to use regular expressions safely on
> arbitrary data containing encoding errors unless you're in unibyte
> mode (in which case '\C' provides no extra power). I.e., \C appears
> to be useless in any program for which undefined behavior is
> unacceptable.

In the context of libpcre (which doesn't support encoding errors,
contrary to Perl if I understand correctly), \C can still be used
and be useful when there are no encoding errors. But note that the
pcresyntax(3) man page says "best avoided", the pcrepattern(3) man
page says that it can yield undefined behavior (but gives a complex
example where it can be used), and the perlre(1) man page says that
\C is deprecated. So, grep could say that \C is not supported.





Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 21:40:02 GMT) Full text and rfc822 format available.

Message #140 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Fri, 12 Sep 2014 14:39:35 -0700
On 09/12/2014 02:29 PM, Vincent Lefevre wrote:

> an option to control what happens on encoding errors would be better 
> and sufficient.

It might suffice for your use cases, but it's more complicated and less 
flexible than being able to match bytes within the regular expression.  
(Plus, someone would have to implement it, which is perhaps the biggest 
objection to either approach ....)  But I take your point that \C is 
best avoided.  This whole area is pretty hairy, I'm afraid.

Speaking of hairy, why doesn't grep use PCRE_MULTILINE?  Using 
PCRE_MULTILINE shouldn't be that hard, and should boost performance 
quite a bit in typical usage.  Or am I being too optimistic here?




Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 22:24:01 GMT) Full text and rfc822 format available.

Message #143 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 758105 <at> bugs.debian.org, Vincent Lefevre <vincent <at> vinc17.net>,
 18266 <at> debbugs.gnu.org
Subject: Re: bug#18266: handling bytes not part of the charset,
 and other garbage
Date: Fri, 12 Sep 2014 15:23:08 -0700
On Fri, Sep 12, 2014 at 2:39 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 09/12/2014 02:29 PM, Vincent Lefevre wrote:
>
>> an option to control what happens on encoding errors would be better and
>> sufficient.
>
>
> It might suffice for your use cases, but it's more complicated and less
> flexible than being able to match bytes within the regular expression.
> (Plus, someone would have to implement it, which is perhaps the biggest
> objection to either approach ....)  But I take your point that \C is best
> avoided.  This whole area is pretty hairy, I'm afraid.
>
> Speaking of hairy, why doesn't grep use PCRE_MULTILINE?  Using
> PCRE_MULTILINE shouldn't be that hard, and should boost performance quite a
> bit in typical usage.  Or am I being too optimistic here?

When I first saw that implementation, I assumed it was just a first-cut one.
I see no reason not to use PCRE_MULTILINE, but haven't tried it, either.




Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Fri, 12 Sep 2014 22:41:03 GMT) Full text and rfc822 format available.

Message #146 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Sat, 13 Sep 2014 00:40:33 +0200
On 2014-09-12 14:39:35 -0700, Paul Eggert wrote:
> On 09/12/2014 02:29 PM, Vincent Lefevre wrote:
> >an option to control what happens on encoding errors would be
> >better and sufficient.
> 
> It might suffice for your use cases, but it's more complicated and less
> flexible than being able to match bytes within the regular expression.

But IMHO, some solutions I proposed would be faster.

I wonder whether anyone is interested in matching individual bytes
in a file regarded as UTF-8 encoded. This seems weird.

> Speaking of hairy, why doesn't grep use PCRE_MULTILINE?  Using
> PCRE_MULTILINE shouldn't be that hard, and should boost performance
> quite a bit in typical usage.  Or am I being too optimistic here?

Perhaps in text files. In binary files, with the current solution,
I don't think this matters as failures due to invalid bytes
typically occur several times per line.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Sat, 13 Sep 2014 00:58:02 GMT) Full text and rfc822 format available.

Message #149 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Fri, 12 Sep 2014 17:57:39 -0700
Vincent Lefevre wrote:
> I wonder whether anyone is interested in matching individual bytes
> in a file regarded as UTF-8 encoded. This seems weird.

It's not weird at all.  For example, suppose we invent the notation 
[[:error:]] to match encoding errors.  Then the pattern '[[:error:]]' 
would match all encoding errors in a file, which could well be a useful 
thing.

Currently, for example, the tz package <http://www.iana.org/time-zones> 
has a Make rule 'check_character_set' that verifies that the source 
files are all properly encoded.  It executes this shell command:

! grep -nv '^.*$' file names

This relies on GNU grep's behavior that "." does not match an encoding 
error.  But it's a command that is not obvious.  It'd be simpler and 
clearer to write this:

! grep -n '[[:error:]]' file names

if such a feature were available.
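
As a rough illustration of the existing idiom (not the proposed '[[:error:]]' notation), the sketch below wraps the 'grep -nv' check in a shell function. It assumes GNU grep and an installed UTF-8 locale; the function name check_encoding and the variable UTF8_LOCALE are hypothetical and do not come from the tz package.

```shell
# check_encoding FILE...
# Succeeds only if every line of every FILE is fully matched by '^.*$'.
# Relies on GNU grep's behavior that "." does not match an encoding error.
# UTF8_LOCALE is a hypothetical override; C.UTF-8 is assumed to be installed.
check_encoding() {
    # With -v, grep selects the lines that '^.*$' fails to match.
    LC_ALL=${UTF8_LOCALE:-C.UTF-8} grep -nv '^.*$' -- "$@"
    # grep exits with status 1 when no line was selected, i.e. no errors found.
    [ $? -eq 1 ]
}
```

A caller might write 'check_encoding file names || echo "encoding errors found"'. If the requested locale is not installed, grep may fall back to the C locale and report nothing, so the locale has to be checked separately.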




Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Sat, 13 Sep 2014 01:18:02 GMT) Full text and rfc822 format available.

Message #152 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Sat, 13 Sep 2014 03:17:41 +0200
On 2014-09-12 17:57:39 -0700, Paul Eggert wrote:
> Currently, for example, the tz package <http://www.iana.org/time-zones> has
> a Make rule 'check_character_set' that verifies that the source files are
> all properly encoded.  It executes this shell command:
> 
> ! grep -nv '^.*$' file names
> 
> This relies on GNU grep's behavior that "." does not match an encoding
> error.  But it's a command that is not obvious.  It'd be simpler and clearer
> to write this:
> 
> ! grep -n '[[:error:]]' file names
> 
> if such a feature were available.

But both of these solutions have the drawback of working only in
UTF-8 locales. One may wonder whether grep is the right tool, as
"iconv -f UTF-8 -t UTF-8" can do such a check in any locale.

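A minimal sketch of the iconv-based check mentioned above, assuming an iconv command is available on the PATH; the function name is_valid_utf8 is hypothetical. Since both encodings are named explicitly, the result should not depend on the current locale.

```shell
# is_valid_utf8 FILE
# Succeeds only if FILE converts cleanly from UTF-8 to UTF-8,
# that is, if iconv finds no invalid byte sequence in it.
is_valid_utf8() {
    # Discard the converted text and diagnostics; only the exit status matters.
    iconv -f UTF-8 -t UTF-8 <"$1" >/dev/null 2>&1
}
```

For example, 'is_valid_utf8 somefile || echo "not valid UTF-8"'. Unlike the grep forms, this reports only success or failure and does not print the offending line numbers.
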
-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Sat, 13 Sep 2014 02:09:02 GMT) Full text and rfc822 format available.

Message #155 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Fri, 12 Sep 2014 19:08:38 -0700
Vincent Lefevre wrote:

> But both of these solutions have the drawback of working only in
> UTF-8 locales.

Not at all; '[[:error:]]' would match a single-byte encoding error in 
the current locale.  The tz database is interested in UTF-8 so it sets 
the LC_ALL environment variable to a UTF-8 locale, but that setting 
shouldn't be required in general.

Also, the tz database needs grep patterns that iconv doesn't support. 
For example, one rule is that commentary (which starts with #) can 
contain UTF-8 characters, but the ordinary data (before the #) is 
limited to a smaller set.  This is captured by the command:

grep -Env '^[ordinarycharset]*(#.*)?$'

where 'ordinarycharset' is the set of ASCII characters in ordinary tz 
data.  Here it's useful that '.' does not match encoding errors on 
GNU/Linux.
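
The rule above can be sketched as a shell function. The character list in ORDINARY is a made-up stand-in for 'ordinarycharset', not the set the tz package actually uses, and the function name check_data_charset is hypothetical. The sketch assumes a grep that supports -E.

```shell
# Hypothetical stand-in for 'ordinarycharset': letters, digits, and a few
# punctuation characters, listed one by one so that no locale-dependent
# range such as A-Z is involved. The '-' comes last so it is taken literally.
ORDINARY='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :/_+.-'

# check_data_charset FILE...
# Prints (with line numbers) every line whose data part, the text before '#',
# contains a character outside ORDINARY; commentary after '#' is left to '.'.
# Succeeds only if no such line exists.
check_data_charset() {
    grep -Env "^[${ORDINARY}]*(#.*)?\$" -- "$@"
    # grep exits with status 1 when no line was selected, i.e. all lines passed.
    [ $? -eq 1 ]
}
```

Run under a UTF-8 locale, as the tz database does by setting LC_ALL, the '.' in the commentary part should additionally reject encoding errors on GNU/Linux.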




Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Sat, 13 Sep 2014 02:12:01 GMT) Full text and rfc822 format available.

Message #158 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Fri, 12 Sep 2014 19:11:08 -0700
Come to think of it, grep -P misbehaves badly in multibyte locales that 
are not UTF-8.  It should report an error and exit rather than output 
gibberish.  I installed the attached patch to catch that.

[0001-grep-diagnose-P-in-non-UTF-8-multibyte-locale.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Mon, 15 Sep 2014 05:33:01 GMT) Full text and rfc822 format available.

Message #161 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Sun, 14 Sep 2014 22:32:28 -0700
Attached are some proposed patches which should improve the performance 
of grep -P when applied to binary files, among other things.  I have 
some other ideas for boosting performance further but thought I'd 
publish these first.  Please give them a try if you have the time.  I 
doubt whether this will "solve" the performance problem entirely with -P 
and encoding errors but at least it should be heading in the right 
direction.
[0001-grep-remove-refactor-unnecessary-code-about-line-spl.patch (text/plain, attachment)]
[0002-grep-speed-up-P-on-files-containing-many-multibyte-e.patch (text/plain, attachment)]
[0003-grep-use-bool-for-boolean-in-grep.c.patch (text/plain, attachment)]
[0004-grep-treat-a-file-as-binary-if-its-prefix-contains-e.patch (text/plain, attachment)]
[0005-grep-improve-performance-for-older-glibc.patch (text/plain, attachment)]
[0006-grep-use-mbclen-cache-more-effectively.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18266; Package grep. (Wed, 17 Sep 2014 01:29:02 GMT) Full text and rfc822 format available.

Message #164 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other
 garbage
Date: Tue, 16 Sep 2014 18:28:18 -0700
Paul Eggert wrote:
> Attached are some proposed patches which should improve the performance
> of grep -P when applied to binary files, among other things.  I have
> some other ideas for boosting performance further but thought I'd
> publish these first.

I pushed those patches, along with the attached further patches to fix 
up some porting glitches and bugs I encountered in subsequent testing. 
I plan to follow up soon on Bug#18454 with more performance-related 
patches in this area.
[0007-grep-avoid-false-alarms-for-mb_clen-and-to_uchar.patch (text/plain, attachment)]
[0008-grep-use-mbclen-cache-in-one-more-place.patch (text/plain, attachment)]
[0009-grep-port-P-speedup-to-hosts-lacking-PCRE_STUDY_JIT_.patch (text/plain, attachment)]
[0010-grep-fix-P-speedup-bug-with-empty-match.patch (text/plain, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 15 Oct 2014 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 9 years and 212 days ago.
