GNU bug report logs - #62983
workaround PCRE2 bug affecting at least \D and \W

Package: grep;

Reported by: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>

Date: Fri, 21 Apr 2023 02:05:01 UTC

Severity: normal

To reply to this bug, email your comments to 62983 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-grep <at> gnu.org:
bug#62983; Package grep. (Fri, 21 Apr 2023 02:05:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Carlo Marcelo Arenas Belón <carenas <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 21 Apr 2023 02:05:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: workaround PCRE2 bug affecting at least \D and \W
Date: Thu, 20 Apr 2023 19:04:18 -0700

[Message part 1 (text/plain, inline)]

All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
its JIT implementation that results in failure to match for the negative
perl classes, and seems to be easier to replicate when the matching
character is a multibyte one.

Disable that flag and use the original fallback instead.

Alternatively JIT could be disabled instead, but the option selected has
less of an impact on performance.

Carlo

[0001-pcre-workaround-bug-affecting-W-or-D.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62983; Package grep. (Fri, 21 Apr 2023 02:34:01 GMT) Full text and rfc822 format available.

Message #8 received at 62983 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
Cc: 62983 <at> debbugs.gnu.org
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
Date: Thu, 20 Apr 2023 19:33:25 -0700

On Thu, Apr 20, 2023 at 7:05 PM Carlo Marcelo Arenas Belón
<carenas <at> gmail.com> wrote:
> All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> its JIT implementation that results in failure to match for the negative
> perl classes, and seems to be easier to replicate when the matching
> character is a multibyte one.
>
> Disable that flag and use the original fallback instead.
>
> Alternatively JIT could be disabled instead, but the option selected has
> less of an impact on performance.

Thanks for the patch! Is there any PCRE-upstream discussion about this?
If so, I'd like to reference that from your commit log.

Information forwarded to bug-grep <at> gnu.org:
bug#62983; Package grep. (Fri, 21 Apr 2023 02:36:02 GMT) Full text and rfc822 format available.

Message #11 received at 62983 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
Cc: 62983 <at> debbugs.gnu.org
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
Date: Thu, 20 Apr 2023 19:35:05 -0700

On Thu, Apr 20, 2023 at 7:33 PM Jim Meyering <jim <at> meyering.net> wrote:
>
> On Thu, Apr 20, 2023 at 7:05 PM Carlo Marcelo Arenas Belón
> <carenas <at> gmail.com> wrote:
> > All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> > its JIT implementation that results in failure to match for the negative
> > perl classes, and seems to be easier to replicate when the matching
> > character is a multibyte one.
> >
> > Disable that flag and use the original fallback instead.
> >
> > Alternatively JIT could be disabled instead, but the option selected has
> > less of an impact on performance.
>
> Thanks for the patch! Is there any PCRE-upstream discussion about this?
> If so, I'd like to reference that from your commit log.

Oh! I see it in the test file:
  https://github.com/PCRE2Project/pcre2/issues/224

Information forwarded to bug-grep <at> gnu.org:
bug#62983; Package grep. (Fri, 21 Apr 2023 18:44:01 GMT) Full text and rfc822 format available.

Message #14 received at 62983 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
Cc: 62983 <at> debbugs.gnu.org
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
Date: Fri, 21 Apr 2023 11:42:50 -0700

[Message part 1 (text/plain, inline)]

On 2023-04-20 19:04, Carlo Marcelo Arenas Belón wrote:
> All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> its JIT implementation that results in failure to match for the negative
> perl classes, and seems to be easier to replicate when the matching
> character is a multibyte one.

Unfortunately that is a little vague. I expect the issue is not limited 
to \D and \W, as there are other ways to specify negative Perl classes. 
And if the bug merely seems to be easier to replicate with multibyte 
characters, it sounds like we may have issues even when matching ASCII 
characters in a UTF-8 locale.

Furthermore, I'm leery of optimizing for PCRE2 10.42 and earlier. We 
should focus our optimization efforts on future PCRE2 versions, and not 
worry about optimizing earlier versions where optimizations complicate 
maintenance for a declining benefit, and are likely to provoke bugs in 
older versions that as time passes will be harder to debug.

> Alternatively JIT could be disabled instead, but the option selected has
> less of an impact on performance.

Disabling JIT sounds better, as correctness trumps performance. Until 
the bug is fixed (or at least better-understood so that we have a 
workaround we can trust), how about the attached patch instead?

[0001-grep-use-PCRE2-JIT-only-in-unibyte-locales.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62983; Package grep. (Fri, 21 Apr 2023 20:22:02 GMT) Full text and rfc822 format available.

Message #17 received at 62983 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 62983 <at> debbugs.gnu.org
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
Date: Fri, 21 Apr 2023 13:20:53 -0700

[Message part 1 (text/plain, inline)]

On Fri, Apr 21, 2023 at 11:42:50AM -0700, Paul Eggert wrote:
> On 2023-04-20 19:04, Carlo Marcelo Arenas Belón wrote:
> > All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> > its JIT implementation that results in failure to match for the negative
> > perl classes, and seems to be easier to replicate when the matching
> > character is a multibyte one.
> 
> Unfortunately that is a little vague. I expect the issue is not limited to
> \D and \W, as there are other ways to specify negative Perl classes.

Correct, it should also affect at least \S, but hadn't been able to trigger
it there.

The bug was that an uninitialized value was being used in the JIT code that
supports the PCRE2_MATCH_INVALID_UTF mode. which is why I said "randomly" in
the commit message.

If you want to be strict, how about the attached patch instead?

> And if
> the bug merely seems to be easier to replicate with multibyte characters, it
> sounds like we may have issues even when matching ASCII characters in a
> UTF-8 locale.

Which the current workaround addresses, since you need both PCRE2_JIT and
PCRE2_MATCH_INVALID_UTF to trigger it, and the subject encoding is irrelevant
for the logic to decide if PCRE2_MATCH_INVALID_UTF gets enabled or not.

> Furthermore, I'm leery of optimizing for PCRE2 10.42 and earlier. We should
> focus our optimization efforts on future PCRE2 versions, and not worry about
> optimizing earlier versions where optimizations complicate maintenance for a
> declining benefit, and are likely to provoke bugs in older versions that as
> time passes will be harder to debug.

Not sure I understand your concern here, but if it is about disabling JIT
insteed, then the possibility of introducing bugs is even bigger since it
affects all versions of PCRE2 (not only 10.34 or newer).

> > Alternatively JIT could be disabled instead, but the option selected has
> > less of an impact on performance.
> 
> Disabling JIT sounds better, as correctness trumps performance. Until the
> bug is fixed (or at least better-understood so that we have a workaround we
> can trust), how about the attached patch instead?

The bug has been fixed already, and will be included in the next release.
There might be additional changes as spelled in that discussion, and indeed
the change to the proposed solution proactively helps with one of those.

It is very unlikely, but some systems might include non 0 values on the
tables for characters over 127 and that might trigger a similar problem that
is yet to be fixed.

Carlo

[1] https://github.com/PCRE2Project/pcre2/commit/2c08b619dc973beacc474dcb67cda8cd366200ce

[Message part 2 (text/x-patch, inline)]

From 919d4aa016dd979a52b9e5fd3b0ba1d1cf833ac8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?= <carenas <at> gmail.com>
Date: Thu, 20 Apr 2023 18:37:20 -0700
Subject: [PATCH v2] pcre: workaround bug affecting PCRE2_MATCH_INVALID_UTF

PCRE2 has a bug when using PCRE2_MATCH_INVALID_UTF that would
randomly fail to match patterns using perl negative classes
(like \W or \D).

* NEWS: mention this
* src/pcre2search.c: restric impact of the but
not use the problematic flag in all broken versions of PCRE2
only generate locale tables for non Unicode
* tests: add new pcre2-utf-bug224 test with replications for \[W|D]
---
 NEWS                   |  5 +++++
 src/pcresearch.c       | 22 ++++++++++++++--------
 tests/Makefile.am      |  1 +
 tests/pcre-utf8-bug224 | 31 +++++++++++++++++++++++++++++++
 4 files changed, 51 insertions(+), 8 deletions(-)
 create mode 100755 tests/pcre-utf8-bug224

diff --git a/NEWS b/NEWS
index f16c576..3552db1 100644
--- a/NEWS
+++ b/NEWS
@@ -15,6 +15,11 @@ GNU grep NEWS                                    -*- outline -*-
   when running on 32-bit x86 and ARM hosts using glibc 2.34+.
   [bug introduced in grep 3.9]
 
+  grep no longer fails to match patterns which relied on negative perl
+  classes like \D or \W when linked with PCRE2 10.34 or newer.
+  [bug introduced in grep 3.8]
+
+
 ** Changes in behavior
 
   grep --version now prints a line describing the version of PCRE2 it uses.
diff --git a/src/pcresearch.c b/src/pcresearch.c
index e867f49..a64b65b 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -58,6 +58,9 @@ struct pcre_comp
   /* Table, indexed by ! (flag & PCRE2_NOTBOL), of whether the empty
      string matches when that flag is used.  */
   int empty_match[2];
+
+  /* Flags */
+  unsigned binary_safe:1;
 };
 
 /* Memory allocation functions for PCRE.  */
@@ -130,16 +133,11 @@ jit_exec (struct pcre_comp *pc, char const *subject, idx_t search_bytes,
     }
 }
 
-/* Return true if E is an error code for bad UTF-8, and if pcre2_match
-   could return E because PCRE lacks PCRE2_MATCH_INVALID_UTF.  */
+/* Return true if E is an error code for bad UTF-8 */
 static bool
 bad_utf8_from_pcre2 (int e)
 {
-#ifdef PCRE2_MATCH_INVALID_UTF
-  return false;
-#else
   return PCRE2_ERROR_UTF8_ERR21 <= e && e <= PCRE2_ERROR_UTF8_ERR1;
-#endif
 }
 
 /* Compile the -P style PATTERN, containing SIZE bytes that are
@@ -157,6 +155,7 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
     = pcre2_general_context_create (private_malloc, private_free, NULL);
   pcre2_compile_context *ccontext = pcre2_compile_context_create (gcontext);
 
+  pc->binary_safe = false;
   if (localeinfo.multibyte)
     {
       uint32_t unicode;
@@ -181,8 +180,13 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
       flags |= PCRE2_NEVER_BACKSLASH_C;
 #endif
 #ifdef PCRE2_MATCH_INVALID_UTF
+      /* workaround PCRE2 bug
+         https://github.com/PCRE2Project/pcre2/issues/224 */
+#if PCRE2_MAJOR == 10 && PCRE2_MINOR > 42
+      pc->binary_safe = true;
       /* Consider invalid UTF-8 as a barrier, instead of error.  */
       flags |= PCRE2_MATCH_INVALID_UTF;
+#endif
 #endif
     }
 
@@ -226,7 +230,9 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
       size = re_size;
     }
 
-  pcre2_set_character_tables (ccontext, pcre2_maketables (gcontext));
+  if (!localeinfo.multibyte)
+    pcre2_set_character_tables (ccontext, pcre2_maketables (gcontext));
+
   pc->cre = pcre2_compile ((PCRE2_SPTR) pattern, size, flags,
                            &ec, &e, ccontext);
   if (!pc->cre)
@@ -313,7 +319,7 @@ Pexecute (void *vcp, char const *buf, idx_t size, idx_t *match_size,
 
           e = jit_exec (pc, subject, line_end - subject,
                         search_offset, options);
-          if (!bad_utf8_from_pcre2 (e))
+          if (pc->binary_safe || !bad_utf8_from_pcre2 (e))
             break;
 
           idx_t valid_bytes = pcre2_get_startchar (pc->data);
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 7718f24..9b4422e 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -155,6 +155,7 @@ TESTS =						\
   pcre-jitstack					\
   pcre-o					\
   pcre-utf8					\
+  pcre-utf8-bug224				\
   pcre-utf8-w					\
   pcre-w					\
   pcre-wx-backref				\
diff --git a/tests/pcre-utf8-bug224 b/tests/pcre-utf8-bug224
new file mode 100755
index 0000000..549cc43
--- /dev/null
+++ b/tests/pcre-utf8-bug224
@@ -0,0 +1,31 @@
+#!/bin/sh
+# Ensure negative perl classes matches multibyte characters in UTF mode
+#
+# Copyright (C) 2023 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_en_utf8_locale_
+LC_ALL=en_US.UTF-8
+export LC_ALL
+require_pcre_
+
+echo . | grep -qP '(*UTF).' 2>/dev/null \
+  || skip_ 'PCRE unicode support is compiled out'
+
+fail=0
+
+# 'ñ' (U+00F1)
+printf '\302\221\n' > in || framework_failure_
+grep -P '\D' in > out || fail=1
+compare in out || fail=1
+
+# “𝄞” (U+1D11E)
+printf '\360\235\204\236\n' > in || framework_failure_
+grep -P '\W' in > out || fail=1
+compare in out || fail=1
+
+Exit $fail
-- 
2.39.2 (Apple Git-143)

Information forwarded to bug-grep <at> gnu.org:
bug#62983; Package grep. (Sat, 29 Apr 2023 06:56:02 GMT) Full text and rfc822 format available.

Message #20 received at 62983 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 62983 <at> debbugs.gnu.org
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
Date: Sat, 29 Apr 2023 08:54:44 +0200

[Message part 1 (text/plain, inline)]

On Fri, Apr 21, 2023 at 10:22 PM Carlo Marcelo Arenas Belón
<carenas <at> gmail.com> wrote:
> On Fri, Apr 21, 2023 at 11:42:50AM -0700, Paul Eggert wrote:
> > On 2023-04-20 19:04, Carlo Marcelo Arenas Belón wrote:
> > > All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> > > its JIT implementation that results in failure to match for the negative
> > > perl classes, and seems to be easier to replicate when the matching
> > > character is a multibyte one.
> >
> > Unfortunately that is a little vague. I expect the issue is not limited to
> > \D and \W, as there are other ways to specify negative Perl classes.
>
> Correct, it should also affect at least \S, but hadn't been able to trigger
> it there.
>
> The bug was that an uninitialized value was being used in the JIT code that
> supports the PCRE2_MATCH_INVALID_UTF mode. which is why I said "randomly" in
> the commit message.
>
> If you want to be strict, how about the attached patch instead?
>
> > And if
> > the bug merely seems to be easier to replicate with multibyte characters, it
> > sounds like we may have issues even when matching ASCII characters in a
> > UTF-8 locale.
>
> Which the current workaround addresses, since you need both PCRE2_JIT and
> PCRE2_MATCH_INVALID_UTF to trigger it, and the subject encoding is irrelevant
> for the logic to decide if PCRE2_MATCH_INVALID_UTF gets enabled or not.
>
> > Furthermore, I'm leery of optimizing for PCRE2 10.42 and earlier. We should
> > focus our optimization efforts on future PCRE2 versions, and not worry about
> > optimizing earlier versions where optimizations complicate maintenance for a
> > declining benefit, and are likely to provoke bugs in older versions that as
> > time passes will be harder to debug.
>
> Not sure I understand your concern here, but if it is about disabling JIT
> insteed, then the possibility of introducing bugs is even bigger since it
> affects all versions of PCRE2 (not only 10.34 or newer).
>
> > > Alternatively JIT could be disabled instead, but the option selected has
> > > less of an impact on performance.
> >
> > Disabling JIT sounds better, as correctness trumps performance. Until the
> > bug is fixed (or at least better-understood so that we have a workaround we
> > can trust), how about the attached patch instead?
>
> The bug has been fixed already, and will be included in the next release.
> There might be additional changes as spelled in that discussion, and indeed
> the change to the proposed solution proactively helps with one of those.
>
> It is very unlikely, but some systems might include non 0 values on the
> tables for characters over 127 and that might trigger a similar problem that
> is yet to be fixed.
>
> Carlo
>
> [1] https://github.com/PCRE2Project/pcre2/commit/2c08b619dc973beacc474dcb67cda8cd366200ce

Thanks, Carlo.
I've made some small adjustments and tidied up the ChangeLog in the attached.
Hope to push it by Sunday.

There's enough going on via gnulib that I'll likely make yet another
snapshot with the very latest.

Also, there remain solaris sparc and i386 gnulib test failures:

    https://buildfarm.opencsw.org/buildbot/builders/ggrep-solaris10-sparc/builds/336
      FAIL: test-c-stack.sh
      FAIL: test-year2038

    https://buildfarm.opencsw.org/buildbot/builders/ggrep-solaris10-i386/builds/334
      FAIL: test-year2038

[grep-pcre2.diff (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62983; Package grep. (Sat, 29 Apr 2023 18:46:02 GMT) Full text and rfc822 format available.

Message #23 received at 62983 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 62983 <at> debbugs.gnu.org
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
Date: Sat, 29 Apr 2023 11:45:05 -0700

Just some nitpicking, but could we use single quotes around the '𝄞'
character in pcre-utf8-bug224 instead of double quotes?

Carlo

Information forwarded to bug-grep <at> gnu.org:
bug#62983; Package grep. (Sun, 30 Apr 2023 01:16:02 GMT) Full text and rfc822 format available.

Message #26 received at 62983 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>,
 Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
Cc: 62983 <at> debbugs.gnu.org
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
Date: Sat, 29 Apr 2023 18:15:05 -0700

[Message part 1 (text/plain, inline)]

On 2023-04-28 23:54, Jim Meyering wrote:
> I've made some small adjustments and tidied up the ChangeLog in the attached.

One question about that patch (both original and as revised). Why do we 
need a new binary_safe slot in struct pcre_comp? Shouldn't the 
binary_safe stuff be done at compile-time rather than run-time?

Proposed revised patch attached. It also tweaks commentary slightly, and 
uses a more uniform style in the test comments (something like what 
Carlo suggested, but a bit wordier since it names the characters).

[0001-pcre-work-around-a-PCRE2_MATCH_INVALID_UTF-bug.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62983; Package grep. (Sun, 30 Apr 2023 07:02:02 GMT) Full text and rfc822 format available.

Message #29 received at 62983 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>,
 62983 <at> debbugs.gnu.org
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
Date: Sun, 30 Apr 2023 09:01:26 +0200

On Sun, Apr 30, 2023 at 3:15 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> On 2023-04-28 23:54, Jim Meyering wrote:
> > I've made some small adjustments and tidied up the ChangeLog in the attached.
>
> One question about that patch (both original and as revised). Why do we
> need a new binary_safe slot in struct pcre_comp? Shouldn't the
> binary_safe stuff be done at compile-time rather than run-time?
>
> Proposed revised patch attached. It also tweaks commentary slightly, and
> uses a more uniform style in the test comments (something like what
> Carlo suggested, but a bit wordier since it names the characters).

Thanks, Paul. I prefer that. Pushed.
I've also pushed an update to use the latest from gnulib.

This bug report was last modified 1 year and 15 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #62983 workaround PCRE2 bug affecting at least \D and \W

GNU bug report logs - #62983
workaround PCRE2 bug affecting at least \D and \W