GNU bug report logs - #47264
RFE: pcre2 support

Previous Next

Package: grep;

Reported by: Jaroslav Skarvada <jskarvad <at> redhat.com>

Date: Fri, 19 Mar 2021 15:23:01 UTC

Severity: wishlist

Merged with 22345, 40395

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 47264 in the body.
You can then email your comments to 47264 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Fri, 19 Mar 2021 15:23:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Jaroslav Skarvada <jskarvad <at> redhat.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 19 Mar 2021 15:23:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jaroslav Skarvada <jskarvad <at> redhat.com>
To: bug-grep <at> gnu.org
Subject: RFE: pcre2 support
Date: Fri, 19 Mar 2021 11:22:15 -0400 (EDT)
Hi,

PCRE library has been superseded with PCRE2 project by PCRE
upstream in 2015. PCRE upstream considers PCRE obsolete now
and does not devote any resources to PCRE except of critical
bugs. Please consider adding PCRE2 support.

Downstream Fedora bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1755491

thanks & regards

Jaroslav





Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Fri, 19 Mar 2021 21:05:02 GMT) Full text and rfc822 format available.

Message #8 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: RFE: pcre2 support
Date: Fri, 19 Mar 2021 14:04:01 -0700
On 3/19/21 8:22 AM, Jaroslav Skarvada wrote:
> Please consider adding PCRE2 support.

Patches would be welcome. As I understand it, they're not trivial.




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 05 Apr 2021 08:22:02 GMT) Full text and rfc822 format available.

Message #11 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: L A Walsh <gnu <at> tlinx.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: RFE: pcre2 support
Date: Mon, 05 Apr 2021 01:21:28 -0700
On 2021/03/19 14:04, Paul Eggert wrote:
> On 3/19/21 8:22 AM, Jaroslav Skarvada wrote:
>   
>> Please consider adding PCRE2 support.
>>     
>
> Patches would be welcome. As I understand it, they're not trivial.
>   
---
   What are differences from 1->2 and wondering why they are
so non-trivial?  I guess there are some benefits in moving
to 2?

Mostly curious...
thanks


>
>
>   




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 05 Apr 2021 08:33:01 GMT) Full text and rfc822 format available.

Message #14 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Jeffrey Walton <noloader <at> gmail.com>
To: L A Walsh <gnu <at> tlinx.org>
Cc: 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: RFE: pcre2 support
Date: Mon, 5 Apr 2021 04:31:38 -0400
On Mon, Apr 5, 2021 at 4:22 AM L A Walsh <gnu <at> tlinx.org> wrote:
>
> On 2021/03/19 14:04, Paul Eggert wrote:
> > On 3/19/21 8:22 AM, Jaroslav Skarvada wrote:
> >
> >> Please consider adding PCRE2 support.
> >>
> >
> > Patches would be welcome. As I understand it, they're not trivial.
> >
> ---
>     What are differences from 1->2 and wondering why they are
> so non-trivial?  I guess there are some benefits in moving
> to 2?
>
> Mostly curious...

I took a crack at converting grep to PCRE2 a couple of months ago.

There are lots of API differences. I estimate about 80% of the code
was going to be modified or changed for PCRE2. There's also a data
structure that needs to be modified.

I could not find a good document explaining the conversion process
from PCRE to PCRE2. I finally gave up after about 4 hours because I
lacked the necessary PCRE experience.

Someone more experienced with PCRE or PCRE2 can probably perform the
task in a few hours.

Jeff




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 05 Apr 2021 16:59:01 GMT) Full text and rfc822 format available.

Message #17 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: bug-grep <at> gnu.org
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>
Subject: Re: bug#47264: RFE: pcre2 support
Date: Mon, 5 Apr 2021 09:58:39 -0700
On 4/5/21 1:31 AM, Jeffrey Walton wrote:
> I could not find a good document explaining the conversion process
> from PCRE to PCRE2. I finally gave up after about 4 hours because I
> lacked the necessary PCRE experience.

I had a similar experience years ago.

Part of my worry was that PCRE is squirrelly about dealing with 
improperly-coded data - it has undefined behavior in that case, which is 
unacceptable for 'grep'. So one must be verrrry careful in doing any API 
conversion. I didn't have the time, at the time. Still don't.




Severity set to 'wishlist' from 'normal' Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 09 Aug 2021 22:40:01 GMT) Full text and rfc822 format available.

Merged 22345 40395 47264. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 09 Aug 2021 22:40:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Sun, 07 Nov 2021 19:28:02 GMT) Full text and rfc822 format available.

Message #24 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
To: 47264 <at> debbugs.gnu.org
Cc: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
Subject: [PATCH] pcre: migrate to pcre2
Date: Sun,  7 Nov 2021 11:26:28 -0800
Mostly a bug by bug translation of the original code to the PCRE2 API.
but includes a couple of fixes as well that might be worth doing in
independent patches, if a straight translation is preferable.

Code still could do with some optimizations but should be good as a
starting point.

The API changes the sign of some types and therefore some ugly casts
were needed, some of the changes are just to make sure all variables
fit into the newer types better.

Includes backward compatibility and could be made to build all the way
to 10.00, but assumes a recent version ~10.30; the configure rule sets
a strict minimum of 10.34 as that is required to pass all tests, even
if the issues are minimal and likely to be real bugs that the old PCRE
just hide, there is likely more work pending in this area.

Performance seems equivalent, and it also seems functionally complete.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
---
 configure.ac             |   2 +-
 doc/grep.in.1            |   8 +-
 doc/grep.texi            |   2 +-
 m4/{pcre.m4 => pcre2.m4} |  23 ++--
 src/pcresearch.c         | 235 ++++++++++++++++++---------------------
 tests/filename-lineno.pl |   4 +-
 6 files changed, 126 insertions(+), 148 deletions(-)
 rename m4/{pcre.m4 => pcre2.m4} (66%)

diff --git a/configure.ac b/configure.ac
index c49ec4a..9291cee 100644
--- a/configure.ac
+++ b/configure.ac
@@ -197,7 +197,7 @@ if test "$ac_use_included_regex" = no; then
   AC_MSG_WARN([Included lib/regex.c not used])
 fi
 
-gl_FUNC_PCRE
+gl_FUNC_PCRE2
 AM_CONDITIONAL([USE_PCRE], [test $use_pcre = yes])
 
 case $host_os in
diff --git a/doc/grep.in.1 b/doc/grep.in.1
index b014f65..208cb76 100644
--- a/doc/grep.in.1
+++ b/doc/grep.in.1
@@ -756,7 +756,7 @@ In other implementations, basic regular expressions are less powerful.
 The following description applies to extended regular expressions;
 differences for basic regular expressions are summarized afterwards.
 Perl-compatible regular expressions give additional functionality, and are
-documented in B<pcresyntax>(3) and B<pcrepattern>(3), but work only if
+documented in B<pcre2syntax>(3) and B<pcre2pattern>(3), but work only if
 PCRE support is enabled.
 .PP
 The fundamental building blocks are the regular expressions
@@ -1360,9 +1360,9 @@ from the globbing syntax that the shell uses to match file names.
 .BR sort (1),
 .BR xargs (1),
 .BR read (2),
-.BR pcre (3),
-.BR pcresyntax (3),
-.BR pcrepattern (3),
+.BR pcre2 (3),
+.BR pcre2syntax (3),
+.BR pcre2pattern (3),
 .BR terminfo (5),
 .BR glob (7),
 .BR regex (7)
diff --git a/doc/grep.texi b/doc/grep.texi
index e5b9fd8..c3c4bbf 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1168,7 +1168,7 @@ In other implementations, basic regular expressions are less powerful.
 The following description applies to extended regular expressions;
 differences for basic regular expressions are summarized afterwards.
 Perl-compatible regular expressions give additional functionality, and
-are documented in the @i{pcresyntax}(3) and @i{pcrepattern}(3) manual
+are documented in the @i{pcre2syntax}(3) and @i{pcre2pattern}(3) manual
 pages, but work only if PCRE is available in the system.
 
 @menu
diff --git a/m4/pcre.m4 b/m4/pcre2.m4
similarity index 66%
rename from m4/pcre.m4
rename to m4/pcre2.m4
index 78b7fda..6822500 100644
--- a/m4/pcre.m4
+++ b/m4/pcre2.m4
@@ -1,15 +1,15 @@
-# pcre.m4 - check for libpcre support
+# pcre2.m4 - check for libpcre2 support
 
 # Copyright (C) 2010-2021 Free Software Foundation, Inc.
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
 # with or without modifications, as long as this notice is preserved.
 
-AC_DEFUN([gl_FUNC_PCRE],
+AC_DEFUN([gl_FUNC_PCRE2],
 [
   AC_ARG_ENABLE([perl-regexp],
     AS_HELP_STRING([--disable-perl-regexp],
-                   [disable perl-regexp (pcre) support]),
+                   [disable perl-regexp (pcre2) support]),
     [case $enableval in
        yes|no) test_pcre=$enableval;;
        *) AC_MSG_ERROR([invalid value $enableval for --disable-perl-regexp]);;
@@ -21,24 +21,25 @@ AC_DEFUN([gl_FUNC_PCRE],
   use_pcre=no
 
   if test $test_pcre != no; then
-    PKG_CHECK_MODULES([PCRE], [libpcre], [], [: ${PCRE_LIBS=-lpcre}])
+    PKG_CHECK_MODULES([PCRE], [libpcre2-8], [], [: ${PCRE_LIBS=-lpcre2-8}])
 
-    AC_CACHE_CHECK([for pcre_compile], [pcre_cv_have_pcre_compile],
+    AC_CACHE_CHECK([for pcre2_compile], [pcre_cv_have_pcre2_compile],
       [pcre_saved_CFLAGS=$CFLAGS
        pcre_saved_LIBS=$LIBS
        CFLAGS="$CFLAGS $PCRE_CFLAGS"
        LIBS="$PCRE_LIBS $LIBS"
        AC_LINK_IFELSE(
-         [AC_LANG_PROGRAM([[#include <pcre.h>
+         [AC_LANG_PROGRAM([[#define PCRE2_CODE_UNIT_WIDTH 8
+                            #include <pcre2.h>
                           ]],
-            [[pcre *p = pcre_compile (0, 0, 0, 0, 0);
+            [[pcre2_code *p = pcre2_compile (0, 0, PCRE2_MATCH_INVALID_UTF, 0, 0, 0);
               return !p;]])],
-         [pcre_cv_have_pcre_compile=yes],
-         [pcre_cv_have_pcre_compile=no])
+         [pcre_cv_have_pcre2_compile=yes],
+         [pcre_cv_have_pcre2_compile=no])
        CFLAGS=$pcre_saved_CFLAGS
        LIBS=$pcre_saved_LIBS])
 
-    if test "$pcre_cv_have_pcre_compile" = yes; then
+    if test "$pcre_cv_have_pcre2_compile" = yes; then
       use_pcre=yes
     elif test $test_pcre = maybe; then
       AC_MSG_WARN([AC_PACKAGE_NAME will be built without pcre support.])
@@ -50,7 +51,7 @@ AC_DEFUN([gl_FUNC_PCRE],
   if test $use_pcre = yes; then
     AC_DEFINE([HAVE_LIBPCRE], [1],
       [Define to 1 if you have the Perl Compatible Regular Expressions
-       library (-lpcre).])
+       library (-lpcre2).])
   else
     PCRE_CFLAGS=
     PCRE_LIBS=
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 3bdaee9..36171ed 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -17,41 +17,21 @@
    02110-1301, USA.  */
 
 /* Written August 1992 by Mike Haertel. */
+/* Updated for PCRE2 by Carlo Arenas. */
 
 #include <config.h>
 #include "search.h"
 #include "die.h"
 
-#include <pcre.h>
-
-/* This must be at least 2; everything after that is for performance
-   in pcre_exec.  */
-enum { NSUB = 300 };
-
-#ifndef PCRE_EXTRA_MATCH_LIMIT_RECURSION
-# define PCRE_EXTRA_MATCH_LIMIT_RECURSION 0
-#endif
-#ifndef PCRE_STUDY_JIT_COMPILE
-# define PCRE_STUDY_JIT_COMPILE 0
-#endif
-#ifndef PCRE_STUDY_EXTRA_NEEDED
-# define PCRE_STUDY_EXTRA_NEEDED 0
-#endif
+#define PCRE2_CODE_UNIT_WIDTH 8
+#include <pcre2.h>
 
 struct pcre_comp
 {
   /* Compiled internal form of a Perl regular expression.  */
-  pcre *cre;
-
-  /* Additional information about the pattern.  */
-  pcre_extra *extra;
-
-#if PCRE_STUDY_JIT_COMPILE
-  /* The JIT stack and its maximum size.  */
-  pcre_jit_stack *jit_stack;
-  int jit_stack_size;
-#endif
-
+  pcre2_code *cre;
+  pcre2_match_context *mcontext;
+  pcre2_match_data *data;
   /* Table, indexed by ! (flag & PCRE_NOTBOL), of whether the empty
      string matches when that flag is used.  */
   int empty_match[2];
@@ -60,51 +40,50 @@ struct pcre_comp
 
 /* Match the already-compiled PCRE pattern against the data in SUBJECT,
    of size SEARCH_BYTES and starting with offset SEARCH_OFFSET, with
-   options OPTIONS, and storing resulting matches into SUB.  Return
-   the (nonnegative) match location or a (negative) error number.  */
+   options OPTIONS.
+   Return the (nonnegative) match count or a (negative) error number.  */
 static int
 jit_exec (struct pcre_comp *pc, char const *subject, int search_bytes,
-          int search_offset, int options, int *sub)
+          int search_offset, uint32_t options)
 {
   while (true)
     {
-      int e = pcre_exec (pc->cre, pc->extra, subject, search_bytes,
-                         search_offset, options, sub, NSUB);
+      bool dynamic_jit_stack = false;
+      int e = pcre2_match (pc->cre, (PCRE2_SPTR8)subject, search_bytes,
+                           search_offset, options, pc->data, pc->mcontext);
+
+      if (e == PCRE2_ERROR_JIT_STACKLIMIT && dynamic_jit_stack)
+        return e;
 
-#if PCRE_STUDY_JIT_COMPILE
-      if (e == PCRE_ERROR_JIT_STACKLIMIT
-          && 0 < pc->jit_stack_size && pc->jit_stack_size <= INT_MAX / 2)
+      if (e == PCRE2_ERROR_JIT_STACKLIMIT)
         {
-          int old_size = pc->jit_stack_size;
-          int new_size = pc->jit_stack_size = old_size * 2;
-          if (pc->jit_stack)
-            pcre_jit_stack_free (pc->jit_stack);
-          pc->jit_stack = pcre_jit_stack_alloc (old_size, new_size);
-          if (!pc->jit_stack)
+          /* The PCRE documentation says that a 32 KiB stack is the default.  */
+          pcre2_jit_stack *s = pcre2_jit_stack_create (64 << 10, INT_MAX, NULL);
+
+          if (!pc->mcontext)
+            pc->mcontext = pcre2_match_context_create (NULL);
+
+          if (!pc->mcontext || !s)
             die (EXIT_TROUBLE, 0,
                  _("failed to allocate memory for the PCRE JIT stack"));
-          pcre_assign_jit_stack (pc->extra, NULL, pc->jit_stack);
+          pcre2_jit_stack_assign (pc->mcontext, NULL, s);
+          dynamic_jit_stack = true;
           continue;
         }
-#endif
-
-#if PCRE_EXTRA_MATCH_LIMIT_RECURSION
-      if (e == PCRE_ERROR_RECURSIONLIMIT
-          && (PCRE_STUDY_EXTRA_NEEDED || pc->extra))
+      if (e == PCRE2_ERROR_DEPTHLIMIT)
         {
-          unsigned long lim
-            = (pc->extra->flags & PCRE_EXTRA_MATCH_LIMIT_RECURSION
-               ? pc->extra->match_limit_recursion
-               : 0);
-          if (lim <= ULONG_MAX / 2)
-            {
-              pc->extra->match_limit_recursion = lim ? 2 * lim : (1 << 24) - 1;
-              pc->extra->flags |= PCRE_EXTRA_MATCH_LIMIT_RECURSION;
-              continue;
-            }
-        }
-#endif
+          uint32_t lim;
+          pcre2_config (PCRE2_CONFIG_DEPTHLIMIT, &lim);
+          if (lim > INT32_MAX)
+            return e;
 
+          lim *= 2;
+          if (!pc->mcontext)
+            pc->mcontext = pcre2_match_context_create (NULL);
+
+          pcre2_set_depth_limit (pc->mcontext, lim);
+          continue;
+        }
       return e;
     }
 }
@@ -115,27 +94,35 @@ jit_exec (struct pcre_comp *pc, char const *subject, int search_bytes,
 void *
 Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
 {
-  int e;
-  char const *ep;
+  PCRE2_SIZE e;
+  int ec;
+  PCRE2_UCHAR8 ep[128];
   static char const wprefix[] = "(?<!\\w)(?:";
   static char const wsuffix[] = ")(?!\\w)";
   static char const xprefix[] = "^(?:";
   static char const xsuffix[] = ")$";
   int fix_len_max = MAX (sizeof wprefix - 1 + sizeof wsuffix - 1,
                          sizeof xprefix - 1 + sizeof xsuffix - 1);
-  char *re = xnmalloc (4, size + (fix_len_max + 4 - 1) / 4);
-  int flags = PCRE_DOLLAR_ENDONLY | (match_icase ? PCRE_CASELESS : 0);
+  unsigned char *re = xnmalloc (4, size + (fix_len_max + 4 - 1) / 4);
+  uint32_t flags = PCRE2_DOLLAR_ENDONLY | (match_icase ? PCRE2_CASELESS : 0);
   char *patlim = pattern + size;
-  char *n = re;
-  char const *p;
-  char const *pnul;
+  char *n = (char *)re;
   struct pcre_comp *pc = xcalloc (1, sizeof (*pc));
+  pcre2_compile_context *ccontext = pcre2_compile_context_create(NULL);
 
   if (localeinfo.multibyte)
     {
       if (! localeinfo.using_utf8)
         die (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 locales"));
-      flags |= PCRE_UTF8;
+      flags |= PCRE2_UTF;
+#ifdef PCRE2_NEVER_BACKSLASH_C
+      /* do not match individual code units but only UTF-8  */
+      flags |= PCRE2_NEVER_BACKSLASH_C;
+#endif
+#ifdef PCRE2_MATCH_INVALID_UTF
+      /* consider invalid UTF-8 as a barrier, instead of error  */
+      flags |= PCRE2_MATCH_INVALID_UTF;
+#endif
     }
 
   /* FIXME: Remove this restriction.  */
@@ -148,56 +135,37 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
   if (match_lines)
     strcpy (n, xprefix);
   n += strlen (n);
-
-  /* The PCRE interface doesn't allow NUL bytes in the pattern, so
-     replace each NUL byte in the pattern with the four characters
-     "\000", removing a preceding backslash if there are an odd
-     number of backslashes before the NUL.  */
-  *patlim = '\0';
-  for (p = pattern; (pnul = p + strlen (p)) < patlim; p = pnul + 1)
+  memcpy (n, pattern, size);
+  n += size;
+  if (match_words && !match_lines)
     {
-      memcpy (n, p, pnul - p);
-      n += pnul - p;
-      for (p = pnul; pattern < p && p[-1] == '\\'; p--)
-        continue;
-      n -= (pnul - p) & 1;
-      strcpy (n, "\\000");
-      n += 4;
-    }
-  memcpy (n, p, patlim - p + 1);
-  n += patlim - p;
-  *patlim = '\n';
-
-  if (match_words)
     strcpy (n, wsuffix);
+    n += strlen(wsuffix);
+    }
   if (match_lines)
+    {
     strcpy (n, xsuffix);
+    n += strlen(xsuffix);
+    }
 
-  pc->cre = pcre_compile (re, flags, &ep, &e, pcre_maketables ());
+  pcre2_set_character_tables (ccontext, pcre2_maketables (NULL));
+  pc->cre = pcre2_compile (re, n - (char *)re, flags, &ec, &e, ccontext);
   if (!pc->cre)
-    die (EXIT_TROUBLE, 0, "%s", ep);
+    {
+      pcre2_get_error_message (ec, ep, 128);
+      die (EXIT_TROUBLE, 0, "%s", ep);
+    }
 
-  int pcre_study_flags = PCRE_STUDY_EXTRA_NEEDED | PCRE_STUDY_JIT_COMPILE;
-  pc->extra = pcre_study (pc->cre, pcre_study_flags, &ep);
-  if (ep)
-    die (EXIT_TROUBLE, 0, "%s", ep);
+  pc->data = pcre2_match_data_create_from_pattern (pc->cre, NULL);
 
-#if PCRE_STUDY_JIT_COMPILE
-  if (pcre_fullinfo (pc->cre, pc->extra, PCRE_INFO_JIT, &e))
-    die (EXIT_TROUBLE, 0, _("internal error (should never happen)"));
-
-  /* The PCRE documentation says that a 32 KiB stack is the default.  */
-  if (e)
-    pc->jit_stack_size = 32 << 10;
-#endif
+  ec = pcre2_jit_compile (pc->cre, PCRE2_JIT_COMPLETE);
+  if (ec && ec != PCRE2_ERROR_JIT_BADOPTION && ec != PCRE2_ERROR_NOMEMORY)
+    die (EXIT_TROUBLE, 0, _("JIT internal error: %d"), ec);
 
   free (re);
 
-  int sub[NSUB];
-  pc->empty_match[false] = pcre_exec (pc->cre, pc->extra, "", 0, 0,
-                                      PCRE_NOTBOL, sub, NSUB);
-  pc->empty_match[true] = pcre_exec (pc->cre, pc->extra, "", 0, 0, 0, sub,
-                                     NSUB);
+  pc->empty_match[false] = jit_exec (pc, "", 0, 0, PCRE2_NOTBOL);
+  pc->empty_match[true] = jit_exec (pc, "", 0, 0, 0);
 
   return pc;
 }
@@ -206,15 +174,15 @@ ptrdiff_t
 Pexecute (void *vcp, char const *buf, idx_t size, idx_t *match_size,
           char const *start_ptr)
 {
-  int sub[NSUB];
   char const *p = start_ptr ? start_ptr : buf;
   bool bol = p[-1] == eolbyte;
   char const *line_start = buf;
-  int e = PCRE_ERROR_NOMATCH;
+  int e = PCRE2_ERROR_NOMATCH;
   char const *line_end;
   struct pcre_comp *pc = vcp;
+  PCRE2_SIZE *sub = pcre2_get_ovector_pointer(pc->data);
 
-  /* The search address to pass to pcre_exec.  This is the start of
+  /* The search address to pass to PCRE.  This is the start of
      the buffer, or just past the most-recently discovered encoding
      error or line end.  */
   char const *subject = buf;
@@ -226,14 +194,14 @@ Pexecute (void *vcp, char const *buf, idx_t size, idx_t *match_size,
          better and the correctness issues were too puzzling.  See
          Bug#22655.  */
       line_end = rawmemchr (p, eolbyte);
-      if (INT_MAX < line_end - p)
+      if (PCRE2_SIZE_MAX < line_end - p)
         die (EXIT_TROUBLE, 0, _("exceeded PCRE's line length limit"));
 
       for (;;)
         {
           /* Skip past bytes that are easily determined to be encoding
              errors, treating them as data that cannot match.  This is
-             faster than having pcre_exec check them.  */
+             faster than having PCRE check them.  */
           while (localeinfo.sbclen[to_uchar (*p)] == -1)
             {
               p++;
@@ -241,10 +209,10 @@ Pexecute (void *vcp, char const *buf, idx_t size, idx_t *match_size,
               bol = false;
             }
 
-          int search_offset = p - subject;
+          PCRE2_SIZE search_offset = p - subject;
 
           /* Check for an empty match; this is faster than letting
-             pcre_exec do it.  */
+             PCRE do it.  */
           if (p == line_end)
             {
               sub[0] = sub[1] = search_offset;
@@ -252,15 +220,15 @@ Pexecute (void *vcp, char const *buf, idx_t size, idx_t *match_size,
               break;
             }
 
-          int options = 0;
+          uint32_t options = 0;
           if (!bol)
-            options |= PCRE_NOTBOL;
+            options |= PCRE2_NOTBOL;
 
-          e = jit_exec (pc, subject, line_end - subject, search_offset,
-                        options, sub);
-          if (e != PCRE_ERROR_BADUTF8)
+          e = jit_exec (pc, subject, line_end - subject,
+                        search_offset, options);
+          if (PCRE2_ERROR_UTF8_ERR1 <= e || e < PCRE2_ERROR_UTF8_ERR21)
             break;
-          int valid_bytes = sub[0];
+          PCRE2_SIZE valid_bytes = sub[0];
 
           if (search_offset <= valid_bytes)
             {
@@ -275,9 +243,9 @@ Pexecute (void *vcp, char const *buf, idx_t size, idx_t *match_size,
                 }
               else
                 e = jit_exec (pc, subject, valid_bytes, search_offset,
-                              options | PCRE_NO_UTF8_CHECK | PCRE_NOTEOL, sub);
+                              options | PCRE2_NO_UTF_CHECK | PCRE2_NOTEOL);
 
-              if (e != PCRE_ERROR_NOMATCH)
+              if (e != PCRE2_ERROR_NOMATCH)
                 break;
 
               /* Treat the encoding error as data that cannot match.  */
@@ -288,7 +256,7 @@ Pexecute (void *vcp, char const *buf, idx_t size, idx_t *match_size,
           subject += valid_bytes + 1;
         }
 
-      if (e != PCRE_ERROR_NOMATCH)
+      if (e != PCRE2_ERROR_NOMATCH)
         break;
       bol = true;
       p = subject = line_start = line_end + 1;
@@ -299,26 +267,35 @@ Pexecute (void *vcp, char const *buf, idx_t size, idx_t *match_size,
     {
       switch (e)
         {
-        case PCRE_ERROR_NOMATCH:
+        case PCRE2_ERROR_NOMATCH:
           break;
 
-        case PCRE_ERROR_NOMEMORY:
+        case PCRE2_ERROR_NOMEMORY:
           die (EXIT_TROUBLE, 0, _("%s: memory exhausted"), input_filename ());
 
-#if PCRE_STUDY_JIT_COMPILE
-        case PCRE_ERROR_JIT_STACKLIMIT:
+        case PCRE2_ERROR_JIT_STACKLIMIT:
           die (EXIT_TROUBLE, 0, _("%s: exhausted PCRE JIT stack"),
                input_filename ());
-#endif
 
-        case PCRE_ERROR_MATCHLIMIT:
+        case PCRE2_ERROR_MATCHLIMIT:
           die (EXIT_TROUBLE, 0, _("%s: exceeded PCRE's backtracking limit"),
                input_filename ());
 
-        case PCRE_ERROR_RECURSIONLIMIT:
-          die (EXIT_TROUBLE, 0, _("%s: exceeded PCRE's recursion limit"),
+        case PCRE2_ERROR_DEPTHLIMIT:
+          die (EXIT_TROUBLE, 0,
+               _("%s: exceeded PCRE's nested backtracking limit"),
                input_filename ());
 
+        case PCRE2_ERROR_RECURSELOOP:
+          die (EXIT_TROUBLE, 0, _("%s: PCRE detected recurse loop"),
+               input_filename ());
+
+#ifdef PCRE2_ERROR_HEAPLIMIT
+        case PCRE2_ERROR_HEAPLIMIT:
+          die (EXIT_TROUBLE, 0, _("%s: exceeded PCRE's heap limit"),
+               input_filename ());
+#endif
+
         default:
           /* For now, we lump all remaining PCRE failures into this basket.
              If anyone cares to provide sample grep usage that can trigger
diff --git a/tests/filename-lineno.pl b/tests/filename-lineno.pl
index 1e84b45..1ff3d6a 100755
--- a/tests/filename-lineno.pl
+++ b/tests/filename-lineno.pl
@@ -101,13 +101,13 @@ my @Tests =
    ],
    ['invalid-re-P-paren', '-P ")"', {EXIT=>2},
     {ERR => $ENV{PCRE_WORKS} == 1
-       ? "$prog: unmatched parentheses\n"
+       ? "$prog: unmatched closing parenthesis\n"
        : $no_pcre
     },
    ],
    ['invalid-re-P-star-paren', '-P "a.*)"', {EXIT=>2},
     {ERR => $ENV{PCRE_WORKS} == 1
-       ? "$prog: unmatched parentheses\n"
+       ? "$prog: unmatched closing parenthesis\n"
        : $no_pcre
     },
    ],
-- 
2.34.0.rc0.327.g197905bccb





Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 08 Nov 2021 00:31:01 GMT) Full text and rfc822 format available.

Message #27 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>,
 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH] pcre: migrate to pcre2
Date: Sun, 7 Nov 2021 16:30:30 -0800
On 11/7/21 11:26, Carlo Marcelo Arenas Belón wrote:
> Mostly a bug by bug translation of the original code to the PCRE2 API.
> but includes a couple of fixes as well that might be worth doing in
> independent patches, if a straight translation is preferable.

Yes, that might be preferable so that if there are problems we can 
bisect better.


> The API changes the sign of some types and therefore some ugly casts
> were needed,

I see one ugly cast from char const * to PCRE2_SPTR8; it seems 
unavoidable unless we go through void * (and I see little point to doing 
that). Were there some other ugly casts that I missed?

> some of the changes are just to make sure all variables
> fit into the newer types better.

Yes, we do want to avoid overflow bugs.

> Includes backward compatibility and could be made to build all the way
> to 10.00, but assumes a recent version ~10.30; the configure rule sets
> a strict minimum of 10.34 as that is required to pass all tests, even
> if the issues are minimal and likely to be real bugs that the old PCRE
> just hide, there is likely more work pending in this area.

pcre2 10.34 is two years old. Seems old enough to me (tho perhaps others 
can chime in).

> -          int search_offset, int options, int *sub)
> +          int search_offset, uint32_t options)

Let's stick with 'int' unless we absolutely need uint32_t. uint32_t is 
ugly and is not required by the C standard and although PCRE2 evidently 
requires it, I'd rather stick to signed types for the usual reasons.

> +          if (lim > INT32_MAX)
> +            return e;
> +          lim *= 2;

This should use UINT32_MAX / 2 instead of INT32_MAX (the two expressions 
have different values on some oddball platforms). Better yet, replace 
the whole thing with "if (INT_MULTIPLY_WRAPV (lim, 2, &lim)) return e;".

> +  unsigned char *re = xnmalloc (4, size + (fix_len_max + 4 - 1) / 4);

We don't need to multiply by 4 any more, right? Because we no longer 
need to do the trick of replacing NUL with \000 in the pattern.

> +      flags |= PCRE2_UTF;
> +#ifdef PCRE2_NEVER_BACKSLASH_C
> +      /* do not match individual code units but only UTF-8  */
> +      flags |= PCRE2_NEVER_BACKSLASH_C;
> +#endif
> +#ifdef PCRE2_MATCH_INVALID_UTF
> +      /* consider invalid UTF-8 as a barrier, instead of error  */
> +      flags |= PCRE2_MATCH_INVALID_UTF;
> +#endif

Which versions of PCRE2 lack these flags? Should we bother to support 
these old versions?

If memory serves grep currently takes care to not pass invalid UTF-8 in 
the buffer or pattern. Does PCRE2_MATCH_INVALID_UTF make this work obsolete?

> +  PCRE2_UCHAR8 ep[128];

How do we know 128 is long enough? Deserves a comment.

> Performance seems equivalent, and it also seems functionally complete.

Very good news. Thanks for working on this.




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 08 Nov 2021 03:25:01 GMT) Full text and rfc822 format available.

Message #30 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>,
 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH] pcre: migrate to pcre2
Date: Sun, 7 Nov 2021 19:24:02 -0800
Thanks Carlos for working on that and Paul for the speedy feedback!
I won't be able to spend time on this for the next couple of weeks.




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 08 Nov 2021 09:48:02 GMT) Full text and rfc822 format available.

Message #33 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH] pcre: migrate to pcre2
Date: Mon, 8 Nov 2021 01:47:25 -0800
On Sun, Nov 7, 2021 at 4:30 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> On 11/7/21 11:26, Carlo Marcelo Arenas Belón wrote:
> > Mostly a bug by bug translation of the original code to the PCRE2 API.
> > but includes a couple of fixes as well that might be worth doing in
> > independent patches, if a straight translation is preferable.
>
> Yes, that might be preferable so that if there are problems we can
> bisect better.

Working on it, I will open independent bugs then for them, but I have
to admit I am a little concerned that since they are all touching the
same code it might be more difficult to keep track of the end result
and the need to merge conflict resolution.

Let me know how to help otherwise.

> > Includes backward compatibility and could be made to build all the way
> > to 10.00, but assumes a recent version ~10.30; the configure rule sets
> > a strict minimum of 10.34 as that is required to pass all tests, even
> > if the issues are minimal and likely to be real bugs that the old PCRE
> > just hide, there is likely more work pending in this area.
>
> pcre2 10.34 is two years old. Seems old enough to me (tho perhaps others
> can chime in).

Ironically, I am the user of one of those, as my debian10 developer
box uses 10.32, and indeed CentOS 7 is on even an older 10.23.
Forcing 10.34 to compile would allow us to clean up the code
significantly, and was what I was aiming for originally, but it seems
that the added support needed for older versions isn't that difficult
either.
I didn't want anyone hitting on those old PCRE2 bugs though with this
first release, hence why the configure rule is there for now (even if
I am likely going to remove it for the next version)

> > +  unsigned char *re = xnmalloc (4, size + (fix_len_max + 4 - 1) / 4);
>
> We don't need to multiply by 4 any more, right? Because we no longer
> need to do the trick of replacing NUL with \000 in the pattern.

I couldn't find a rationale for it in the emails or commit history,
but will clean it up with the other suggestions for next release.

> > +      flags |= PCRE2_UTF;
> > +#ifdef PCRE2_NEVER_BACKSLASH_C
> > +      /* do not match individual code units but only UTF-8  */
> > +      flags |= PCRE2_NEVER_BACKSLASH_C;
> > +#endif
> > +#ifdef PCRE2_MATCH_INVALID_UTF
> > +      /* consider invalid UTF-8 as a barrier, instead of error  */
> > +      flags |= PCRE2_MATCH_INVALID_UTF;
> > +#endif
>
> Which versions of PCRE2 lack these flags? Should we bother to support
> these old versions?

PCRE2_NEVER_BACKSLASH_C is from 10.20 (~2015) and likely to be
everywhere, the #ifdef is just to help backporters to even earlier
versions (10.00 requires even more of those which I didn't include
here).

Interestingly enough, the CentOS 7 test (which was patched on top of
10.37, because current git doesn't build there anymore) passed all
tests, and only needed a few more of those.

\C is supported with -P in the PCRE version now though, is removing that ok?

> If memory serves grep currently takes care to not pass invalid UTF-8 in
> the buffer or pattern. Does PCRE2_MATCH_INVALID_UTF make this work obsolete?

not sure I understand what you mean, but PCRE2_MATCH_INVALID_UTF is
definitely something that helps with binary search because it will try
harder to do matches in the text found, instead of bailing out with an
error.

it still has to do a scan and verify UTF-8 is valid, so you still have
the performance hit that doing binary matching with a C locale avoids.

Carlo




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 08 Nov 2021 19:54:02 GMT) Full text and rfc822 format available.

Message #36 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH] pcre: migrate to pcre2
Date: Mon, 8 Nov 2021 11:53:47 -0800
On 11/8/21 01:47, Carlo Arenas wrote:
> On Sun, Nov 7, 2021 at 4:30 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Let me know how to help otherwise.

The main thing from my point of view is that I'd like to know what those 
other bugs are. I can split out the patch into pieces if I know what to 
look for.

> I didn't want anyone hitting on those old PCRE2 bugs though with this
> first release, hence why the configure rule is there for now (even if
> I am likely going to remove it for the next version)

If it's a PCRE2 bug we can ask people to fix it in their PCRE2 library.

Possibly we should continue to support PCRE1 as a configure-time option; 
that would assuage concerns about bugs in PCRE2. More work for us, though.

> \C is supported with -P in the PCRE version now though, is removing that ok?

I guess I don't see the harm of supporting \C; why disable it?

>> If memory serves grep currently takes care to not pass invalid UTF-8 in
>> the buffer or pattern. Does PCRE2_MATCH_INVALID_UTF make this work obsolete?
> 
> not sure I understand what you mean

I guess I was thinking about an older grep version.

Currently grep compiles with PCRE_UTF8 and checks for PCRE_ERROR_BADUTF8 
returns from pcre_exec, so it's relying on the count of bytes that this 
pcre_exec returns in sub[0] before calling pcre_exec with 
PCRE_NO_UTF8_CHECK. So, effectively it's using pcre_exec to check that a 
buffer contains valid UTF-8.

I don't see how this works with the proposed patch. It uses sub[0] but I 
don't see how it's set. What am I missing?

One more thing I just noticed: this test:

  if (PCRE2_ERROR_UTF8_ERR1 <= e || e < PCRE2_ERROR_UTF8_ERR21)

is logically equivalent to the following (which is clearer to me):

  if (! (PCRE2_ERROR_UTF8_ERR21 <= e && e < PCRE2_ERROR_UTF8_ERR1))

Shouldn't that be the following instead?

  if (! (PCRE2_ERROR_UTF8_ERR21 <= e && e <= PCRE2_ERROR_UTF8_ERR1))




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 08 Nov 2021 22:31:01 GMT) Full text and rfc822 format available.

Message #39 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH] pcre: migrate to pcre2
Date: Mon, 8 Nov 2021 14:28:11 -0800
On Mon, Nov 8, 2021 at 11:53 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> On 11/8/21 01:47, Carlo Arenas wrote:
> > On Sun, Nov 7, 2021 at 4:30 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> > Let me know how to help otherwise.
>
> The main thing from my point of view is that I'd like to know what those
> other bugs are. I can split out the patch into pieces if I know what to
> look for.

The way pcre_jit_stack_alloc is called is incorrect and could leak if
new_size is not enough and PCRE_ERROR_JIT_STACKLIMIT is triggered
again.  The new code calls it only once and sets the maximum to the
same maximum the old code would do after looping.

It doesn't handle silent failures from pcre_study() from old versions,
where JIT might need to be disabled because no RWX page could be
allocated (ex: SELinux enabled and enforcing).

This last one might be better to fix in PCRE2 as it is also version
dependent and behaves differently between PCRE and PCRE2 and the PCRE
version hasn't been needed and won't be needed after the migration,
though.

> > I didn't want anyone hitting on those old PCRE2 bugs though with this
> > first release, hence why the configure rule is there for now (even if
> > I am likely going to remove it for the next version)
>
> If it's a PCRE2 bug we can ask people to fix it in their PCRE2 library.

agree, but it might be worth adding some functionality to allow
workarounds (ex: disabling JIT, which is currently enabled
unconditionally)

> Possibly we should continue to support PCRE1 as a configure-time option;
> that would assuage concerns about bugs in PCRE2. More work for us, though.

We did that in git for a while, but I wouldn't do it now, considering
that PCRE1 is no longer maintained and is arguably more buggy than
PCRE2.

Chances are there is a bugfix available in PCRE2 already that needs
backporting if upgrading their version is not possible.

> > \C is supported with -P in the PCRE version now though, is removing that ok?
>
> I guess I don't see the harm of supporting \C; why disable it?

It is not very useful and interacts badly with the state of UTF-8
matching as it will leave with a partial codeset, which is then
incorrect UTF-8 (if lucky) or could be interpreted incorrectly.

my assumption was that it wasn't disabled by accident, since old PCRE
also default to non UTF-8 by default at compile time, unlike what was
done in PCRE2.

> >> If memory serves grep currently takes care to not pass invalid UTF-8 in
> >> the buffer or pattern. Does PCRE2_MATCH_INVALID_UTF make this work obsolete?
> >
> > not sure I understand what you mean
>
> I guess I was thinking about an older grep version.
>
> Currently grep compiles with PCRE_UTF8 and checks for PCRE_ERROR_BADUTF8
> returns from pcre_exec, so it's relying on the count of bytes that this
> pcre_exec returns in sub[0] before calling pcre_exec with
> PCRE_NO_UTF8_CHECK. So, effectively it's using pcre_exec to check that a
> buffer contains valid UTF-8.

note that it does its own skipping first, but could rely on PCRE2 and
PCRE2_MATCH_INVALID_UTF instead.

> I don't see how this works with the proposed patch. It uses sub[0] but I
> don't see how it's set. What am I missing?

a fix that was on my next version, but I got stuck in the refactoring,
should be (from master) :

-          int valid_bytes = sub[0];
+          PCRE2_SIZE valid_bytes = sub[0] = pcre2_get_startchar (pc->data);

it is set by the pcre2_match() call (which is equivalent to
pcre-exec), but setting it on bad UTF-8 (which is only reported in
versions that don't have PCRE2_MATCH_INVALID_UTF

> One more thing I just noticed: this test:
>
>    if (PCRE2_ERROR_UTF8_ERR1 <= e || e < PCRE2_ERROR_UTF8_ERR21)
>
> is logically equivalent to the following (which is clearer to me):
>
>    if (! (PCRE2_ERROR_UTF8_ERR21 <= e && e < PCRE2_ERROR_UTF8_ERR1))
>
> Shouldn't that be the following instead?
>
>    if (! (PCRE2_ERROR_UTF8_ERR21 <= e && e <= PCRE2_ERROR_UTF8_ERR1))

Makes sense, will include and add a comment to explain better the translation.

Carlo

[1] https://www.pcre.org/original/doc/html/pcre_jit_stack_alloc.html




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Tue, 09 Nov 2021 11:00:02 GMT) Full text and rfc822 format available.

Message #42 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
To: 47264 <at> debbugs.gnu.org
Subject: [PATCH v2] pcre: migrate to pcre2
Date: Tue, 9 Nov 2021 02:58:49 -0800
[Message part 1 (text/plain, inline)]
Sadly, hadn't been able to generate a release, so testing is somehow
limited to platforms where I can hack a bootstrap (macOS / a few Linux),
but it seems to be ready for some broader testing, specially if the
attached patch is applied on top of a 10.37 release (tested that way
in OpenBSD i386)

Changes since v1:
Resolved:
* Most of the suggested changes by Paul
* no more version restrictions (should work with >~10.20)
* BUG in handling of malformed UTF-8 for PCRE2 < 10.34
Pending:
* what to do with the current support of \C (enabled for now)
* merge of non critical bugfix in #51710[1]

Carlo

[1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=51710
[v2-0001-pcre-migrate-to-pcre2.patch (text/plain, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sun, 14 Nov 2021 20:46:02 GMT) Full text and rfc822 format available.

Notification sent to Jaroslav Skarvada <jskarvad <at> redhat.com>:
bug acknowledged by developer. (Sun, 14 Nov 2021 20:46:02 GMT) Full text and rfc822 format available.

Message #47 received at 47264-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264-done <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 12:45:29 -0800
[Message part 1 (text/plain, inline)]
On 11/9/21 02:58, Carlo Marcelo Arenas Belón wrote:
> Sadly, hadn't been able to generate a release,

Does this mean you're having trouble running 'make dist'? If so, what's 
the trouble?


> it seems to be ready for some broader testing, specially if the
> attached patch is applied on top of a 10.37 release (tested that way
> in OpenBSD i386)

OK, thanks, I installed it into the Savannah master copy of GNU grep, 
except that I didn't rename m4/pcre.m4 to m4/pcre2.m4, or rename the 
macros to use PCRE2. This made the change easier to audit. Revised patch 
0001 attached.

Also, I followed up with several related patches (also attached as 
0002-0012). Please take a look at them and let us know of any problems. 
In the attached patch "grep: prefer signed integers" I followed the 
usual grep approach of preferring signed to unsigned integers (e.g., 
idx_t to size_t) when either will do; this lets us debug better with 
-fsanitize=undefined to catch integer overflow.

One issue I discovered: PCRE2_EXTRA_MATCH_WORD (which is used by 
pcre2grep -w) is incompatible with 'grep -w'. For example, 'echo a%%a | 
grep -Pw %%' outputs nothing, whereas 'echo a%%a | pcre2grep -w %%' 
outputs 'a%%a'. I think the GNU grep behavior (which is the same as with 
'grep -w', either on Linux or OpenBSD) is more intuitive here: do you 
happen to know why PCRE behaves the way it does? Is that worth a PCRE2 
bug report? Anyway, the attached patches avoid using 
PCRE2_EXTRA_MATCH_WORD for that reason.


> * no more version restrictions (should work with >~10.20)

I tested with 10.00 and found one more glitch (it doesn't have 
PCRE2_SIZE_MAX), which is fixed by the attached patch "grep: port to 
PCRE2 10.20".


> Pending:
> * what to do with the current support of \C (enabled for now)

Let's open another bug report for that; I'm still a bit fuzzy about what 
the pros and cons are.


> * merge of non critical bugfix in #51710[1]

I plan to follow up in that bug report.

Marking this bug as done. Thanks again for working on this.
[0001-grep-migrate-to-pcre2.patch (text/x-patch, attachment)]
[0002-maint-minor-rewording-and-reindenting.patch (text/x-patch, attachment)]
[0003-grep-Don-t-limit-jitstack_max-to-INT_MAX.patch (text/x-patch, attachment)]
[0004-grep-improve-pcre2_get_error_message-comments.patch (text/x-patch, attachment)]
[0005-grep-speed-up-fix-bad-UTF8-check-with-P.patch (text/x-patch, attachment)]
[0006-grep-prefer-signed-integers.patch (text/x-patch, attachment)]
[0007-grep-use-PCRE2_EXTRA_MATCH_LINE.patch (text/x-patch, attachment)]
[0008-grep-simplify-JIT-setup.patch (text/x-patch, attachment)]
[0009-grep-improve-memory-exhaustion-checking-with-P.patch (text/x-patch, attachment)]
[0010-grep-use-ximalloc-not-xcalloc.patch (text/x-patch, attachment)]
[0011-grep-fix-minor-P-memory-leak.patch (text/x-patch, attachment)]
[0012-grep-port-to-PCRE2-10.20.patch (text/x-patch, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sun, 14 Nov 2021 20:46:02 GMT) Full text and rfc822 format available.

Notification sent to Jim Meyering <jim <at> meyering.net>:
bug acknowledged by developer. (Sun, 14 Nov 2021 20:46:02 GMT) Full text and rfc822 format available.

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sun, 14 Nov 2021 20:46:03 GMT) Full text and rfc822 format available.

Notification sent to noloader <at> gmail.com:
bug acknowledged by developer. (Sun, 14 Nov 2021 20:46:03 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Sun, 14 Nov 2021 22:26:02 GMT) Full text and rfc822 format available.

Message #60 received at 47264-done <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264-done <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 14:25:42 -0800
On Sun, Nov 14, 2021 at 12:45 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> On 11/9/21 02:58, Carlo Marcelo Arenas Belón wrote:
> > Sadly, hadn't been able to generate a release,
>
> Does this mean you're having trouble running 'make dist'? If so, what's
> the trouble?

I seem to be unlucky; getting certificate errors in Debian sid, FTBFS
errors when building the info in macOS, but the latest master was able
to run `make dist` successfully in Debian 10, so it is just likely a
PBKAC problem.

> Also, I followed up with several related patches (also attached as
> 0002-0012). Please take a look at them and let us know of any problems.
> In the attached patch "grep: prefer signed integers" I followed the
> usual grep approach of preferring signed to unsigned integers (e.g.,
> idx_t to size_t) when either will do; this lets us debug better with
> -fsanitize=undefined to catch integer overflow.

the one in patch6 where a uint32_t option is doubled, triggers
warnings because of comparing an unsigned variable with 0 AFAIK, but
there are several of those in the upstream gnulib so presumably not a
concern?

using idx_t instead of size_t should be fine (if only halves the max
size of the objects managed), but I am concerned that assuming
PCRE2_SIZE_MAX is always equivalent to SIZE_MAX (as done in patch 4)
might be risky (at least without a comment), and considering that is
part of the API anyway might be better if kept as PCRE2_SIZE_MAX IMHO.

> One issue I discovered: PCRE2_EXTRA_MATCH_WORD (which is used by
> pcre2grep -w) is incompatible with 'grep -w'. For example, 'echo a%%a |
> grep -Pw %%' outputs nothing, whereas 'echo a%%a | pcre2grep -w %%'
> outputs 'a%%a'. I think the GNU grep behavior (which is the same as with
> 'grep -w', either on Linux or OpenBSD) is more intuitive here: do you
> happen to know why PCRE behaves the way it does? Is that worth a PCRE2
> bug report? Anyway, the attached patches avoid using
> PCRE2_EXTRA_MATCH_WORD for that reason.

As I mentioned before, PCRE matches the Perl definition as mentioned
before in an early draft that also had this change reversed.
I would suggest instead that -P should also follow perl convention
instead when used together with -w, but maybe that is something that a
-P feature flag could enable or disable as needed?

Note that "word" definition also has a different meaning in a post
Unicode world, and so I expect that will have to change eventually as
well.

Carlo




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Sun, 14 Nov 2021 22:46:01 GMT) Full text and rfc822 format available.

Message #63 received at 47264-done <at> debbugs.gnu.org (full text, mbox):

From: Jeffrey Walton <noloader <at> gmail.com>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: 47264-done <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 17:44:53 -0500
On Sun, Nov 14, 2021 at 5:26 PM Carlo Arenas <carenas <at> gmail.com> wrote:
> On Sun, Nov 14, 2021 at 12:45 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > ...
> using idx_t instead of size_t should be fine (if only halves the max
> size of the objects managed), but I am concerned that assuming
> PCRE2_SIZE_MAX is always equivalent to SIZE_MAX (as done in patch 4)
> might be risky (at least without a comment), and considering that is
> part of the API anyway might be better if kept as PCRE2_SIZE_MAX IMHO.

GNU Hurd may not have SIZE_MAX defined. The Hurd folks
{sometimes|often} want developers to make a runtime call for the
limit.

See docs like https://www.gnu.org/software/libc/manual/html_node/File-Minimums.html
and https://pubs.opengroup.org/onlinepubs/009695399/basedefs/limits.h.html.

Jeff




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Sun, 14 Nov 2021 22:51:02 GMT) Full text and rfc822 format available.

Message #66 received at 47264-done <at> debbugs.gnu.org (full text, mbox):

From: Jeffrey Walton <noloader <at> gmail.com>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: 47264-done <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 17:49:32 -0500
On Sun, Nov 14, 2021 at 5:26 PM Carlo Arenas <carenas <at> gmail.com> wrote:
> ...
> I seem to be unlucky; getting certificate errors in Debian sid, FTBFS
> errors when building the info in macOS, but the latest master was able
> to run `make dist` successfully in Debian 10, so it is just likely a
> PBKAC problem.

'sudo dpkg-reconfigure ca-certificates' may help. I needed it a few
months ago to use Wget with GitHub. Also see
https://www.mail-archive.com/ubuntu-devel-discuss <at> lists.ubuntu.com/msg18883.html.

Jeff




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Sun, 14 Nov 2021 23:19:02 GMT) Full text and rfc822 format available.

Message #69 received at 47264-done <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: noloader <at> gmail.com
Cc: 47264-done <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 15:18:37 -0800
On Sun, Nov 14, 2021 at 2:45 PM Jeffrey Walton <noloader <at> gmail.com> wrote:
>
> On Sun, Nov 14, 2021 at 5:26 PM Carlo Arenas <carenas <at> gmail.com> wrote:
> > On Sun, Nov 14, 2021 at 12:45 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > > ...
> > using idx_t instead of size_t should be fine (if only halves the max
> > size of the objects managed), but I am concerned that assuming
> > PCRE2_SIZE_MAX is always equivalent to SIZE_MAX (as done in patch 4)
> > might be risky (at least without a comment), and considering that is
> > part of the API anyway might be better if kept as PCRE2_SIZE_MAX IMHO.
>
> GNU Hurd may not have SIZE_MAX defined. The Hurd folks
> {sometimes|often} want developers to make a runtime call for the
> limit.

don't have any GNU Hurd system to test on, but would assume that in
that case PCRE2 failed to build as well and then this code will never
be compiled.

Either way my suggestion to use PCRE2_SIZE_MAX instead (which should
work, except on ancient 10.00 or so) sounds even better, and then the
fix could be made in PCRE2 for both.

Carlo




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 15 Nov 2021 01:02:01 GMT) Full text and rfc822 format available.

Message #72 received at 47264-done <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: noloader <at> gmail.com
Cc: 47264-done <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 17:01:20 -0800
On Sun, Nov 14, 2021 at 3:18 PM Carlo Arenas <carenas <at> gmail.com> wrote:
> On Sun, Nov 14, 2021 at 2:45 PM Jeffrey Walton <noloader <at> gmail.com> wrote:
> > On Sun, Nov 14, 2021 at 5:26 PM Carlo Arenas <carenas <at> gmail.com> wrote:
> > > On Sun, Nov 14, 2021 at 12:45 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > > > ...
> > > using idx_t instead of size_t should be fine (if only halves the max
> > > size of the objects managed), but I am concerned that assuming
> > > PCRE2_SIZE_MAX is always equivalent to SIZE_MAX (as done in patch 4)
> > > might be risky (at least without a comment), and considering that is
> > > part of the API anyway might be better if kept as PCRE2_SIZE_MAX IMHO.
> >
> > GNU Hurd may not have SIZE_MAX defined. The Hurd folks
> > {sometimes|often} want developers to make a runtime call for the
> > limit.
>
> don't have any GNU Hurd system to test on, but would assume that in
> that case PCRE2 failed to build as well and then this code will never
> be compiled.

FWIW got Debian GNU Hurd and it has pcre2 and builds fine against it
because SIZE_MAX is where POSIX said it should (stdint.h)

Carlo




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 15 Nov 2021 03:07:01 GMT) Full text and rfc822 format available.

Message #75 received at 47264-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: noloader <at> gmail.com, Carlo Arenas <carenas <at> gmail.com>
Cc: 47264-done <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 19:06:47 -0800
On 11/14/21 14:44, Jeffrey Walton wrote:
> GNU Hurd may not have SIZE_MAX defined.

I doubt whether that's an issue with recent Hurd. And even if it is, 
Gnulib will arrange for a substitute SIZE_MAX in stdint.h (this is for 
ancient, pre-C99 compilers) that will work just fine. So we're OK here.




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 15 Nov 2021 03:19:01 GMT) Full text and rfc822 format available.

Message #78 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 19:17:58 -0800
On 11/14/21 14:25, Carlo Arenas wrote:

> the one in patch6 where a uint32_t option is doubled, triggers
> warnings because of comparing an unsigned variable with 0 AFAIK, but
> there are several of those in the upstream gnulib so presumably not a
> concern?

Yes, that's right. intprops.h can generate tons of bogus warnings with 
older or non-GCC compilers. We typically don't worry about those 
warnings. Recent GCC should be OK here.


> using idx_t instead of size_t should be fine (if only halves the max
> size of the objects managed), but I am concerned that assuming
> PCRE2_SIZE_MAX is always equivalent to SIZE_MAX (as done in patch 4)
> might be risky (at least without a comment), and considering that is
> part of the API anyway might be better if kept as PCRE2_SIZE_MAX IMHO.

This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for 
forward compatibility to a potential future version of PCRE2 that may 
define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier 
PCRE2_SIZE is hardwired to size_t, so there is only one plausible 
default for PCRE2_SIZE_MAX, namely SIZE_MAX.


> As I mentioned before, PCRE matches the Perl definition as mentioned
> before in an early draft that also had this change reversed.

I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the 
pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't 
correspond to the intuitive meaning of "match words", and it doesn't 
correspond to how grep -w behaves for any grep that I know of.

Which "early draft" are you talking about? This appears to be merely a 
bug in libpcre2's documentation and implementation.


> I would suggest instead that -P should also follow perl convention
> instead when used together with -w, but maybe that is something that a
> -P feature flag could enable or disable as needed?

I can't imagine anybody intuitively saying in an English locale that 
"%%" is a word in the string "aa%%aa". PCRE2 is broken, that's all. If a 
user really wants PCRE2's buggy interpretation, they can simply surround 
their regexp with "\b(?:" and ")\b" and not use -w; so there's no need 
to have a different flag for pcre2grep's bizarre interpretation of -w.

Here's another reason why pcre2grep -w is obviously busted:

$ pcre2grep -w ',' <<'EOF'
> a,a
> a, a
> a,
> EOF
a,a

Why is "," a word in the first input line, but not in the second or 
third? pcre2grep is simply wrong here.


> Note that "word" definition also has a different meaning in a post
> Unicode world

Yes, but that's an independent issue.




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 15 Nov 2021 04:45:01 GMT) Full text and rfc822 format available.

Message #81 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 20:44:33 -0800
On Sun, Nov 14, 2021 at 7:18 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 11/14/21 14:25, Carlo Arenas wrote:
> > using idx_t instead of size_t should be fine (if only halves the max
> > size of the objects managed), but I am concerned that assuming
> > PCRE2_SIZE_MAX is always equivalent to SIZE_MAX (as done in patch 4)
> > might be risky (at least without a comment), and considering that is
> > part of the API anyway might be better if kept as PCRE2_SIZE_MAX IMHO.
>
> This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
> forward compatibility to a potential future version of PCRE2 that may
> define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier
> PCRE2_SIZE is hardwired to size_t, so there is only one plausible
> default for PCRE2_SIZE_MAX, namely SIZE_MAX.

which is why I mention that it will be better to at least document
that in a comment, as it was done everywhere else where assumptions
made in the pcre library were used.

Interestingly enough this discussion gave me an idea for a feature in
PCRE where that value will be set to something else than SIZE_MAX and
that might break grep in a future release if it lands.

> > As I mentioned before, PCRE matches the Perl definition as mentioned
> > before in an early draft that also had this change reversed.
>
> I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the
> pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't
> correspond to the intuitive meaning of "match words", and it doesn't
> correspond to how grep -w behaves for any grep that I know of.

It all comes from what perl defines[1] as a word character (\w), and
that I presume came from the fact it was used earlier for text
processing and most of that text was computer code (which currently
can include unicode).

> Which "early draft" are you talking about? This appears to be merely a
> bug in libpcre2's documentation and implementation.

https://lists.gnu.org/archive/html/grep-devel/2021-10/msg00000.html

> > I would suggest instead that -P should also follow perl convention
> > instead when used together with -w, but maybe that is something that a
> > -P feature flag could enable or disable as needed?
>
> I can't imagine anybody intuitively saying in an English locale that
> "%%" is a word in the string "aa%%aa". PCRE2 is broken, that's all. If a
> user really wants PCRE2's buggy interpretation, they can simply surround
> their regexp with "\b(?:" and ")\b" and not use -w; so there's no need
> to have a different flag for pcre2grep's bizarre interpretation of -w.
>
> Here's another reason why pcre2grep -w is obviously busted:
>
> $ pcre2grep -w ',' <<'EOF'
>  > a,a
>  > a, a
>  > a,
>  > EOF
> a,a
>
> Why is "," a word in the first input line, but not in the second or
> third? pcre2grep is simply wrong here.

that is indeed likely a "bug", but is one that PCRE shares with perl
(and at least JavaScript, Java, Net, Python and Ruby) :

  $ echo 'a,a' | perl -nle '/\b(,)\b/ and print "$1"'
  ,

but it is also because the feature is not being used correctly as ','
is not a word and therefore logically none of them should match

> > Note that "word" definition also has a different meaning in a post
> > Unicode world
>
> Yes, but that's an independent issue.

for the '%' example was not, as it was the fact that it has a Unicode
property indicating is a character used for punctuation as the reason
why it was not matched as expected by grep.

Carlo

[1] https://perldoc.perl.org/perlrebackslash#%5Cb%7B%7D,-%5Cb,-%5CB%7B%7D,-%5CB




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 15 Nov 2021 16:18:01 GMT) Full text and rfc822 format available.

Message #84 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Mon, 15 Nov 2021 08:17:02 -0800
On 11/14/21 20:44, Carlo Arenas wrote:

>> This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
>> forward compatibility to a potential future version of PCRE2 that may
>> define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier
>> PCRE2_SIZE is hardwired to size_t, so there is only one plausible
>> default for PCRE2_SIZE_MAX, namely SIZE_MAX.
> 
> which is why I mention that it will be better to at least document
> that in a comment, as it was done everywhere else where assumptions
> made in the pcre library were used.

What sort of documentation did you have in mind, exactly?

> Interestingly enough this discussion gave me an idea for a feature in
> PCRE where that value will be set to something else than SIZE_MAX and
> that might break grep in a future release if it lands.

How would it break grep? I'm not following. If a future version of PCRE 
defines PCRE_SIZE_MAX to something other than SIZE_MAX, grep should work 
just fine because it will use what PCRE defines.

>>> As I mentioned before, PCRE matches the Perl definition as mentioned
>>> before in an early draft that also had this change reversed.
>>
>> I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the
>> pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't
>> correspond to the intuitive meaning of "match words", and it doesn't
>> correspond to how grep -w behaves for any grep that I know of.
> 
> It all comes from what perl defines[1] as a word character (\w)

No it doesn't. It comes merely from how PCRE2 documents and implements 
PCRE2_EXTRA_MATCH_WORD.

Perl's definition of \w does not determine how PCRE2_EXTRA_MATCH_WORD 
behaves; it determines only which characters are word characters and 
which are not. As things stand, PCRE2_EXTRA_MATCH_WORD is bizarre 
because it causes 'pcre2grep -w' to match strings consisting entirely of 
non-word (i.e., non-\w) characters. This cannot be right.


> that is indeed likely a "bug", but is one that PCRE shares with perl
> (and at least JavaScript, Java, Net, Python and Ruby) :
> 
>    $ echo 'a,a' | perl -nle '/\b(,)\b/ and print "$1"'

That is a different issue. \b matches word *boundaries*; it's different 
from -w which is supposed to match *words*. There is indeed a word 
boundary between "a" (a \w character) and "," (a non-\w character), and 
another word boundary between "," and the following "a", but this 
doesn't mean "," is a word.

Attempting to implement -w with \b is a mistake. That mistake is made in 
PCRE2 and the mistake should be corrected. PCRE2 should implement 
PCRE2_EXTRA_MATCH_WORD the same way that grep -P implements -w.




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 15 Nov 2021 20:50:02 GMT) Full text and rfc822 format available.

Message #87 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Mon, 15 Nov 2021 12:49:17 -0800
[Message part 1 (text/plain, inline)]
On Mon, Nov 15, 2021 at 08:17:02AM -0800, Paul Eggert wrote:
> On 11/14/21 20:44, Carlo Arenas wrote:
> 
> > > This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
> > > forward compatibility to a potential future version of PCRE2 that may
> > > define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier
> > > PCRE2_SIZE is hardwired to size_t, so there is only one plausible
> > > default for PCRE2_SIZE_MAX, namely SIZE_MAX.
> > 
> > which is why I mention that it will be better to at least document
> > that in a comment, as it was done everywhere else where assumptions
> > made in the pcre library were used.
> 
> What sort of documentation did you have in mind, exactly?

Apologies, I realize it is difficult to talk about code in abstract when
not inlined, but I think it will better addressed by "fixing" it as shown
in the attached patch.

> > Interestingly enough this discussion gave me an idea for a feature in
> > PCRE where that value will be set to something else than SIZE_MAX and
> > that might break grep in a future release if it lands.
> 
> How would it break grep? I'm not following. If a future version of PCRE
> defines PCRE_SIZE_MAX to something other than SIZE_MAX, grep should work
> just fine because it will use what PCRE defines.

only if we always use PCRE2_SIZE_MAX

Carlo
[0001-pcre-avoid-using-SIZE_MAX.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Mon, 15 Nov 2021 23:25:01 GMT) Full text and rfc822 format available.

Message #90 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Mon, 15 Nov 2021 15:24:41 -0800
On 11/15/21 12:49, Carlo Marcelo Arenas Belón wrote:

> Apologies, I realize it is difficult to talk about code in abstract when
> not inlined, but I think it will better addressed by "fixing" it as shown
> in the attached patch.

That patch isn't right, because the relevant code inside libpcre2 uses 
size_t, not PCRE2_SIZE and that means our hacky workaround should use 
SIZE_MAX, not PCRE2_SIZE_MAX, so that it's consistent with libpcre2's 
internals.

Of course this is all academic on all existing platforms, since IDX_MAX 
is way less than SIZE_MAX.




Information forwarded to bug-grep <at> gnu.org:
bug#47264; Package grep. (Tue, 16 Nov 2021 00:12:01 GMT) Full text and rfc822 format available.

Message #93 received at 47264 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264 <at> debbugs.gnu.org
Subject: Re: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Mon, 15 Nov 2021 16:11:19 -0800
[Message part 1 (text/plain, inline)]
On Mon, Nov 15, 2021 at 03:24:41PM -0800, Paul Eggert wrote:
> On 11/15/21 12:49, Carlo Marcelo Arenas Belón wrote:
> 
> > Apologies, I realize it is difficult to talk about code in abstract when
> > not inlined, but I think it will better addressed by "fixing" it as shown
> > in the attached patch.
> 
> That patch isn't right, because the relevant code inside libpcre2 uses
> size_t, not PCRE2_SIZE and that means our hacky workaround should use
> SIZE_MAX, not PCRE2_SIZE_MAX, so that it's consistent with libpcre2's
> internals.

You are correct, got confused by the changes going to PCRE2_SIZE -> idx, and
should had check, thanks for catching this.

Hopefully the next patch (which I was planning to do with some of the
cleanup you beat me to), will at least clear some possible future valgrind
complain.

> Of course this is all academic on all existing platforms, since IDX_MAX is
> way less than SIZE_MAX.

Agree, and frankly I still think it will be better if IDX_MAX would be used
unconditionally instead.

Carlo
[0001-pcre-only-make-tables-when-needed-and-make-them-alwa.patch (text/plain, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 14 Dec 2021 12:24:06 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 105 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.