GNU bug report logs - #62267
grep-3.9 bug: \d matches multibyte digits

Package: grep

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Sun, 19 Mar 2023 00:07:01 UTC

Severity: normal

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 62267 in the body.
You can then email your comments to 62267 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#62267; Package grep. (Sun, 19 Mar 2023 00:07:01 GMT)

Acknowledgement sent to Jim Meyering <jim <at> meyering.net>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 19 Mar 2023 00:07:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: bug-grep <at> gnu.org
Subject: grep-3.9 bug: \d matches multibyte digits
Date: Sat, 18 Mar 2023 17:06:37 -0700
[Message part 1 (text/plain, inline)]
I was not happy to discover that with grep-3.9 and -P,
\d can match multibyte digits like the Arabic ones:

  $ LC_ALL=en_US.UTF-8 grep -Po '\d+' <<< '٠١٢٣٤٥٦٧٨٩'
  ٠١٢٣٤٥٦٧٨٩

grep -P has never before done that.
Of course, in the C/POSIX locale, there is no such match:

  $ LC_ALL=C grep -Po '\d+' <<< '٠١٢٣٤٥٦٧٨٩'
  [Exit 1]

TL;DR, with the attached fix, grep preprocesses each affected regexp,
changing each eligible "\d" to "[0-9]". Consider this a short-term fix.
Longer term (subject to pcre2 releases), we may instead simply add a
"(?aD)" prefix.  If you really want to match non-ASCII digits, use \p{Nd}.

For background, see the PCRE2 documentation:

  https://www.pcre.org/current/doc/html/pcre2pattern.html
  https://www.pcre.org/current/doc/html/pcre2syntax.html

which say this:

  By default, \d, \s, and \w match only ASCII characters, even in UTF-8
  mode or in the 16-bit and 32-bit libraries. However, if locale-specific
  matching is happening, \s and \w may also match characters with code
  points in the range 128-255. If the PCRE2_UCP option is set, the behaviour
  of these escape sequences is changed to use Unicode properties and they
  match many more characters.

Per upstream pcre2-10.40-112-g6277357, (?aD) does what we want:

  PCRE2_EXTRA_ASCII_BSD: This option forces \d to match only ASCII digits,
  even when PCRE2_UCP is set. It can be changed within a pattern by
  means of the (?aD) option setting.

I used pcre2grep (built from master) to demonstrate how we may eventually use "(?aD)" under the covers:

  $ LC_ALL=en_US.UTF-8 ./pcre2grep --color -u '(?aD)\d' <<< '٠١٢٣٤٥٦٧٨٩'
  [Exit 1]
  $ LC_ALL=en_US.UTF-8 ./pcre2grep --color -u '^\d+$' <<< '٠١٢٣٤٥٦٧٨٩'
  ٠١٢٣٤٥٦٧٨٩

For the record, https://github.com/PCRE2Project/pcre2 currently declares
10.42 to be the latest, while there's a commit suggesting it's 10.43.
The difference is important: 10.43 supports (?aD), while 10.42 does not.

Incidentally, you can demonstrate this in python3, too:

  $ LC_ALL=en_US.UTF-8 python3 \
    -c "import re; print(re.match(r'\d+', '٠١٢٣٤٥٦٧٨٩'))"
  <re.Match object; span=(0, 10), match='٠١٢٣٤٥٦٧٨٩'>

Use flags=re.ASCII to get the often-desired behavior:

  $ LC_ALL=en_US.UTF-8 python3 \
     -c "import re; print(re.match(r'\d+', '٠١٢٣٤٥٦٧٨٩', flags=re.ASCII))"
  None
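
The same contrast can be checked from a script. The following is a minimal
sketch, not part of the original report; the variable names are illustrative.
It uses only Python's standard re and unicodedata modules and says nothing
about how grep or PCRE2 behave internally.

```python
import re
import unicodedata

# The ten Arabic-Indic digits from the report (U+0660..U+0669).
arabic = "".join(chr(cp) for cp in range(0x0660, 0x066A))

# Ten code points, each encoded as two bytes in UTF-8.
n_chars = len(arabic)
n_bytes = len(arabic.encode("utf-8"))

# By default, \d in a str pattern matches any Unicode decimal digit.
unicode_match = re.fullmatch(r"\d+", arabic)

# With re.ASCII, or with an explicit [0-9] class, only ASCII digits match.
ascii_flag_match = re.fullmatch(r"\d+", arabic, flags=re.ASCII)
bracket_match = re.fullmatch(r"[0-9]+", arabic)

# Every character in the sample is in Unicode general category Nd,
# the category that \p{Nd} selects in PCRE2.
categories = {unicodedata.category(ch) for ch in arabic}

print(n_chars, n_bytes)
print(unicode_match is not None, ascii_flag_match, bracket_match)
print(categories)
```

The explicit [0-9] class gives the ASCII-only result without any flag, which
is the effect the attached patch aims for by rewriting \d.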

This is cause for a new snapshot today and soon thereafter,
the release of grep-3.10.

[grep-multibyte-digits.patch (text/x-patch, inline)]
From 0daefc8c5659e79149a650d97ca12b49ad5e6548 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering <at> fb.com>
Date: Sat, 18 Mar 2023 08:28:36 -0700
Subject: [PATCH] grep: -P (--perl-regexp) \d: match only ASCII digits

Prior to grep-3.9, the PCRE matcher had always treated \d just
like [0-9]. grep-3.9's fix for \w and \b mistakenly relaxed \d
to also match multibyte digits.
* src/grep.c (P_MATCHER_INDEX): Define enum.
(pcre_pattern_expand_backslash_d): New function.
(main): Call it for -P.
* NEWS (Bug fixes): Mention it.
* doc/grep.texi: Document it: with -P, \d matches only ASCII digits.
Provide a PCRE documentation URL and an example of how
to use (?s) with -z.
* tests/pcre-ascii-digits: New test.
* tests/Makefile.am (TESTS): Add that file name.
---
 NEWS                    | 10 +++++
 doc/grep.texi           | 31 ++++++++++++++++
 src/grep.c              | 82 ++++++++++++++++++++++++++++++++++++++++-
 tests/Makefile.am       |  1 +
 tests/pcre-ascii-digits | 31 ++++++++++++++++
 5 files changed, 154 insertions(+), 1 deletion(-)
 create mode 100755 tests/pcre-ascii-digits

diff --git a/NEWS b/NEWS
index 803e14b..a24cebd 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,16 @@ GNU grep NEWS                                    -*- outline -*-

 * Noteworthy changes in release ?.? (????-??-??) [?]

+** Bug fixes
+
+  With -P, \d now matches only ASCII digits, regardless of PCRE
+  options/modes. The changes in grep-3.9 to make \b and \w work
+  properly had the undesirable side effect of making \d also match
+  e.g., the Arabic digits: ٠١٢٣٤٥٦٧٨٩.  With grep-3.9, -P '\d+'
+  would match that ten-digit (20-byte) string. Now, to match such
+  a digit, you would use \p{Nd}.
+  [bug introduced in grep 3.9]
+

 * Noteworthy changes in release 3.9 (2023-03-05) [stable]

diff --git a/doc/grep.texi b/doc/grep.texi
index 621beaf..eaad6e1 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1141,6 +1141,37 @@ combined with the @option{-z} (@option{--null-data}) option, and note that
 @samp{grep@ -P} may warn of unimplemented features.
 @xref{Other Options}.

+For documentation, refer to @url{https://www.pcre.org/}, with these caveats:
+@itemize
+@item
+@samp{\d} always matches only the ten ASCII digits, regardless of locale or
+in-regexp directives like @samp{(?aD)}.
+Use @samp{\p@{Nd@}} if you require to match non-ASCII digits.
+Once pcre2 support for @samp{(?aD)} is widespread enough,
+we expect to make that the default, so it will be overridable.
+@c Using pcre2 git commit pcre2-10.40-112-g6277357, this demonstrates how
+@c we'll prefix with (?aD) to make \d's ASCII-only behavior the default:
+@c $ LC_ALL=en_US.UTF-8 ./pcre2grep -u '(?aD)^\d+' <<< '٠١٢٣٤٥٦٧٨٩'
+@c [Exit 1]
+@c $ LC_ALL=en_US.UTF-8 ./pcre2grep -u '^\d+' <<< '٠١٢٣٤٥٦٧٨٩'
+@c ٠١٢٣٤٥٦٧٨٩
+
+@item
+By default, @command{grep} applies each regexp to a line at a time,
+so the @samp{(?s)} directive (making @samp{.} match line breaks)
+is generally ineffective.
+However, with @option{-z} (@option{--null-data}) it can work:
+@example
+$ printf 'a\nb\n' |grep -zP '(?s)a.b'
+a
+b
+@end example
+But beware: with the @option{-z} (@option{--null-data}) and a file
+containing no NUL byte, grep must read the entire file into memory
+before processing any of it.
+Thus, it will exhaust memory and fail for some large files.
+@end itemize
+
 @end table


diff --git a/src/grep.c b/src/grep.c
index 7547b64..6ba881e 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -2089,7 +2089,8 @@ static struct
 #endif
 };
 /* Keep these in sync with the 'matchers' table.  */
-enum { E_MATCHER_INDEX = 1, F_MATCHER_INDEX = 2, G_MATCHER_INDEX = 0 };
+enum { E_MATCHER_INDEX = 1, F_MATCHER_INDEX = 2, G_MATCHER_INDEX = 0,
+       P_MATCHER_INDEX = 6 };

 /* Return the index of the matcher corresponding to M if available.
    MATCHER is the index of the previous matcher, or -1 if none.
@@ -2378,6 +2379,80 @@ fgrep_to_grep_pattern (char **keys_p, idx_t *len_p)
   *len_p = p - new_keys;
 }

+/* Replace each \d in *KEYS_P with [0-9], to ensure that \d matches only ASCII
+   digits.  Now that we enable PCRE2_UCP for pcre regexps, \d would otherwise
+   match non-ASCII digits in some locales.  Use \p{Nd} if you require to match
+   those.  */
+static void
+pcre_pattern_expand_backslash_d (char **keys_p, idx_t *len_p)
+{
+  idx_t len = *len_p;
+  char *keys = *keys_p;
+  mbstate_t mb_state = { 0 };
+  char *new_keys = xnmalloc (len / 2 + 1, 5);
+  char *p = new_keys;
+  bool prev_backslash = false;
+
+  for (ptrdiff_t n; len; keys += n, len -= n)
+    {
+      n = mb_clen (keys, len, &mb_state);
+      switch (n)
+        {
+        case -2:
+          n = len;
+          FALLTHROUGH;
+        default:
+          if (prev_backslash)
+            {
+              prev_backslash = false;
+              *p++ = '\\';
+            }
+          p = mempcpy (p, keys, n);
+          break;
+
+        case -1:
+          if (prev_backslash)
+            {
+              prev_backslash = false;
+              *p++ = '\\';
+            }
+          memset (&mb_state, 0, sizeof mb_state);
+          n = 1;
+          FALLTHROUGH;
+        case 1:
+          if (prev_backslash)
+            {
+              prev_backslash = false;
+              switch (*keys)
+                {
+                case 'd':
+                  p = mempcpy (p, "[0-9]", 5);
+                  break;
+                default:
+                  *p++ = '\\';
+                  *p++ = *keys;
+                  break;
+                }
+            }
+          else
+            {
+              if (*keys == '\\')
+                prev_backslash = true;
+              else
+                *p++ = *keys;
+            }
+          break;
+        }
+    }
+
+  if (prev_backslash)
+    *p++ = '\\';
+  *p = '\n';
+  free (*keys_p);
+  *keys_p = new_keys;
+  *len_p = p - new_keys;
+}
+
 /* If it is easy, convert the MATCHER-style patterns KEYS (of size
    *LEN_P) to -F style, update *LEN_P to a possibly-smaller value, and
    return F_MATCHER_INDEX.  If not, leave KEYS and *LEN_P alone and
@@ -2970,6 +3045,11 @@ main (int argc, char **argv)
         matcher = try_fgrep_pattern (matcher, keys, &keycc);
     }

+  /* If -P, replace each \d with [0-9].
+     Those who want to match non-ASCII digits must use \p{Nd}.  */
+  if (matcher == P_MATCHER_INDEX)
+    pcre_pattern_expand_backslash_d (&keys, &keycc);
+
   execute = matchers[matcher].execute;
   compiled_pattern =
     matchers[matcher].compile (keys, keycc, matchers[matcher].syntax,
diff --git a/tests/Makefile.am b/tests/Makefile.am
index a47cf5c..f195c8d 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -139,6 +139,7 @@ TESTS =						\
   options					\
   pcre						\
   pcre-abort					\
+  pcre-ascii-digits				\
   pcre-context					\
   pcre-count					\
   pcre-infloop					\
diff --git a/tests/pcre-ascii-digits b/tests/pcre-ascii-digits
new file mode 100755
index 0000000..ae713f7
--- /dev/null
+++ b/tests/pcre-ascii-digits
@@ -0,0 +1,31 @@
+#!/bin/sh
+# Ensure that grep -P's \d matches only the 10 ASCII digits.
+# With grep-3.9, \d would match e.g., the multibyte Arabic digits.
+#
+# Copyright (C) 2023 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_en_utf8_locale_
+LC_ALL=en_US.UTF-8
+export LC_ALL
+require_pcre_
+
+echo . | grep -qP '(*UTF).' 2>/dev/null \
+  || skip_ 'PCRE unicode support is compiled out'
+
+fail=0
+
+# $ printf %s ٠١٢٣٤٥٦٧٨٩|od -An -to1 -w10 |sed 's/ /\\/g'; : arabic digits
+# \331\240\331\241\331\242\331\243\331\244
+# \331\245\331\246\331\247\331\250\331\251
+printf '\331\240\331\241\331\242\331\243\331\244' > in || framework_failure_
+printf '\331\245\331\246\331\247\331\250\331\251' >> in || framework_failure_
+
+grep -P '\d+' in > out && fail=1
+compare /dev/null out || fail=1
+
+Exit $fail
-- 
2.40.0.rc2
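
For readers who want to experiment with the rewriting idea outside of C, here
is a rough Python model of the loop in pcre_pattern_expand_backslash_d. It is
only a sketch under simplifying assumptions: it walks code points of a str
rather than bytes with mb_clen, it omits the trailing newline, and the name
expand_backslash_d is hypothetical. Like the loop it models, it does not track
whether \d appears inside a bracket expression.

```python
def expand_backslash_d(pattern: str) -> str:
    """Replace each backslash-escaped 'd' with [0-9]; copy everything else."""
    out = []
    prev_backslash = False
    for ch in pattern:
        if prev_backslash:
            prev_backslash = False
            if ch == "d":
                out.append("[0-9]")
            else:
                # Any other escape, including an escaped backslash,
                # is passed through unchanged.
                out.append("\\" + ch)
        elif ch == "\\":
            # Remember the backslash; decide what to emit on the next character.
            prev_backslash = True
        else:
            out.append(ch)
    if prev_backslash:
        # A trailing lone backslash is preserved.
        out.append("\\")
    return "".join(out)


print(expand_backslash_d(r"\d+"))      # a plain \d is rewritten
print(expand_backslash_d(r"\\d"))      # an escaped backslash followed by d is not
print(expand_backslash_d(r"a\wb\d"))   # other escapes are left alone
```

The prev_backslash flag is what keeps an escaped backslash followed by a
literal d from being mistaken for \d.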


Information forwarded to bug-grep <at> gnu.org:
bug#62267; Package grep. (Sun, 19 Mar 2023 00:40:01 GMT)

Message #8 received at 62267 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: 62267 <at> debbugs.gnu.org
Subject: Re: bug#62267: grep-3.9 bug: \d matches multibyte digits
Date: Sat, 18 Mar 2023 17:39:11 -0700
Thanks for looking into this. A couple of questions.

First, some documentation issues. Why is PCRE2 incompatible with Perl on 
this issue? Are there other areas where the two are incompatible? Are 
these incompatibilities documented anywhere? Is the goal for 'grep -P' 
to be compatible with Perl, not with PCRE2?

Second, although that patch focuses on \d, doesn't \D have a similar 
problem and shouldn't it be fixed too?

(OK, that was more than two questions. :-)




Information forwarded to bug-grep <at> gnu.org:
bug#62267; Package grep. (Sun, 19 Mar 2023 05:56:01 GMT)

Message #11 received at 62267 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 62267 <at> debbugs.gnu.org
Subject: Re: bug#62267: grep-3.9 bug: \d matches multibyte digits
Date: Sat, 18 Mar 2023 22:54:42 -0700
On Sat, Mar 18, 2023 at 5:39 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Thanks for looking into this. A couple of questions.
>
> First, some documentation issues. Why is PCRE2 incompatible with Perl on
> this issue? Are there other areas where the two are incompatible?

To be honest, I was not too concerned about keeping up with Perl
and am not worried about divergence, but admit I do not like the
implication, given the name of the option: --perl-regexp. It's always
been "pcre-regexp" in spirit. I suppose we'll want to document that,
eventually.

> Are
> these incompatibilities documented anywhere? Is the goal for 'grep -P'
> to be compatible with Perl, not with PCRE2?

Doesn't Perl have the same issue?
That's why the /a and /aa match modifiers were added.

> Second, although that patch focuses on \d, doesn't \D have a similar
> problem and shouldn't it be fixed too?

Good point about \D. Will adjust.




Information forwarded to bug-grep <at> gnu.org:
bug#62267; Package grep. (Sun, 19 Mar 2023 06:34:02 GMT)

Message #14 received at 62267 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 62267 <at> debbugs.gnu.org
Subject: Re: bug#62267: grep-3.9 bug: \d matches multibyte digits
Date: Sat, 18 Mar 2023 23:33:33 -0700
[Message part 1 (text/plain, inline)]
On Sat, Mar 18, 2023 at 10:54 PM Jim Meyering <jim <at> meyering.net> wrote:
> On Sat, Mar 18, 2023 at 5:39 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > Thanks for looking into this. A couple of questions.
> >
> > First, some documentation issues. Why is PCRE2 incompatible with Perl on
> > this issue? Are there other areas where the two are incompatible?
>
> To be honest, I was not too concerned about keeping up with Perl
> and am not worried about divergence, but admit I do not like the
> implication, given the name of the option: --perl-regexp. It's always
> been "pcre-regexp" in spirit. I suppose we'll want to document that,
> eventually.
>
> > Are
> > these incompatibilities documented anywhere? Is the goal for 'grep -P'
> > to be compatible with Perl, not with PCRE2?
>
> Doesn't Perl have the same issue?
> That's why the /a and /aa match modifiers were added.
>
> > Second, although that patch focuses on \d, doesn't \D have a similar
> > problem and shouldn't it be fixed too?
>
> Good point about \D. Will adjust.

Here's an additional patch to handle \D. I've only just written it, so
it's probably wrong or incomplete somewhere. I'll review it properly
and probably improve it (could certainly add more tests in this area)
tomorrow.

By the way, have you ever used \D? I think I have not.
[grep-multibyte-D.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62267; Package grep. (Sun, 19 Mar 2023 08:29:01 GMT)

Message #17 received at 62267 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: 62267 <at> debbugs.gnu.org
Subject: Re: bug#62267: grep-3.9 bug: \d matches multibyte digits
Date: Sun, 19 Mar 2023 01:28:38 -0700
[Message part 1 (text/plain, inline)]
On 2023-03-18 23:33, Jim Meyering wrote:
> By the way, have you ever used \D? I think I have not.

No, I'm not much of a Perl user these days (last seriously used it in 
the 1990s...).

> -  char *new_keys = xnmalloc (len / 2 + 1, 5);
> +  char *new_keys = xnmalloc (len / 2 + 1, 6);

This could be xnmalloc (len + 1, 3).

Or if you want to show the work, you can replace it with something like:

   int origlen = sizeof "\\D" - 1;
   int repllen = sizeof "[^0-9]" - 1;
   int expansion = repllen / origlen + (repllen % origlen != 0);
   char *new_keys = xnmalloc (len + 1, expansion);

(Isn't memory allocation fun? :-)
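
The sizing argument above can be checked numerically. This is a small Python
sketch, not code from either patch; the helper names expansion_factor and
worst_case_output_len are hypothetical. It assumes the worst case is a pattern
made entirely of two-byte escapes, each replaced by the longer text, plus one
extra byte for the terminator.

```python
def expansion_factor(original: str, replacement: str) -> int:
    """Ceiling of len(replacement) / len(original), as in the C arithmetic."""
    origlen = len(original)
    repllen = len(replacement)
    return repllen // origlen + (repllen % origlen != 0)


def worst_case_output_len(pattern_len: int, original: str, replacement: str) -> int:
    """Longest output when every possible escape is replaced, rest copied."""
    full, rest = divmod(pattern_len, len(original))
    return full * len(replacement) + rest


factor_D = expansion_factor("\\D", "[^0-9]")   # 6 bytes replace 2
factor_d = expansion_factor("\\d", "[0-9]")    # 5 bytes replace 2, rounded up

# The (len + 1) * factor allocation covers the worst case plus one
# terminating byte for every pattern length tried here.
fits = all(
    worst_case_output_len(n, "\\D", "[^0-9]") + 1 <= (n + 1) * factor_D
    for n in range(0, 1000)
)

print(factor_D, factor_d, fits)
```

Rounding the ratio up is what makes the same factor safe for both the
five-byte and the six-byte replacement.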


> Doesn't Perl have the same issue?

Oh, you're right. Not being a Perl expert, all I did was run this:

  echo '٠١٢٣٤٥٦٧٨٩' | perl -ne 'print if /\d/'

and I observed no output. However, I now see that I need to use perl's 
-C option too, to get the kind of regular-expression behavior that plain 
grep has.


Looking at the source code again, how about if we move the PCRE-specific 
changes from src/grep.c to src/pcresearch.c which is where it really 
belongs, and more importantly use the bleeding-edge 
PCRE2_EXTRA_ASCII_BSD macro if available?

Something like the attached patch, say. This patch doesn't take your \D 
fixes (or the above suggestions) into account.
[0001-grep-forward-port-to-PCRE2-10.43.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62267; Package grep. (Sun, 19 Mar 2023 08:56:02 GMT)

Message #20 received at 62267 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: 62267 <at> debbugs.gnu.org
Subject: Re: bug#62267: grep-3.9 bug: \d matches multibyte digits
Date: Sun, 19 Mar 2023 01:54:53 -0700
[Message part 1 (text/plain, inline)]
On 2023-03-19 01:28, Paul Eggert wrote:
> Looking at the source code again, how about if we move the PCRE-specific 
> changes from src/grep.c to src/pcresearch.c which is where it really 
> belongs, and more importantly use the bleeding-edge 
> PCRE2_EXTRA_ASCII_BSD macro if available?
> 
> Something like the attached patch, say. This patch doesn't take your \D 
> fixes (or the above suggestions) into account.

Oops, that patch assumed match_lines. Also, it covered two topics in the 
doc fix. I installed the obvious topic in the doc change, and removed 
the match_lines assumption. Revised patch attached; please ignore the 
patch of a half-hour ago.
[0001-grep-forward-port-to-PCRE2-10.43.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62267; Package grep. (Sun, 19 Mar 2023 16:56:02 GMT)

Message #23 received at 62267 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 62267 <at> debbugs.gnu.org
Subject: Re: bug#62267: grep-3.9 bug: \d matches multibyte digits
Date: Sun, 19 Mar 2023 09:54:49 -0700
On Sun, Mar 19, 2023 at 1:55 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> On 2023-03-19 01:28, Paul Eggert wrote:
> > Looking at the source code again, how about if we move the PCRE-specific
> > changes from src/grep.c to src/pcresearch.c which is where it really
> > belongs, and more importantly use the bleeding-edge
> > PCRE2_EXTRA_ASCII_BSD macro if available?
> >
> > Something like the attached patch, say. This patch doesn't take your \D
> > fixes (or the above suggestions) into account.
>
> Oops, that patch assumed match_lines. Also, it covered two topics in the
> doc fix. I installed the obvious topic in the doc change, and removed
> the match_lines assumption. Revised patch attached; please ignore the
> patch of a half-hour ago.

Thanks. It definitely belongs in pcresearch.c.
You're welcome to push that (or I will soon).
I've rebased my changes on top of it and am adding tests.




Information forwarded to bug-grep <at> gnu.org:
bug#62267; Package grep. (Sun, 19 Mar 2023 20:45:01 GMT)

Message #26 received at 62267 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 62267 <at> debbugs.gnu.org
Subject: Re: bug#62267: grep-3.9 bug: \d matches multibyte digits
Date: Sun, 19 Mar 2023 13:44:37 -0700
[Message part 1 (text/plain, inline)]
On Sun, Mar 19, 2023 at 9:54 AM Jim Meyering <jim <at> meyering.net> wrote:
> On Sun, Mar 19, 2023 at 1:55 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> >
> > On 2023-03-19 01:28, Paul Eggert wrote:
> > > Looking at the source code again, how about if we move the PCRE-specific
> > > changes from src/grep.c to src/pcresearch.c which is where it really
> > > belongs, and more importantly use the bleeding-edge
> > > PCRE2_EXTRA_ASCII_BSD macro if available?
> > >
> > > Something like the attached patch, say. This patch doesn't take your \D
> > > fixes (or the above suggestions) into account.
> >
> > Oops, that patch assumed match_lines. Also, it covered two topics in the
> > doc fix. I installed the obvious topic in the doc change, and removed
> > the match_lines assumption. Revised patch attached; please ignore the
> > patch of a half-hour ago.
>
> Thanks. It definitely belongs in pcresearch.c.
> You're welcome to push that (or I will soon).
> I've rebased my changes on top of it and am adding tests.

I've pushed your change along with the attached.
I'll probably create another snapshot today.
[grep-backslash-D.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62267; Package grep. (Sun, 19 Mar 2023 23:13:02 GMT)

Message #29 received at 62267 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: 62267 <at> debbugs.gnu.org, Gnulib bugs <bug-gnulib <at> gnu.org>
Subject: Re: bug#62267: grep-3.9 bug: \d matches multibyte digits
Date: Sun, 19 Mar 2023 16:11:56 -0700
[Message part 1 (text/plain, inline)]
On 2023-03-19 13:44, Jim Meyering wrote:
> I've pushed your change along with the attached.
> I'll probably create another snapshot today.

Thanks. I also installed a minor dfa.c change in Gnulib yesterday to 
pacify Oracle Solaris Studio. No big deal since 'grep' builds OK anyway.

I also ran into a weird issue with test-select on Fedora 37 x86-64. It 
appears to be timing dependent and usually doesn't happen. I can't 
reproduce under strace. This is another Gnulib thing and not relevant to 
grep (other than people might report test failures to bug-grep).

I installed into Gnulib the attached patch which shouldn't hurt but 
which I don't know fixes the bug.
[gnulib.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62267; Package grep. (Sun, 19 Mar 2023 23:19:01 GMT)

Message #32 received at 62267 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 62267 <at> debbugs.gnu.org, Gnulib bugs <bug-gnulib <at> gnu.org>
Subject: Re: bug#62267: grep-3.9 bug: \d matches multibyte digits
Date: Sun, 19 Mar 2023 16:17:54 -0700
On Sun, Mar 19, 2023 at 4:12 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 2023-03-19 13:44, Jim Meyering wrote:
> > I've pushed your change along with the attached.
> > I'll probably create another snapshot today.
>
> Thanks. I also installed a minor dfa.c change in Gnulib yesterday to
> pacify Oracle Solaris Studio. No big deal since 'grep' builds OK anyway.
>
> I also ran into a weird issue with test-select on Fedora 37 x86-64. It
> appears to be timing dependent and usually doesn't happen. I can't
> reproduce under strace. This is another Gnulib thing and not relevant to
> grep (other than people might report test failures to bug-grep).
>
> I installed into Gnulib the attached patch which shouldn't hurt but
> which I don't know fixes the bug.

Oh! I must have missed getting the latter by bare minutes.
I've just published another snapshot, which includes the dfa.c change
but not the select one. We'll get it for the release of 3.10.




bug closed, send any further explanations to 62267 <at> debbugs.gnu.org and Jim Meyering <jim <at> meyering.net> Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Mon, 20 Mar 2023 05:31:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 17 Apr 2023 11:24:06 GMT)

This bug report was last modified 1 year and 2 days ago.
