GNU bug report logs - #24259
[PATCH 0/6] dfa: thread safety

Previous Next

Package: grep;

Reported by: Zev Weiss <zev <at> bewilderbeest.net>

Date: Thu, 18 Aug 2016 10:51:01 UTC

Severity: normal

Tags: patch

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24259 in the body.
You can then email your comments to 24259 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#24259; Package grep. (Thu, 18 Aug 2016 10:51:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Zev Weiss <zev <at> bewilderbeest.net>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Thu, 18 Aug 2016 10:51:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: bug-grep <at> gnu.org
Subject: [PATCH 0/6] dfa: thread safety
Date: Thu, 18 Aug 2016 05:50:13 -0500
Hello,

Re: Jim Meyering's proposal to move grep's dfa.c into gnulib, I was
hoping I might be able to sneak in the dfa-related parts of my
multithreading patch set before it happens.

These are all the patches from that set that touch src/dfa.[ch], and
are all oriented toward making dfa.c thread-safe (which seems like it
might be a desirable thing for librarification anyway).  Starting with
the fourth patch there are some slight changes to the external API, so
other projects using dfa.c may need some minor corresponding
adjustments as a result (but nothing, I think, too onerous).  The
first three should have no impact on any users, however.

(The bulk of the changes in these patches are fairly mechanical; for
some of them something like `git show --color-words' on a local branch
might make the review process slightly less tedious.)

Combined diffstat for these six follows.

Thanks,
Zev Weiss

 src/dfa.c             | 854 ++++++++++++++++++++++--------------------
 src/dfa.h             |  13 +-
 src/dfasearch.c       |   7 +-
 src/grep.c            |   2 +
 src/kwsearch.c        |   2 +-
 src/pcresearch.c      |   2 +-
 tests/dfa-match-aux.c |   4 +-
 7 files changed, 470 insertions(+), 414 deletions(-)





Information forwarded to bug-grep <at> gnu.org:
bug#24259; Package grep. (Thu, 18 Aug 2016 10:52:01 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: bug-grep <at> gnu.org
Cc: Zev Weiss <zev <at> bewilderbeest.net>
Subject: [PATCH 3/6] dfa: thread-safety: move parser state into struct dfa
Date: Thu, 18 Aug 2016 05:50:16 -0500
* src/dfa.c: move global variables holding parser state (`tok' and
`depth') into a new struct (`struct parser_state') and add an instance
of it to struct dfa.  All references to the globals are replaced by
references to the dfa struct's new member.
---
 src/dfa.c | 92 ++++++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 52 insertions(+), 40 deletions(-)

diff --git a/src/dfa.c b/src/dfa.c
index d100578..858bc55 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -351,6 +351,18 @@ struct lexer_state
                      MB_CUR_MAX > 1.  */
 };
 
+/* Recursive descent parser for regular expressions.  */
+
+struct parser_state
+{
+  token tok;               /* Lookahead token.  */
+  size_t depth;            /* Current depth of a hypothetical stack
+                              holding deferred productions.  This is
+                              used to determine the depth that will be
+                              required of the real stack later on in
+                              dfaanalyze.  */
+};
+
 /* A compiled regular expression.  */
 struct dfa
 {
@@ -362,6 +374,9 @@ struct dfa
   /* Scanner state */
   struct lexer_state lexstate;
 
+  /* Parser state */
+  struct parser_state parsestate;
+
   /* Fields filled by the parser.  */
   token *tokens;                /* Postfix parse array.  */
   size_t tindex;                /* Index for adding new tokens.  */
@@ -1584,15 +1599,6 @@ lex (struct dfa *dfa)
   return END;                   /* keeps pedantic compilers happy.  */
 }
 
-/* Recursive descent parser for regular expressions.  */
-
-static token tok;               /* Lookahead token.  */
-static size_t depth;            /* Current depth of a hypothetical stack
-                                   holding deferred productions.  This is
-                                   used to determine the depth that will be
-                                   required of the real stack later on in
-                                   dfaanalyze.  */
-
 static void
 addtok_mb (struct dfa *dfa, token t, int mbprop)
 {
@@ -1617,7 +1623,7 @@ addtok_mb (struct dfa *dfa, token t, int mbprop)
 
     case CAT:
     case OR:
-      --depth;
+      --dfa->parsestate.depth;
       break;
 
     case BACKREF:
@@ -1627,11 +1633,11 @@ addtok_mb (struct dfa *dfa, token t, int mbprop)
       ++dfa->nleaves;
       /* fallthrough */
     case EMPTY:
-      ++depth;
+      ++dfa->parsestate.depth;
       break;
     }
-  if (depth > dfa->depth)
-    dfa->depth = depth;
+  if (dfa->parsestate.depth > dfa->depth)
+    dfa->depth = dfa->parsestate.depth;
 }
 
 static void addtok_wc (struct dfa *dfa, wint_t wc);
@@ -1801,7 +1807,7 @@ add_utf8_anychar (struct dfa *dfa)
 static void
 atom (struct dfa *dfa)
 {
-  if (tok == WCHAR)
+  if (dfa->parsestate.tok == WCHAR)
     {
       if (dfa->lexstate.wctok == WEOF)
         addtok (dfa, BACKREF);
@@ -1822,9 +1828,9 @@ atom (struct dfa *dfa)
             }
         }
 
-      tok = lex (dfa);
+      dfa->parsestate.tok = lex (dfa);
     }
-  else if (tok == ANYCHAR && using_utf8 ())
+  else if (dfa->parsestate.tok == ANYCHAR && using_utf8 ())
     {
       /* For UTF-8 expand the period to a series of CSETs that define a valid
          UTF-8 character.  This avoids using the slow multibyte path.  I'm
@@ -1834,23 +1840,26 @@ atom (struct dfa *dfa)
          UTF-8: it is the most used, and the structure of the encoding
          makes the correctness more obvious.  */
       add_utf8_anychar (dfa);
-      tok = lex (dfa);
+      dfa->parsestate.tok = lex (dfa);
     }
-  else if ((tok >= 0 && tok < NOTCHAR) || tok >= CSET || tok == BACKREF
-           || tok == BEGLINE || tok == ENDLINE || tok == BEGWORD
-           || tok == ANYCHAR || tok == MBCSET
-           || tok == ENDWORD || tok == LIMWORD || tok == NOTLIMWORD)
+  else if ((dfa->parsestate.tok >= 0 && dfa->parsestate.tok < NOTCHAR)
+           || dfa->parsestate.tok >= CSET || dfa->parsestate.tok == BACKREF
+           || dfa->parsestate.tok == BEGLINE || dfa->parsestate.tok == ENDLINE
+           || dfa->parsestate.tok == BEGWORD || dfa->parsestate.tok == ANYCHAR
+           || dfa->parsestate.tok == MBCSET || dfa->parsestate.tok == ENDWORD
+           || dfa->parsestate.tok == LIMWORD
+           || dfa->parsestate.tok == NOTLIMWORD)
     {
-      addtok (dfa, tok);
-      tok = lex (dfa);
+      addtok (dfa, dfa->parsestate.tok);
+      dfa->parsestate.tok = lex (dfa);
     }
-  else if (tok == LPAREN)
+  else if (dfa->parsestate.tok == LPAREN)
     {
-      tok = lex (dfa);
+      dfa->parsestate.tok = lex (dfa);
       regexp (dfa);
-      if (tok != RPAREN)
+      if (dfa->parsestate.tok != RPAREN)
         dfaerror (_("unbalanced ("));
-      tok = lex (dfa);
+      dfa->parsestate.tok = lex (dfa);
     }
   else
     addtok (dfa, EMPTY);
@@ -1898,8 +1907,10 @@ closure (struct dfa *dfa)
   size_t tindex, ntokens;
 
   atom (dfa);
-  while (tok == QMARK || tok == STAR || tok == PLUS || tok == REPMN)
-    if (tok == REPMN && (dfa->lexstate.minrep || dfa->lexstate.maxrep))
+  while (dfa->parsestate.tok == QMARK || dfa->parsestate.tok == STAR
+         || dfa->parsestate.tok == PLUS || dfa->parsestate.tok == REPMN)
+    if (dfa->parsestate.tok == REPMN
+        && (dfa->lexstate.minrep || dfa->lexstate.maxrep))
       {
         ntokens = nsubtoks (dfa, dfa->tindex);
         tindex = dfa->tindex - ntokens;
@@ -1918,18 +1929,18 @@ closure (struct dfa *dfa)
             addtok (dfa, QMARK);
             addtok (dfa, CAT);
           }
-        tok = lex (dfa);
+        dfa->parsestate.tok = lex (dfa);
       }
-    else if (tok == REPMN)
+    else if (dfa->parsestate.tok == REPMN)
       {
         dfa->tindex -= nsubtoks (dfa, dfa->tindex);
-        tok = lex (dfa);
+        dfa->parsestate.tok = lex (dfa);
         closure (dfa);
       }
     else
       {
-        addtok (dfa, tok);
-        tok = lex (dfa);
+        addtok (dfa, dfa->parsestate.tok);
+        dfa->parsestate.tok = lex (dfa);
       }
 }
 
@@ -1937,7 +1948,8 @@ static void
 branch (struct dfa* dfa)
 {
   closure (dfa);
-  while (tok != RPAREN && tok != OR && tok >= 0)
+  while (dfa->parsestate.tok != RPAREN && dfa->parsestate.tok != OR
+         && dfa->parsestate.tok >= 0)
     {
       closure (dfa);
       addtok (dfa, CAT);
@@ -1948,9 +1960,9 @@ static void
 regexp (struct dfa *dfa)
 {
   branch (dfa);
-  while (tok == OR)
+  while (dfa->parsestate.tok == OR)
     {
-      tok = lex (dfa);
+      dfa->parsestate.tok = lex (dfa);
       branch (dfa);
       addtok (dfa, OR);
     }
@@ -1976,12 +1988,12 @@ dfaparse (char const *s, size_t len, struct dfa *d)
   if (!syntax_bits_set)
     dfaerror (_("no syntax specified"));
 
-  tok = lex (d);
-  depth = d->depth;
+  d->parsestate.tok = lex (d);
+  d->parsestate.depth = d->depth;
 
   regexp (d);
 
-  if (tok != END)
+  if (d->parsestate.tok != END)
     dfaerror (_("unbalanced )"));
 
   addtok (d, END - d->nregexps);
-- 
2.8.1





Information forwarded to bug-grep <at> gnu.org:
bug#24259; Package grep. (Thu, 18 Aug 2016 10:52:01 GMT) Full text and rfc822 format available.

Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: bug-grep <at> gnu.org
Cc: Zev Weiss <zev <at> bewilderbeest.net>
Subject: [PATCH 6/6] dfa: thread-safety: initialize mbrtowc_cache in dfa_init()
Date: Thu, 18 Aug 2016 05:50:19 -0500
* src/dfa.c (dfasyntax): Remove initialization of mbrtowc_cache.
(init_mbrtowc_cache): New function.
(dfa_init): Call init_mbrtowc_cache().
---
 src/dfa.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/src/dfa.c b/src/dfa.c
index 970b51f..e577393 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -766,6 +766,20 @@ dfa_using_utf8 (void)
   return using_utf8;
 }
 
+static void
+init_mbrtowc_cache (void)
+{
+  int i;
+  for (i = CHAR_MIN; i <= CHAR_MAX; ++i)
+    {
+      char c = i;
+      unsigned char uc = i;
+      mbstate_t s = { 0 };
+      wchar_t wc;
+      mbrtowc_cache[uc] = mbrtowc (&wc, &c, 1, &s) <= 1 ? wc : WEOF;
+    }
+}
+
 /* Entry point to set syntax options.  */
 void
 dfasyntax (struct dfa *dfa, reg_syntax_t bits, bool fold, unsigned char eol)
@@ -778,13 +792,9 @@ dfasyntax (struct dfa *dfa, reg_syntax_t bits, bool fold, unsigned char eol)
 
   for (i = CHAR_MIN; i <= CHAR_MAX; ++i)
     {
-      char c = i;
       unsigned char uc = i;
-      mbstate_t s = { 0 };
-      wchar_t wc;
-      mbrtowc_cache[uc] = mbrtowc (&wc, &c, 1, &s) <= 1 ? wc : WEOF;
 
-      /* Now that mbrtowc_cache[uc] is set, use it to calculate sbit.  */
+      /* Use mbrtowc_cache to calculate sbit.  */
       dfa->syntax.sbit[uc] = char_context (dfa, uc);
       switch (dfa->syntax.sbit[uc])
         {
@@ -4203,6 +4213,7 @@ dfa_init (void)
 {
   check_utf8 ();
   check_unibyte_c ();
+  init_mbrtowc_cache ();
 }
 
 /* vim:set shiftwidth=2: */
-- 
2.8.1





Information forwarded to bug-grep <at> gnu.org:
bug#24259; Package grep. (Thu, 18 Aug 2016 10:52:02 GMT) Full text and rfc822 format available.

Message #14 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: bug-grep <at> gnu.org
Cc: Zev Weiss <zev <at> bewilderbeest.net>
Subject: [PATCH 2/6] dfa: thread-safety: move lexer state into struct dfa
Date: Thu, 18 Aug 2016 05:50:15 -0500
* src/dfa.c: move global variables holding lexer state into a new
struct (`struct lexer_state') and add an instance of this struct to
struct dfa.  All references to the globals are replaced with
references to the dfa struct's new member.
---
 src/dfa.c | 294 ++++++++++++++++++++++++++++++++++----------------------------
 1 file changed, 161 insertions(+), 133 deletions(-)

diff --git a/src/dfa.c b/src/dfa.c
index 5bd2a92..d100578 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -328,6 +328,29 @@ struct mb_char_classes
   size_t nchars;
 };
 
+/* Lexical analyzer.  All the dross that deals with the obnoxious
+   GNU Regex syntax bits is located here.  The poor, suffering
+   reader is referred to the GNU Regex documentation for the
+   meaning of the @#%!@#%^!@ syntax bits.  */
+struct lexer_state
+{
+  char const *lexptr;      /* Pointer to next input character.  */
+  size_t lexleft;          /* Number of characters remaining.  */
+  token lasttok;           /* Previous token returned; initially END.  */
+  bool laststart;		/* We're separated from beginning or (,
+                       | only by zero-width characters.  */
+  size_t parens;           /* Count of outstanding left parens.  */
+  int minrep, maxrep;      /* Repeat counts for {m,n}.  */
+
+  int cur_mb_len;      /* Length of the multibyte representation of
+                          wctok.  */
+
+  wint_t wctok;		/* Wide character representation of the current
+                     multibyte character, or WEOF if there was
+                     an encoding error.  Used only if
+                     MB_CUR_MAX > 1.  */
+};
+
 /* A compiled regular expression.  */
 struct dfa
 {
@@ -336,6 +359,9 @@ struct dfa
   size_t cindex;                /* Index for adding new charclasses.  */
   size_t calloc;                /* Number of charclasses allocated.  */
 
+  /* Scanner state */
+  struct lexer_state lexstate;
+
   /* Fields filled by the parser.  */
   token *tokens;                /* Postfix parse array.  */
   size_t tindex;                /* Index for adding new tokens.  */
@@ -831,28 +857,6 @@ using_simple_locale (struct dfa *dfa)
     }
 }
 
-/* Lexical analyzer.  All the dross that deals with the obnoxious
-   GNU Regex syntax bits is located here.  The poor, suffering
-   reader is referred to the GNU Regex documentation for the
-   meaning of the @#%!@#%^!@ syntax bits.  */
-
-static char const *lexptr;      /* Pointer to next input character.  */
-static size_t lexleft;          /* Number of characters remaining.  */
-static token lasttok;           /* Previous token returned; initially END.  */
-static bool laststart;		/* We're separated from beginning or (,
-                                   | only by zero-width characters.  */
-static size_t parens;           /* Count of outstanding left parens.  */
-static int minrep, maxrep;      /* Repeat counts for {m,n}.  */
-
-static int cur_mb_len = 1;      /* Length of the multibyte representation of
-                                   wctok.  */
-
-static wint_t wctok;		/* Wide character representation of the current
-                                   multibyte character, or WEOF if there was
-                                   an encoding error.  Used only if
-                                   MB_CUR_MAX > 1.  */
-
-
 /* Fetch the next lexical input character.  Set C (of type int) to the
    next input byte, except set C to EOF if the input is a multibyte
    character of length greater than 1.  Set WC (of type wint_t) to the
@@ -862,22 +866,23 @@ static wint_t wctok;		/* Wide character representation of the current
    otherwise.  */
 # define FETCH_WC(dfa, c, wc, eoferr)		\
   do {						\
-    if (! lexleft)				\
+    if (! dfa->lexstate.lexleft)		\
       {						\
         if ((eoferr) != 0)			\
           dfaerror (eoferr);			\
         else					\
-          return lasttok = END;			\
+          return dfa->lexstate.lasttok = END;	\
       }						\
     else					\
       {						\
         wint_t _wc;				\
-        size_t nbytes = mbs_to_wchar (&_wc, lexptr, lexleft, dfa); \
-        cur_mb_len = nbytes;			\
+        size_t nbytes = mbs_to_wchar (&_wc, dfa->lexstate.lexptr, \
+                                      dfa->lexstate.lexleft, dfa); \
+        dfa->lexstate.cur_mb_len = nbytes;	\
         (wc) = _wc;				\
-        (c) = nbytes == 1 ? to_uchar (*lexptr) : EOF;    \
-        lexptr += nbytes;			\
-        lexleft -= nbytes;			\
+        (c) = nbytes == 1 ? to_uchar (*dfa->lexstate.lexptr) : EOF; \
+        dfa->lexstate.lexptr += nbytes;		\
+        dfa->lexstate.lexleft -= nbytes;	\
       }						\
   } while (false)
 
@@ -1051,7 +1056,8 @@ parse_bracket_exp (struct dfa *dfa)
               for (;;)
                 {
                   FETCH_WC (dfa, c, wc, _("unbalanced ["));
-                  if ((c == c1 && *lexptr == ']') || lexleft == 0)
+                  if ((c == c1 && *dfa->lexstate.lexptr == ']')
+                      || dfa->lexstate.lexleft == 0)
                     break;
                   if (len < MAX_BRACKET_STRING_LEN)
                     str[len++] = c;
@@ -1111,7 +1117,7 @@ parse_bracket_exp (struct dfa *dfa)
           /* A bracket expression like [a-[.aa.]] matches an unknown set.
              Treat it like [-a[.aa.]] while parsing it, and
              remember that the set is unknown.  */
-          if (c2 == '[' && *lexptr == '.')
+          if (c2 == '[' && *dfa->lexstate.lexptr == '.')
             {
               known_bracket_exp = false;
               c2 = ']';
@@ -1121,8 +1127,8 @@ parse_bracket_exp (struct dfa *dfa)
             {
               /* In the case [x-], the - is an ordinary hyphen,
                  which is left in c1, the lookahead character.  */
-              lexptr -= cur_mb_len;
-              lexleft += cur_mb_len;
+              dfa->lexstate.lexptr -= dfa->lexstate.cur_mb_len;
+              dfa->lexstate.lexleft += dfa->lexstate.cur_mb_len;
             }
           else
             {
@@ -1222,14 +1228,14 @@ parse_bracket_exp (struct dfa *dfa)
 #define PUSH_LEX_STATE(s)			\
   do						\
     {						\
-      char const *lexptr_saved = lexptr;	\
-      size_t lexleft_saved = lexleft;		\
-      lexptr = (s);				\
-      lexleft = strlen (lexptr)
+      char const *lexptr_saved = dfa->lexstate.lexptr;	\
+      size_t lexleft_saved = dfa->lexstate.lexleft;		\
+      dfa->lexstate.lexptr = (s);				\
+      dfa->lexstate.lexleft = strlen (dfa->lexstate.lexptr)
 
 #define POP_LEX_STATE()				\
-      lexptr = lexptr_saved;			\
-      lexleft = lexleft_saved;			\
+      dfa->lexstate.lexptr = lexptr_saved;			\
+      dfa->lexstate.lexleft = lexleft_saved;			\
     }						\
   while (false)
 
@@ -1249,14 +1255,14 @@ lex (struct dfa *dfa)
      "if (backslash) ...".  */
   for (i = 0; i < 2; ++i)
     {
-      FETCH_WC (dfa, c, wctok, NULL);
+      FETCH_WC (dfa, c, dfa->lexstate.wctok, NULL);
 
       switch (c)
         {
         case '\\':
           if (backslash)
             goto normal_char;
-          if (lexleft == 0)
+          if (dfa->lexstate.lexleft == 0)
             dfaerror (_("unfinished \\ escape"));
           backslash = true;
           break;
@@ -1265,24 +1271,28 @@ lex (struct dfa *dfa)
           if (backslash)
             goto normal_char;
           if (syntax_bits & RE_CONTEXT_INDEP_ANCHORS
-              || lasttok == END || lasttok == LPAREN || lasttok == OR)
-            return lasttok = BEGLINE;
+              || dfa->lexstate.lasttok == END || dfa->lexstate.lasttok == LPAREN
+              || dfa->lexstate.lasttok == OR)
+            return dfa->lexstate.lasttok = BEGLINE;
           goto normal_char;
 
         case '$':
           if (backslash)
             goto normal_char;
           if (syntax_bits & RE_CONTEXT_INDEP_ANCHORS
-              || lexleft == 0
+              || dfa->lexstate.lexleft == 0
               || (syntax_bits & RE_NO_BK_PARENS
-                  ? lexleft > 0 && *lexptr == ')'
-                  : lexleft > 1 && lexptr[0] == '\\' && lexptr[1] == ')')
+                  ? dfa->lexstate.lexleft > 0 && *dfa->lexstate.lexptr == ')'
+                  : dfa->lexstate.lexleft > 1 && dfa->lexstate.lexptr[0] == '\\'
+                    && dfa->lexstate.lexptr[1] == ')')
               || (syntax_bits & RE_NO_BK_VBAR
-                  ? lexleft > 0 && *lexptr == '|'
-                  : lexleft > 1 && lexptr[0] == '\\' && lexptr[1] == '|')
+                  ? dfa->lexstate.lexleft > 0 && *dfa->lexstate.lexptr == '|'
+                  : dfa->lexstate.lexleft > 1 && dfa->lexstate.lexptr[0] == '\\'
+                    && dfa->lexstate.lexptr[1] == '|')
               || ((syntax_bits & RE_NEWLINE_ALT)
-                  && lexleft > 0 && *lexptr == '\n'))
-            return lasttok = ENDLINE;
+                  && dfa->lexstate.lexleft > 0
+                  && *dfa->lexstate.lexptr == '\n'))
+            return dfa->lexstate.lasttok = ENDLINE;
           goto normal_char;
 
         case '1':
@@ -1296,39 +1306,45 @@ lex (struct dfa *dfa)
         case '9':
           if (backslash && !(syntax_bits & RE_NO_BK_REFS))
             {
-              laststart = false;
-              return lasttok = BACKREF;
+              dfa->lexstate.laststart = false;
+              return dfa->lexstate.lasttok = BACKREF;
             }
           goto normal_char;
 
         case '`':
           if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
-            return lasttok = BEGLINE; /* FIXME: should be beginning of string */
+            {
+              /* FIXME: should be beginning of string */
+              return dfa->lexstate.lasttok = BEGLINE;
+            }
           goto normal_char;
 
         case '\'':
           if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
-            return lasttok = ENDLINE;   /* FIXME: should be end of string */
+            {
+              /* FIXME: should be end of string */
+              return dfa->lexstate.lasttok = ENDLINE;
+            }
           goto normal_char;
 
         case '<':
           if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
-            return lasttok = BEGWORD;
+            return dfa->lexstate.lasttok = BEGWORD;
           goto normal_char;
 
         case '>':
           if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
-            return lasttok = ENDWORD;
+            return dfa->lexstate.lasttok = ENDWORD;
           goto normal_char;
 
         case 'b':
           if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
-            return lasttok = LIMWORD;
+            return dfa->lexstate.lasttok = LIMWORD;
           goto normal_char;
 
         case 'B':
           if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
-            return lasttok = NOTLIMWORD;
+            return dfa->lexstate.lasttok = NOTLIMWORD;
           goto normal_char;
 
         case '?':
@@ -1336,32 +1352,32 @@ lex (struct dfa *dfa)
             goto normal_char;
           if (backslash != ((syntax_bits & RE_BK_PLUS_QM) != 0))
             goto normal_char;
-          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && laststart)
+          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && dfa->lexstate.laststart)
             goto normal_char;
-          return lasttok = QMARK;
+          return dfa->lexstate.lasttok = QMARK;
 
         case '*':
           if (backslash)
             goto normal_char;
-          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && laststart)
+          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && dfa->lexstate.laststart)
             goto normal_char;
-          return lasttok = STAR;
+          return dfa->lexstate.lasttok = STAR;
 
         case '+':
           if (syntax_bits & RE_LIMITED_OPS)
             goto normal_char;
           if (backslash != ((syntax_bits & RE_BK_PLUS_QM) != 0))
             goto normal_char;
-          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && laststart)
+          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && dfa->lexstate.laststart)
             goto normal_char;
-          return lasttok = PLUS;
+          return dfa->lexstate.lasttok = PLUS;
 
         case '{':
           if (!(syntax_bits & RE_INTERVALS))
             goto normal_char;
           if (backslash != ((syntax_bits & RE_NO_BK_BRACES) == 0))
             goto normal_char;
-          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && laststart)
+          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && dfa->lexstate.laststart)
             goto normal_char;
 
           /* Cases:
@@ -1371,79 +1387,86 @@ lex (struct dfa *dfa)
              {,} - 0 to infinity (same as '*')
              {M,N} - M through N */
           {
-            char const *p = lexptr;
-            char const *lim = p + lexleft;
-            minrep = maxrep = -1;
+            char const *p = dfa->lexstate.lexptr;
+            char const *lim = p + dfa->lexstate.lexleft;
+            dfa->lexstate.minrep = dfa->lexstate.maxrep = -1;
             for (; p != lim && ISASCIIDIGIT (*p); p++)
               {
-                if (minrep < 0)
-                  minrep = *p - '0';
+                if (dfa->lexstate.minrep < 0)
+                  dfa->lexstate.minrep = *p - '0';
                 else
-                  minrep = MIN (RE_DUP_MAX + 1, minrep * 10 + *p - '0');
+                  dfa->lexstate.minrep = MIN (RE_DUP_MAX + 1,
+                                              (dfa->lexstate.minrep
+                                               * 10 + *p - '0'));
               }
             if (p != lim)
               {
                 if (*p != ',')
-                  maxrep = minrep;
+                  dfa->lexstate.maxrep = dfa->lexstate.minrep;
                 else
                   {
-                    if (minrep < 0)
-                      minrep = 0;
+                    if (dfa->lexstate.minrep < 0)
+                      dfa->lexstate.minrep = 0;
                     while (++p != lim && ISASCIIDIGIT (*p))
                       {
-                        if (maxrep < 0)
-                          maxrep = *p - '0';
+                        if (dfa->lexstate.maxrep < 0)
+                          dfa->lexstate.maxrep = *p - '0';
                         else
-                          maxrep = MIN (RE_DUP_MAX + 1, maxrep * 10 + *p - '0');
+                          dfa->lexstate.maxrep = MIN (RE_DUP_MAX + 1,
+                                                      (dfa->lexstate.maxrep
+                                                       * 10 + *p - '0'));
                       }
                   }
               }
             if (! ((! backslash || (p != lim && *p++ == '\\'))
                    && p != lim && *p++ == '}'
-                   && 0 <= minrep && (maxrep < 0 || minrep <= maxrep)))
+                   && 0 <= dfa->lexstate.minrep
+                   && (dfa->lexstate.maxrep < 0
+                       || dfa->lexstate.minrep <= dfa->lexstate.maxrep)))
               {
                 if (syntax_bits & RE_INVALID_INTERVAL_ORD)
                   goto normal_char;
                 dfaerror (_("invalid content of \\{\\}"));
               }
-            if (RE_DUP_MAX < maxrep)
+            if (RE_DUP_MAX < dfa->lexstate.maxrep)
               dfaerror (_("regular expression too big"));
-            lexptr = p;
-            lexleft = lim - p;
+            dfa->lexstate.lexptr = p;
+            dfa->lexstate.lexleft = lim - p;
           }
-          laststart = false;
-          return lasttok = REPMN;
+          dfa->lexstate.laststart = false;
+          return dfa->lexstate.lasttok = REPMN;
 
         case '|':
           if (syntax_bits & RE_LIMITED_OPS)
             goto normal_char;
           if (backslash != ((syntax_bits & RE_NO_BK_VBAR) == 0))
             goto normal_char;
-          laststart = true;
-          return lasttok = OR;
+          dfa->lexstate.laststart = true;
+          return dfa->lexstate.lasttok = OR;
 
         case '\n':
           if (syntax_bits & RE_LIMITED_OPS
               || backslash || !(syntax_bits & RE_NEWLINE_ALT))
             goto normal_char;
-          laststart = true;
-          return lasttok = OR;
+          dfa->lexstate.laststart = true;
+          return dfa->lexstate.lasttok = OR;
 
         case '(':
           if (backslash != ((syntax_bits & RE_NO_BK_PARENS) == 0))
             goto normal_char;
-          ++parens;
-          laststart = true;
-          return lasttok = LPAREN;
+          ++dfa->lexstate.parens;
+          dfa->lexstate.laststart = true;
+          return dfa->lexstate.lasttok = LPAREN;
 
         case ')':
           if (backslash != ((syntax_bits & RE_NO_BK_PARENS) == 0))
             goto normal_char;
-          if (parens == 0 && syntax_bits & RE_UNMATCHED_RIGHT_PAREN_ORD)
+          if (dfa->lexstate.parens == 0
+              && syntax_bits & RE_UNMATCHED_RIGHT_PAREN_ORD)
             goto normal_char;
-          --parens;
-          laststart = false;
-          return lasttok = RPAREN;
+          --dfa->lexstate.parens;
+          dfa->lexstate.laststart = false;
+          return dfa->lexstate.lasttok = RPAREN;
 
         case '.':
           if (backslash)
@@ -1452,8 +1475,8 @@ lex (struct dfa *dfa)
             {
               /* In multibyte environment period must match with a single
                  character not a byte.  So we use ANYCHAR.  */
-              laststart = false;
-              return lasttok = ANYCHAR;
+              dfa->lexstate.laststart = false;
+              return dfa->lexstate.lasttok = ANYCHAR;
             }
           zeroset (ccl);
           notset (ccl);
@@ -1461,8 +1484,8 @@ lex (struct dfa *dfa)
             clrbit ('\n', ccl);
           if (syntax_bits & RE_DOT_NOT_NULL)
             clrbit ('\0', ccl);
-          laststart = false;
-          return lasttok = CSET + dfa_charclass_index (dfa, ccl);
+          dfa->lexstate.laststart = false;
+          return dfa->lexstate.lasttok = CSET + dfa_charclass_index (dfa, ccl);
 
         case 's':
         case 'S':
@@ -1476,8 +1499,9 @@ lex (struct dfa *dfa)
                   setbit (c2, ccl);
               if (c == 'S')
                 notset (ccl);
-              laststart = false;
-              return lasttok = CSET + dfa_charclass_index (dfa, ccl);
+              dfa->lexstate.laststart = false;
+              return dfa->lexstate.lasttok = CSET + dfa_charclass_index (dfa,
+                                                                         ccl);
             }
 
           /* FIXME: see if optimizing this, as is done with ANYCHAR and
@@ -1488,12 +1512,12 @@ lex (struct dfa *dfa)
              strings, each minus its "already processed" '['.  */
           PUSH_LEX_STATE (c == 's' ? "[:space:]]" : "^[:space:]]");
 
-          lasttok = parse_bracket_exp (dfa);
+          dfa->lexstate.lasttok = parse_bracket_exp (dfa);
 
           POP_LEX_STATE ();
 
-          laststart = false;
-          return lasttok;
+          dfa->lexstate.laststart = false;
+          return dfa->lexstate.lasttok;
 
         case 'w':
         case 'W':
@@ -1508,8 +1532,9 @@ lex (struct dfa *dfa)
                   setbit (c2, ccl);
               if (c == 'W')
                 notset (ccl);
-              laststart = false;
-              return lasttok = CSET + dfa_charclass_index (dfa, ccl);
+              dfa->lexstate.laststart = false;
+              return dfa->lexstate.lasttok = CSET + dfa_charclass_index (dfa,
+                                                                         ccl);
             }
 
           /* FIXME: see if optimizing this, as is done with ANYCHAR and
@@ -1520,35 +1545,36 @@ lex (struct dfa *dfa)
              strings, each minus its "already processed" '['.  */
           PUSH_LEX_STATE (c == 'w' ? "_[:alnum:]]" : "^_[:alnum:]]");
 
-          lasttok = parse_bracket_exp (dfa);
+          dfa->lexstate.lasttok = parse_bracket_exp (dfa);
 
           POP_LEX_STATE ();
 
-          laststart = false;
-          return lasttok;
+          dfa->lexstate.laststart = false;
+          return dfa->lexstate.lasttok;
 
         case '[':
           if (backslash)
             goto normal_char;
-          laststart = false;
-          return lasttok = parse_bracket_exp (dfa);
+          dfa->lexstate.laststart = false;
+          return dfa->lexstate.lasttok = parse_bracket_exp (dfa);
 
         default:
         normal_char:
-          laststart = false;
+          dfa->lexstate.laststart = false;
           /* For multibyte character sets, folding is done in atom.  Always
              return WCHAR.  */
           if (dfa->multibyte)
-            return lasttok = WCHAR;
+            return dfa->lexstate.lasttok = WCHAR;
 
           if (case_fold && isalpha (c))
             {
               zeroset (ccl);
               setbit_case_fold_c (c, ccl);
-              return lasttok = CSET + dfa_charclass_index (dfa, ccl);
+              return dfa->lexstate.lasttok = CSET + dfa_charclass_index (dfa,
+                                                                         ccl);
             }
 
-          return lasttok = c;
+          return dfa->lexstate.lasttok = c;
         }
     }
 
@@ -1662,19 +1688,19 @@ addtok_wc (struct dfa *dfa, wint_t wc)
   size_t stored_bytes = wcrtomb ((char *) buf, wc, &s);
 
   if (stored_bytes != (size_t) -1)
-    cur_mb_len = stored_bytes;
+    dfa->lexstate.cur_mb_len = stored_bytes;
   else
     {
       /* This is merely stop-gap.  buf[0] is undefined, yet skipping
          the addtok_mb call altogether can corrupt the heap.  */
-      cur_mb_len = 1;
+      dfa->lexstate.cur_mb_len = 1;
       buf[0] = 0;
     }
 
-  addtok_mb (dfa, buf[0], cur_mb_len == 1 ? 3 : 1);
-  for (i = 1; i < cur_mb_len; i++)
+  addtok_mb (dfa, buf[0], dfa->lexstate.cur_mb_len == 1 ? 3 : 1);
+  for (i = 1; i < dfa->lexstate.cur_mb_len; i++)
     {
-      addtok_mb (dfa, buf[i], i == cur_mb_len - 1 ? 2 : 0);
+      addtok_mb (dfa, buf[i], i == dfa->lexstate.cur_mb_len - 1 ? 2 : 0);
       addtok (dfa, CAT);
     }
 }
@@ -1777,16 +1803,17 @@ atom (struct dfa *dfa)
 {
   if (tok == WCHAR)
     {
-      if (wctok == WEOF)
+      if (dfa->lexstate.wctok == WEOF)
         addtok (dfa, BACKREF);
       else
         {
-          addtok_wc (dfa, wctok);
+          addtok_wc (dfa, dfa->lexstate.wctok);
 
           if (case_fold)
             {
               wchar_t folded[CASE_FOLDED_BUFSIZE];
-              unsigned int i, n = case_folded_counterparts (wctok, folded);
+              unsigned int i, n = case_folded_counterparts (dfa->lexstate.wctok,
+                                                            folded);
               for (i = 0; i < n; i++)
                 {
                   addtok_wc (dfa, folded[i]);
@@ -1872,20 +1899,20 @@ closure (struct dfa *dfa)
 
   atom (dfa);
   while (tok == QMARK || tok == STAR || tok == PLUS || tok == REPMN)
-    if (tok == REPMN && (minrep || maxrep))
+    if (tok == REPMN && (dfa->lexstate.minrep || dfa->lexstate.maxrep))
       {
         ntokens = nsubtoks (dfa, dfa->tindex);
         tindex = dfa->tindex - ntokens;
-        if (maxrep < 0)
+        if (dfa->lexstate.maxrep < 0)
           addtok (dfa, PLUS);
-        if (minrep == 0)
+        if (dfa->lexstate.minrep == 0)
           addtok (dfa, QMARK);
-        for (i = 1; i < minrep; ++i)
+        for (i = 1; i < dfa->lexstate.minrep; ++i)
           {
             copytoks (dfa, tindex, ntokens);
             addtok (dfa, CAT);
           }
-        for (; i < maxrep; ++i)
+        for (; i < dfa->lexstate.maxrep; ++i)
           {
             copytoks (dfa, tindex, ntokens);
             addtok (dfa, QMARK);
@@ -1935,14 +1962,14 @@ regexp (struct dfa *dfa)
 static void
 dfaparse (char const *s, size_t len, struct dfa *d)
 {
-  lexptr = s;
-  lexleft = len;
-  lasttok = END;
-  laststart = true;
-  parens = 0;
+  d->lexstate.lexptr = s;
+  d->lexstate.lexleft = len;
+  d->lexstate.lasttok = END;
+  d->lexstate.laststart = true;
+  d->lexstate.parens = 0;
   if (d->multibyte)
     {
-      cur_mb_len = 0;
+      d->lexstate.cur_mb_len = 0;
       memset (&d->mbs, 0, sizeof d->mbs);
     }
 
@@ -3450,6 +3477,7 @@ dfainit (struct dfa *d)
   d->multibyte = MB_CUR_MAX > 1;
   d->dfaexec = d->multibyte ? dfaexec_mb : dfaexec_sb;
   d->fast = !d->multibyte;
+  d->lexstate.cur_mb_len = 1;
 }
 
 /* Return true if every construct in D is supported by this DFA matcher.  */
-- 
2.8.1





Information forwarded to bug-grep <at> gnu.org:
bug#24259; Package grep. (Thu, 18 Aug 2016 10:52:02 GMT) Full text and rfc822 format available.

Message #17 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: bug-grep <at> gnu.org
Cc: Zev Weiss <zev <at> bewilderbeest.net>
Subject: [PATCH 5/6] dfa: thread-safety: eliminate static local variables
Date: Thu, 18 Aug 2016 05:50:18 -0500
* src/dfa.c: Replace utf8 and unibyte_c static local variables with
static globals initialized by a new function dfa_init() which must be
called before any other dfa*() functions.
(dfa_using_utf8): Rename using_utf8() to dfa_using_utf8() for
consistency with other exported functions.
* src/dfa.h (dfa_using_utf8): Rename using_utf8() to dfa_using_utf8();
also add _GL_ATTRIBUTE_PURE.
(dfa_init): New function.
* src/grep.c (main), tests/dfa-match-aux.c (main): Call dfa_init().
* src/dfasearch.c (EGexecute), src/kwsearch.c (Fexecute),
src/pcresearch.c (Pcompile): Replace using_utf8() with
dfa_using_utf8().
---
 src/dfa.c             | 62 +++++++++++++++++++++++++++------------------------
 src/dfa.h             |  5 ++++-
 src/dfasearch.c       |  2 +-
 src/grep.c            |  2 ++
 src/kwsearch.c        |  2 +-
 src/pcresearch.c      |  2 +-
 tests/dfa-match-aux.c |  2 ++
 7 files changed, 44 insertions(+), 33 deletions(-)

diff --git a/src/dfa.c b/src/dfa.c
index ae1b340..970b51f 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -756,6 +756,16 @@ char_context (struct dfa *dfa, unsigned char c)
   return CTX_NONE;
 }
 
+/* UTF-8 encoding allows some optimizations that we can't otherwise
+   assume in a multibyte encoding.  */
+static bool using_utf8;
+
+bool
+dfa_using_utf8 (void)
+{
+  return using_utf8;
+}
+
 /* Entry point to set syntax options.  */
 void
 dfasyntax (struct dfa *dfa, reg_syntax_t bits, bool fold, unsigned char eol)
@@ -788,7 +798,7 @@ dfasyntax (struct dfa *dfa, reg_syntax_t bits, bool fold, unsigned char eol)
 
       /* POSIX requires that the five bytes in "\n\r./" (including the
          terminating NUL) cannot occur inside a multibyte character.  */
-      dfa->syntax.never_trail[uc] = (using_utf8 () ? (uc & 0xc0) != 0x80
+      dfa->syntax.never_trail[uc] = (using_utf8 ? (uc & 0xc0) != 0x80
                                      : strchr ("\n\r./", uc) != NULL);
     }
 }
@@ -821,21 +831,21 @@ setbit_case_fold_c (int b, charclass c)
       setbit (i, c);
 }
 
+static void check_utf8 (void)
+{
+  wchar_t wc;
+  mbstate_t mbs = { 0 };
+  using_utf8 = mbrtowc (&wc, "\xc4\x80", 2, &mbs) == 2 && wc == 0x100;
+}
 
+static bool unibyte_c;
 
-/* UTF-8 encoding allows some optimizations that we can't otherwise
-   assume in a multibyte encoding.  */
-bool
-using_utf8 (void)
+static void check_unibyte_c (void)
 {
-  static int utf8 = -1;
-  if (utf8 < 0)
-    {
-      wchar_t wc;
-      mbstate_t mbs = { 0 };
-      utf8 = mbrtowc (&wc, "\xc4\x80", 2, &mbs) == 2 && wc == 0x100;
-    }
-  return utf8;
+  char const *locale = setlocale (LC_ALL, NULL);
+  unibyte_c = (!locale
+               || STREQ (locale, "C")
+               || STREQ (locale, "POSIX"));
 }
 
 /* The current locale is known to be a unibyte locale
@@ -862,20 +872,7 @@ using_simple_locale (struct dfa *dfa)
      && '}' == 125 && '~' == 126)
   };
 
-  if (! native_c_charset || dfa->multibyte)
-    return false;
-  else
-    {
-      static int unibyte_c = -1;
-      if (unibyte_c < 0)
-        {
-          char const *locale = setlocale (LC_ALL, NULL);
-          unibyte_c = (!locale
-                       || STREQ (locale, "C")
-                       || STREQ (locale, "POSIX"));
-        }
-      return unibyte_c;
-    }
+  return (!native_c_charset || dfa->multibyte) ? false : unibyte_c;
 }
 
 /* Fetch the next lexical input character.  Set C (of type int) to the
@@ -1842,7 +1839,7 @@ atom (struct dfa *dfa)
 
       dfa->parsestate.tok = lex (dfa);
     }
-  else if (dfa->parsestate.tok == ANYCHAR && using_utf8 ())
+  else if (dfa->parsestate.tok == ANYCHAR && using_utf8)
     {
       /* For UTF-8 expand the period to a series of CSETs that define a valid
          UTF-8 character.  This avoids using the slow multibyte path.  I'm
@@ -3523,7 +3520,7 @@ dfaoptimize (struct dfa *d)
   size_t i;
   bool have_backref = false;
 
-  if (!using_utf8 ())
+  if (!using_utf8)
     return;
 
   for (i = 0; i < d->tindex; ++i)
@@ -4201,4 +4198,11 @@ dfaalloc (void)
   return d;
 }
 
+void
+dfa_init (void)
+{
+  check_utf8 ();
+  check_unibyte_c ();
+}
+
 /* vim:set shiftwidth=2: */
diff --git a/src/dfa.h b/src/dfa.h
index 014ae96..585390a 100644
--- a/src/dfa.h
+++ b/src/dfa.h
@@ -100,4 +100,7 @@ extern void dfawarn (const char *);
    The user must supply a dfaerror.  */
 extern _Noreturn void dfaerror (const char *);
 
-extern bool using_utf8 (void);
+extern bool dfa_using_utf8 (void) _GL_ATTRIBUTE_PURE;
+
+/* This must be called before calling any of the above dfa*() functions. */
+extern void dfa_init (void);
diff --git a/src/dfasearch.c b/src/dfasearch.c
index 3dbf76b..10c4f51 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -277,7 +277,7 @@ EGexecute (char *buf, size_t size, size_t *match_size,
 
               if (exact_kwset_match)
                 {
-                  if (MB_CUR_MAX == 1 || using_utf8 ())
+                  if (MB_CUR_MAX == 1 || dfa_using_utf8 ())
                     goto success;
                   if (mb_start < beg)
                     mb_start = beg;
diff --git a/src/grep.c b/src/grep.c
index a82da61..bd1c5cc 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -2351,6 +2351,8 @@ main (int argc, char **argv)
   textdomain (PACKAGE);
 #endif
 
+  dfa_init ();
+
   atexit (clean_up_stdout);
 
   last_recursive = 0;
diff --git a/src/kwsearch.c b/src/kwsearch.c
index d2afa40..fb77280 100644
--- a/src/kwsearch.c
+++ b/src/kwsearch.c
@@ -93,7 +93,7 @@ Fexecute (char *buf, size_t size, size_t *match_size,
     mb_check = longest = false;
   else
     {
-      mb_check = MB_CUR_MAX > 1 && !using_utf8 ();
+      mb_check = MB_CUR_MAX > 1 && !dfa_using_utf8 ();
       longest = mb_check || start_ptr || match_words;
     }
 
diff --git a/src/pcresearch.c b/src/pcresearch.c
index f6e72b0..3f76603 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -114,7 +114,7 @@ Pcompile (char const *pattern, size_t size)
 
   if (1 < MB_CUR_MAX)
     {
-      if (! using_utf8 ())
+      if (! dfa_using_utf8 ())
         error (EXIT_TROUBLE, 0,
                _("-P supports only unibyte and UTF-8 locales"));
       multibyte_locale = true;
diff --git a/tests/dfa-match-aux.c b/tests/dfa-match-aux.c
index 25b0535..e651735 100644
--- a/tests/dfa-match-aux.c
+++ b/tests/dfa-match-aux.c
@@ -54,6 +54,8 @@ main (int argc, char **argv)
 
   setlocale (LC_ALL, "");
 
+  dfa_init ();
+
   dfa = dfaalloc ();
   dfasyntax (dfa, RE_SYNTAX_GREP | RE_NO_EMPTY_RANGES, 0, '\n');
   dfacomp (argv[1], strlen (argv[1]), dfa, 0);
-- 
2.8.1





Information forwarded to bug-grep <at> gnu.org:
bug#24259; Package grep. (Thu, 18 Aug 2016 10:52:03 GMT) Full text and rfc822 format available.

Message #20 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: bug-grep <at> gnu.org
Cc: Zev Weiss <zev <at> bewilderbeest.net>
Subject: [PATCH 4/6] dfa: thread-safety: move regex syntax configuration into
 struct dfa
Date: Thu, 18 Aug 2016 05:50:17 -0500
* src/dfa.c: move global variables holding regex syntax configuration
into a new struct (`struct regex_syntax') and add an instance of it to
struct dfa.  All references to the globals are replaced with
references to the dfa struct's new member.  As a side effect, a
`struct dfa' must be allocated with dfaalloc() and passed to
dfasyntax().
* src/dfa.h (dfasyntax): Add new struct dfa* parameter.
* src/dfasearch.c (GEAcompile): Allocate `dfa' earlier and pass it to
dfasyntax().
* tests/dfa-match-aux.c (main): Pass `dfa' to dfasyntax().
---
 src/dfa.c             | 244 +++++++++++++++++++++++++-------------------------
 src/dfa.h             |   8 +-
 src/dfasearch.c       |   5 +-
 tests/dfa-match-aux.c |   2 +-
 4 files changed, 132 insertions(+), 127 deletions(-)

diff --git a/src/dfa.c b/src/dfa.c
index 858bc55..ae1b340 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -328,6 +328,32 @@ struct mb_char_classes
   size_t nchars;
 };
 
+struct regex_syntax
+{
+  /* Syntax bits controlling the behavior of the lexical analyzer.  */
+  reg_syntax_t syntax_bits;
+  bool syntax_bits_set;
+
+  /* Flag for case-folding letters into sets.  */
+  bool case_fold;
+
+  /* End-of-line byte in data.  */
+  unsigned char eolbyte;
+
+  /* Cache of char-context values.  */
+  int sbit[NOTCHAR];
+
+  /* If never_trail[B], the byte B cannot be a non-initial byte in a
+     multibyte character.  */
+  bool never_trail[NOTCHAR];
+
+  /* Set of characters considered letters.  */
+  charclass letters;
+
+  /* Set of characters that are newline.  */
+  charclass newline;
+};
+
 /* Lexical analyzer.  All the dross that deals with the obnoxious
    GNU Regex syntax bits is located here.  The poor, suffering
    reader is referred to the GNU Regex documentation for the
@@ -366,6 +392,9 @@ struct parser_state
 /* A compiled regular expression.  */
 struct dfa
 {
+  /* Syntax configuration */
+  struct regex_syntax syntax;
+
   /* Fields filled by the scanner.  */
   charclass *charclasses;       /* Array of character sets for CSET tokens.  */
   size_t cindex;                /* Index for adding new charclasses.  */
@@ -711,29 +740,6 @@ dfa_charclass_index (struct dfa *d, charclass const s)
   return i;
 }
 
-/* Syntax bits controlling the behavior of the lexical analyzer.  */
-static reg_syntax_t syntax_bits;
-static bool syntax_bits_set;
-
-/* Flag for case-folding letters into sets.  */
-static bool case_fold;
-
-/* End-of-line byte in data.  */
-static unsigned char eolbyte;
-
-/* Cache of char-context values.  */
-static int sbit[NOTCHAR];
-
-/* If never_trail[B], the byte B cannot be a non-initial byte in a
-   multibyte character.  */
-static bool never_trail[NOTCHAR];
-
-/* Set of characters considered letters.  */
-static charclass letters;
-
-/* Set of characters that are newline.  */
-static charclass newline;
-
 static bool
 unibyte_word_constituent (unsigned char c)
 {
@@ -741,9 +747,9 @@ unibyte_word_constituent (unsigned char c)
 }
 
 static int
-char_context (unsigned char c)
+char_context (struct dfa *dfa, unsigned char c)
 {
-  if (c == eolbyte)
+  if (c == dfa->syntax.eolbyte)
     return CTX_NEWLINE;
   if (unibyte_word_constituent (c))
     return CTX_LETTER;
@@ -752,13 +758,13 @@ char_context (unsigned char c)
 
 /* Entry point to set syntax options.  */
 void
-dfasyntax (reg_syntax_t bits, bool fold, unsigned char eol)
+dfasyntax (struct dfa *dfa, reg_syntax_t bits, bool fold, unsigned char eol)
 {
   int i;
-  syntax_bits_set = true;
-  syntax_bits = bits;
-  case_fold = fold;
-  eolbyte = eol;
+  dfa->syntax.syntax_bits_set = true;
+  dfa->syntax.syntax_bits = bits;
+  dfa->syntax.case_fold = fold;
+  dfa->syntax.eolbyte = eol;
 
   for (i = CHAR_MIN; i <= CHAR_MAX; ++i)
     {
@@ -769,21 +775,21 @@ dfasyntax (reg_syntax_t bits, bool fold, unsigned char eol)
       mbrtowc_cache[uc] = mbrtowc (&wc, &c, 1, &s) <= 1 ? wc : WEOF;
 
       /* Now that mbrtowc_cache[uc] is set, use it to calculate sbit.  */
-      sbit[uc] = char_context (uc);
-      switch (sbit[uc])
+      dfa->syntax.sbit[uc] = char_context (dfa, uc);
+      switch (dfa->syntax.sbit[uc])
         {
         case CTX_LETTER:
-          setbit (uc, letters);
+          setbit (uc, dfa->syntax.letters);
           break;
         case CTX_NEWLINE:
-          setbit (uc, newline);
+          setbit (uc, dfa->syntax.newline);
           break;
         }
 
       /* POSIX requires that the five bytes in "\n\r./" (including the
          terminating NUL) cannot occur inside a multibyte character.  */
-      never_trail[uc] = (using_utf8 () ? (uc & 0xc0) != 0x80
-                         : strchr ("\n\r./", uc) != NULL);
+      dfa->syntax.never_trail[uc] = (using_utf8 () ? (uc & 0xc0) != 0x80
+                                     : strchr ("\n\r./", uc) != NULL);
     }
 }
 
@@ -1062,7 +1068,7 @@ parse_bracket_exp (struct dfa *dfa)
         {
           FETCH_WC (dfa, c1, wc1, _("unbalanced ["));
 
-          if ((c1 == ':' && (syntax_bits & RE_CHAR_CLASSES))
+          if ((c1 == ':' && (dfa->syntax.syntax_bits & RE_CHAR_CLASSES))
               || c1 == '.' || c1 == '=')
             {
               enum { MAX_BRACKET_STRING_LEN = 32 };
@@ -1091,8 +1097,9 @@ parse_bracket_exp (struct dfa *dfa)
                    worry about that possibility.  */
                 {
                   char const *class
-                    = (case_fold && (STREQ (str, "upper")
-                                     || STREQ (str, "lower")) ? "alpha" : str);
+                    = (dfa->syntax.case_fold && (STREQ (str, "upper")
+                                                 || STREQ (str, "lower")) ?
+                                                      "alpha" : str);
                   const struct dfa_ctype *pred = find_pred (class);
                   if (!pred)
                     dfaerror (_("invalid character class"));
@@ -1118,7 +1125,7 @@ parse_bracket_exp (struct dfa *dfa)
              are already set up.  */
         }
 
-      if (c == '\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
+      if (c == '\\' && (dfa->syntax.syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
         FETCH_WC (dfa, c, wc, _("unbalanced ["));
 
       if (c1 == NOTCHAR)
@@ -1147,7 +1154,8 @@ parse_bracket_exp (struct dfa *dfa)
             }
           else
             {
-              if (c2 == '\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
+              if (c2 == '\\' && (dfa->syntax.syntax_bits
+                                 & RE_BACKSLASH_ESCAPE_IN_LISTS))
                 FETCH_WC (dfa, c2, wc2, _("unbalanced ["));
 
               colon_warning_state |= 8;
@@ -1163,7 +1171,7 @@ parse_bracket_exp (struct dfa *dfa)
                       int ci;
                       for (ci = c; ci <= c2; ci++)
                         setbit (ci, ccl);
-                      if (case_fold)
+                      if (dfa->syntax.case_fold)
                         {
                           int uc = toupper (c);
                           int uc2 = toupper (c2);
@@ -1187,7 +1195,7 @@ parse_bracket_exp (struct dfa *dfa)
 
       if (!dfa->multibyte)
         {
-          if (case_fold)
+          if (dfa->syntax.case_fold)
             setbit_case_fold_c (c, ccl);
           else
             setbit (c, ccl);
@@ -1200,7 +1208,7 @@ parse_bracket_exp (struct dfa *dfa)
         {
           wchar_t folded[CASE_FOLDED_BUFSIZE + 1];
           unsigned int i;
-          unsigned int n = (case_fold
+          unsigned int n = (dfa->syntax.case_fold
                             ? case_folded_counterparts (wc, folded + 1) + 1
                             : 1);
           folded[0] = wc;
@@ -1233,7 +1241,7 @@ parse_bracket_exp (struct dfa *dfa)
     {
       assert (!dfa->multibyte);
       notset (ccl);
-      if (syntax_bits & RE_HAT_LISTS_NOT_NEWLINE)
+      if (dfa->syntax.syntax_bits & RE_HAT_LISTS_NOT_NEWLINE)
         clrbit ('\n', ccl);
     }
 
@@ -1285,7 +1293,7 @@ lex (struct dfa *dfa)
         case '^':
           if (backslash)
             goto normal_char;
-          if (syntax_bits & RE_CONTEXT_INDEP_ANCHORS
+          if (dfa->syntax.syntax_bits & RE_CONTEXT_INDEP_ANCHORS
               || dfa->lexstate.lasttok == END || dfa->lexstate.lasttok == LPAREN
               || dfa->lexstate.lasttok == OR)
             return dfa->lexstate.lasttok = BEGLINE;
@@ -1294,17 +1302,17 @@ lex (struct dfa *dfa)
         case '$':
           if (backslash)
             goto normal_char;
-          if (syntax_bits & RE_CONTEXT_INDEP_ANCHORS
+          if (dfa->syntax.syntax_bits & RE_CONTEXT_INDEP_ANCHORS
               || dfa->lexstate.lexleft == 0
-              || (syntax_bits & RE_NO_BK_PARENS
+              || (dfa->syntax.syntax_bits & RE_NO_BK_PARENS
                   ? dfa->lexstate.lexleft > 0 && *dfa->lexstate.lexptr == ')'
                   : dfa->lexstate.lexleft > 1 && dfa->lexstate.lexptr[0] == '\\'
                     && dfa->lexstate.lexptr[1] == ')')
-              || (syntax_bits & RE_NO_BK_VBAR
+              || (dfa->syntax.syntax_bits & RE_NO_BK_VBAR
                   ? dfa->lexstate.lexleft > 0 && *dfa->lexstate.lexptr == '|'
                   : dfa->lexstate.lexleft > 1 && dfa->lexstate.lexptr[0] == '\\'
                     && dfa->lexstate.lexptr[1] == '|')
-              || ((syntax_bits & RE_NEWLINE_ALT)
+              || ((dfa->syntax.syntax_bits & RE_NEWLINE_ALT)
                   && dfa->lexstate.lexleft > 0
                   && *dfa->lexstate.lexptr == '\n'))
             return dfa->lexstate.lasttok = ENDLINE;
@@ -1319,7 +1327,7 @@ lex (struct dfa *dfa)
         case '7':
         case '8':
         case '9':
-          if (backslash && !(syntax_bits & RE_NO_BK_REFS))
+          if (backslash && !(dfa->syntax.syntax_bits & RE_NO_BK_REFS))
             {
               dfa->lexstate.laststart = false;
               return dfa->lexstate.lasttok = BACKREF;
@@ -1327,7 +1335,7 @@ lex (struct dfa *dfa)
           goto normal_char;
 
         case '`':
-          if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
+          if (backslash && !(dfa->syntax.syntax_bits & RE_NO_GNU_OPS))
             {
               /* FIXME: should be beginning of string */
               return dfa->lexstate.lasttok = BEGLINE;
@@ -1335,7 +1343,7 @@ lex (struct dfa *dfa)
           goto normal_char;
 
         case '\'':
-          if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
+          if (backslash && !(dfa->syntax.syntax_bits & RE_NO_GNU_OPS))
             {
               /* FIXME: should be end of string */
               return dfa->lexstate.lasttok = ENDLINE;
@@ -1343,56 +1351,60 @@ lex (struct dfa *dfa)
           goto normal_char;
 
         case '<':
-          if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
+          if (backslash && !(dfa->syntax.syntax_bits & RE_NO_GNU_OPS))
             return dfa->lexstate.lasttok = BEGWORD;
           goto normal_char;
 
         case '>':
-          if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
+          if (backslash && !(dfa->syntax.syntax_bits & RE_NO_GNU_OPS))
             return dfa->lexstate.lasttok = ENDWORD;
           goto normal_char;
 
         case 'b':
-          if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
+          if (backslash && !(dfa->syntax.syntax_bits & RE_NO_GNU_OPS))
             return dfa->lexstate.lasttok = LIMWORD;
           goto normal_char;
 
         case 'B':
-          if (backslash && !(syntax_bits & RE_NO_GNU_OPS))
+          if (backslash && !(dfa->syntax.syntax_bits & RE_NO_GNU_OPS))
             return dfa->lexstate.lasttok = NOTLIMWORD;
           goto normal_char;
 
         case '?':
-          if (syntax_bits & RE_LIMITED_OPS)
+          if (dfa->syntax.syntax_bits & RE_LIMITED_OPS)
             goto normal_char;
-          if (backslash != ((syntax_bits & RE_BK_PLUS_QM) != 0))
+          if (backslash != ((dfa->syntax.syntax_bits & RE_BK_PLUS_QM) != 0))
             goto normal_char;
-          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && dfa->lexstate.laststart)
+          if (!(dfa->syntax.syntax_bits & RE_CONTEXT_INDEP_OPS)
+              && dfa->lexstate.laststart)
             goto normal_char;
           return dfa->lexstate.lasttok = QMARK;
 
         case '*':
           if (backslash)
             goto normal_char;
-          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && dfa->lexstate.laststart)
+          if (!(dfa->syntax.syntax_bits & RE_CONTEXT_INDEP_OPS)
+              && dfa->lexstate.laststart)
             goto normal_char;
           return dfa->lexstate.lasttok = STAR;
 
         case '+':
-          if (syntax_bits & RE_LIMITED_OPS)
+          if (dfa->syntax.syntax_bits & RE_LIMITED_OPS)
             goto normal_char;
-          if (backslash != ((syntax_bits & RE_BK_PLUS_QM) != 0))
+          if (backslash != ((dfa->syntax.syntax_bits & RE_BK_PLUS_QM) != 0))
             goto normal_char;
-          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && dfa->lexstate.laststart)
+          if (!(dfa->syntax.syntax_bits & RE_CONTEXT_INDEP_OPS)
+              && dfa->lexstate.laststart)
             goto normal_char;
           return dfa->lexstate.lasttok = PLUS;
 
         case '{':
-          if (!(syntax_bits & RE_INTERVALS))
+          if (!(dfa->syntax.syntax_bits & RE_INTERVALS))
             goto normal_char;
-          if (backslash != ((syntax_bits & RE_NO_BK_BRACES) == 0))
+          if (backslash != ((dfa->syntax.syntax_bits & RE_NO_BK_BRACES) == 0))
             goto normal_char;
-          if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && dfa->lexstate.laststart)
+          if (!(dfa->syntax.syntax_bits & RE_CONTEXT_INDEP_OPS)
+              && dfa->lexstate.laststart)
             goto normal_char;
 
           /* Cases:
@@ -1439,7 +1451,7 @@ lex (struct dfa *dfa)
                    && (dfa->lexstate.maxrep < 0
                        || dfa->lexstate.minrep <= dfa->lexstate.maxrep)))
               {
-                if (syntax_bits & RE_INVALID_INTERVAL_ORD)
+                if (dfa->syntax.syntax_bits & RE_INVALID_INTERVAL_ORD)
                   goto normal_char;
                 dfaerror (_("invalid content of \\{\\}"));
               }
@@ -1452,32 +1464,32 @@ lex (struct dfa *dfa)
           return dfa->lexstate.lasttok = REPMN;
 
         case '|':
-          if (syntax_bits & RE_LIMITED_OPS)
+          if (dfa->syntax.syntax_bits & RE_LIMITED_OPS)
             goto normal_char;
-          if (backslash != ((syntax_bits & RE_NO_BK_VBAR) == 0))
+          if (backslash != ((dfa->syntax.syntax_bits & RE_NO_BK_VBAR) == 0))
             goto normal_char;
           dfa->lexstate.laststart = true;
           return dfa->lexstate.lasttok = OR;
 
         case '\n':
-          if (syntax_bits & RE_LIMITED_OPS
-              || backslash || !(syntax_bits & RE_NEWLINE_ALT))
+          if (dfa->syntax.syntax_bits & RE_LIMITED_OPS
+              || backslash || !(dfa->syntax.syntax_bits & RE_NEWLINE_ALT))
             goto normal_char;
           dfa->lexstate.laststart = true;
           return dfa->lexstate.lasttok = OR;
 
         case '(':
-          if (backslash != ((syntax_bits & RE_NO_BK_PARENS) == 0))
+          if (backslash != ((dfa->syntax.syntax_bits & RE_NO_BK_PARENS) == 0))
             goto normal_char;
           ++dfa->lexstate.parens;
           dfa->lexstate.laststart = true;
           return dfa->lexstate.lasttok = LPAREN;
 
         case ')':
-          if (backslash != ((syntax_bits & RE_NO_BK_PARENS) == 0))
+          if (backslash != ((dfa->syntax.syntax_bits & RE_NO_BK_PARENS) == 0))
             goto normal_char;
           if (dfa->lexstate.parens == 0
-              && syntax_bits & RE_UNMATCHED_RIGHT_PAREN_ORD)
+              && dfa->syntax.syntax_bits & RE_UNMATCHED_RIGHT_PAREN_ORD)
             goto normal_char;
           --dfa->lexstate.parens;
           dfa->lexstate.laststart = false;
@@ -1495,16 +1507,16 @@ lex (struct dfa *dfa)
             }
           zeroset (ccl);
           notset (ccl);
-          if (!(syntax_bits & RE_DOT_NEWLINE))
+          if (!(dfa->syntax.syntax_bits & RE_DOT_NEWLINE))
             clrbit ('\n', ccl);
-          if (syntax_bits & RE_DOT_NOT_NULL)
+          if (dfa->syntax.syntax_bits & RE_DOT_NOT_NULL)
             clrbit ('\0', ccl);
           dfa->lexstate.laststart = false;
           return dfa->lexstate.lasttok = CSET + dfa_charclass_index (dfa, ccl);
 
         case 's':
         case 'S':
-          if (!backslash || (syntax_bits & RE_NO_GNU_OPS))
+          if (!backslash || (dfa->syntax.syntax_bits & RE_NO_GNU_OPS))
             goto normal_char;
           if (!dfa->multibyte)
             {
@@ -1536,7 +1548,7 @@ lex (struct dfa *dfa)
 
         case 'w':
         case 'W':
-          if (!backslash || (syntax_bits & RE_NO_GNU_OPS))
+          if (!backslash || (dfa->syntax.syntax_bits & RE_NO_GNU_OPS))
             goto normal_char;
 
           if (!dfa->multibyte)
@@ -1581,7 +1593,7 @@ lex (struct dfa *dfa)
           if (dfa->multibyte)
             return dfa->lexstate.lasttok = WCHAR;
 
-          if (case_fold && isalpha (c))
+          if (dfa->syntax.case_fold && isalpha (c))
             {
               zeroset (ccl);
               setbit_case_fold_c (c, ccl);
@@ -1741,9 +1753,9 @@ add_utf8_anychar (struct dfa *dfa)
         copyset (utf8_classes[i], c);
         if (i == 1)
           {
-            if (!(syntax_bits & RE_DOT_NEWLINE))
+            if (!(dfa->syntax.syntax_bits & RE_DOT_NEWLINE))
               clrbit ('\n', c);
-            if (syntax_bits & RE_DOT_NOT_NULL)
+            if (dfa->syntax.syntax_bits & RE_DOT_NOT_NULL)
               clrbit ('\0', c);
           }
         dfa->utf8_anychar_classes[i] = CSET + dfa_charclass_index (dfa, c);
@@ -1815,7 +1827,7 @@ atom (struct dfa *dfa)
         {
           addtok_wc (dfa, dfa->lexstate.wctok);
 
-          if (case_fold)
+          if (dfa->syntax.case_fold)
             {
               wchar_t folded[CASE_FOLDED_BUFSIZE];
               unsigned int i, n = case_folded_counterparts (dfa->lexstate.wctok,
@@ -1985,7 +1997,7 @@ dfaparse (char const *s, size_t len, struct dfa *d)
       memset (&d->mbs, 0, sizeof d->mbs);
     }
 
-  if (!syntax_bits_set)
+  if (!d->syntax.syntax_bits_set)
     dfaerror (_("no syntax specified"));
 
   d->parsestate.tok = lex (d);
@@ -2271,19 +2283,19 @@ epsclosure (position_set *s, struct dfa const *d, char *visited)
    character included in C.  */
 
 static int
-charclass_context (charclass c)
+charclass_context (struct dfa *dfa, charclass c)
 {
   int context = 0;
   unsigned int j;
 
-  if (tstbit (eolbyte, c))
+  if (tstbit (dfa->syntax.eolbyte, c))
     context |= CTX_NEWLINE;
 
   for (j = 0; j < CHARCLASS_WORDS; ++j)
     {
-      if (c[j] & letters[j])
+      if (c[j] & dfa->syntax.letters[j])
         context |= CTX_LETTER;
-      if (c[j] & ~(letters[j] | newline[j]))
+      if (c[j] & ~(dfa->syntax.letters[j] | dfa->syntax.newline[j]))
         context |= CTX_NONE;
     }
 
@@ -2678,15 +2690,15 @@ dfastate (state_num s, struct dfa *d, state_num trans[])
           if (!SUCCEEDS_IN_CONTEXT (pos.constraint,
                                     d->states[s].context, CTX_NEWLINE))
             for (j = 0; j < CHARCLASS_WORDS; ++j)
-              matches[j] &= ~newline[j];
+              matches[j] &= ~d->syntax.newline[j];
           if (!SUCCEEDS_IN_CONTEXT (pos.constraint,
                                     d->states[s].context, CTX_LETTER))
             for (j = 0; j < CHARCLASS_WORDS; ++j)
-              matches[j] &= ~letters[j];
+              matches[j] &= ~d->syntax.letters[j];
           if (!SUCCEEDS_IN_CONTEXT (pos.constraint,
                                     d->states[s].context, CTX_NONE))
             for (j = 0; j < CHARCLASS_WORDS; ++j)
-              matches[j] &= letters[j] | newline[j];
+              matches[j] &= d->syntax.letters[j] | d->syntax.newline[j];
 
           /* If there are no characters left, there's no point in going on.  */
           for (j = 0; j < CHARCLASS_WORDS && !matches[j]; ++j)
@@ -2792,7 +2804,7 @@ dfastate (state_num s, struct dfa *d, state_num trans[])
 
       for (i = 0; i < NOTCHAR; ++i)
         trans[i] = unibyte_word_constituent (i) ? state_letter : state;
-      trans[eolbyte] = state_newline;
+      trans[d->syntax.eolbyte] = state_newline;
     }
   else
     for (i = 0; i < NOTCHAR; ++i)
@@ -2848,7 +2860,7 @@ dfastate (state_num s, struct dfa *d, state_num trans[])
         }
 
       /* Find out if the new state will want any context information.  */
-      possible_contexts = charclass_context (labels[i]);
+      possible_contexts = charclass_context (d, labels[i]);
       separate_contexts = state_separate_contexts (&follows);
 
       /* Find the state(s) corresponding to the union of the follows.  */
@@ -2895,7 +2907,7 @@ dfastate (state_num s, struct dfa *d, state_num trans[])
             {
               int c = j * CHARCLASS_WORD_BITS + k;
 
-              if (c == eolbyte)
+              if (c == d->syntax.eolbyte)
                 trans[c] = state_newline;
               else if (unibyte_word_constituent (c))
                 trans[c] = state_letter;
@@ -3021,8 +3033,8 @@ build_state (state_num s, struct dfa *d)
 
   /* Keep the newline transition in a special place so we can use it as
      a sentinel.  */
-  d->newlines[s] = trans[eolbyte];
-  trans[eolbyte] = -1;
+  d->newlines[s] = trans[d->syntax.eolbyte];
+  trans[d->syntax.eolbyte] = -1;
 
   if (ACCEPTING (s, *d))
     d->fails[s] = trans;
@@ -3041,7 +3053,7 @@ transit_state_singlebyte (struct dfa *d, state_num s, unsigned char const **pp)
 {
   state_num *t;
 
-  if (**pp == eolbyte)
+  if (**pp == d->syntax.eolbyte)
     {
       /* S is always an initial state in transit_state, so the
          transition table for the state must have been built already.  */
@@ -3084,7 +3096,7 @@ transit_state (struct dfa *d, state_num s, unsigned char const **pp,
   size_t i, j;
 
   int mbclen = mbs_to_wchar (&wc, (char const *) *pp, end - *pp, d);
-  int context = wc == eolbyte ? CTX_NEWLINE : CTX_NONE;
+  int context = wc == d->syntax.eolbyte ? CTX_NEWLINE : CTX_NONE;
   bool context_newline = context == CTX_NEWLINE;
 
   /* This state has some operators which can match a multibyte character.  */
@@ -3202,7 +3214,7 @@ skip_remains_mb (struct dfa *d, unsigned char const *p,
                  unsigned char const *mbp, char const *end, wint_t *wcp)
 {
   wint_t wc = WEOF;
-  if (never_trail[*p])
+  if (d->syntax.never_trail[*p])
     return p;
   while (mbp < p)
     mbp += mbs_to_wchar (&wc, (char const *) mbp,
@@ -3240,7 +3252,7 @@ dfaexec_main (struct dfa *d, char const *begin, char *end, bool allow_nl,
   unsigned char const *p, *mbp; /* Current input character.  */
   state_num **trans, *t;        /* Copy of d->trans so it can be optimized
                                    into a register.  */
-  unsigned char eol = eolbyte;  /* Likewise for eolbyte.  */
+  unsigned char eol = d->syntax.eolbyte;  /* Likewise for eolbyte.  */
   unsigned char saved_end;
   size_t nlcount = 0;
 
@@ -3307,8 +3319,8 @@ dfaexec_main (struct dfa *d, char const *begin, char *end, bool allow_nl,
                 }
 
               if (d->states[s].mbps.nelem == 0 || (*p == eol && !allow_nl)
-                  || (*p == '\n' && !(syntax_bits & RE_DOT_NEWLINE))
-                  || (*p == '\0' && (syntax_bits & RE_DOT_NOT_NULL))
+                  || (*p == '\n' && !(d->syntax.syntax_bits & RE_DOT_NEWLINE))
+                  || (*p == '\0' && (d->syntax.syntax_bits & RE_DOT_NOT_NULL))
                   || (char *) p >= end)
                 {
                   /* If an input character does not match ANYCHAR, do it
@@ -3371,14 +3383,14 @@ dfaexec_main (struct dfa *d, char const *begin, char *end, bool allow_nl,
         }
       else if (d->fails[s])
         {
-          if (d->success[s] & sbit[*p])
+          if (d->success[s] & d->syntax.sbit[*p])
             goto done;
 
           s1 = s;
           if (!multibyte || d->states[s].mbps.nelem == 0
               || (*p == eol && !allow_nl)
-              || (*p == '\n' && !(syntax_bits & RE_DOT_NEWLINE))
-              || (*p == '\0' && (syntax_bits & RE_DOT_NOT_NULL))
+              || (*p == '\n' && !(d->syntax.syntax_bits & RE_DOT_NEWLINE))
+              || (*p == '\0' && (d->syntax.syntax_bits & RE_DOT_NOT_NULL))
               || (char *) p >= end)
             {
               /* If a input character does not match ANYCHAR, do it
@@ -3480,18 +3492,6 @@ free_mbdata (struct dfa *d)
     }
 }
 
-/* Initialize the components of a dfa that the other routines don't
-   initialize for themselves.  */
-static void
-dfainit (struct dfa *d)
-{
-  memset (d, 0, sizeof *d);
-  d->multibyte = MB_CUR_MAX > 1;
-  d->dfaexec = d->multibyte ? dfaexec_mb : dfaexec_sb;
-  d->fast = !d->multibyte;
-  d->lexstate.cur_mb_len = 1;
-}
-
 /* Return true if every construct in D is supported by this DFA matcher.  */
 static bool _GL_ATTRIBUTE_PURE
 dfa_supported (struct dfa const *d)
@@ -3642,7 +3642,6 @@ dfassbuild (struct dfa *d)
 void
 dfacomp (char const *s, size_t len, struct dfa *d, bool searchflag)
 {
-  dfainit (d);
   dfaparse (s, len, d);
   dfassbuild (d);
 
@@ -3958,7 +3957,7 @@ dfamust (struct dfa const *d)
   bool endline = false;
   bool need_begline = false;
   bool need_endline = false;
-  bool case_fold_unibyte = case_fold && MB_CUR_MAX == 1;
+  bool case_fold_unibyte = d->syntax.case_fold && MB_CUR_MAX == 1;
 
   for (ri = 0; ri < d->tindex; ++ri)
     {
@@ -4194,7 +4193,12 @@ dfamustfree (struct dfamust *dm)
 struct dfa *
 dfaalloc (void)
 {
-  return xmalloc (sizeof (struct dfa));
+  struct dfa *d = xcalloc (1, sizeof (struct dfa));
+  d->multibyte = MB_CUR_MAX > 1;
+  d->dfaexec = d->multibyte ? dfaexec_mb : dfaexec_sb;
+  d->fast = !d->multibyte;
+  d->lexstate.cur_mb_len = 1;
+  return d;
 }
 
 /* vim:set shiftwidth=2: */
diff --git a/src/dfa.h b/src/dfa.h
index 60da0e4..014ae96 100644
--- a/src/dfa.h
+++ b/src/dfa.h
@@ -50,10 +50,10 @@ extern struct dfamust *dfamust (struct dfa const *);
 /* Free the storage held by the components of a struct dfamust. */
 extern void dfamustfree (struct dfamust *);
 
-/* dfasyntax() takes three arguments; the first sets the syntax bits described
-   earlier in this file, the second sets the case-folding flag, and the
-   third specifies the line terminator. */
-extern void dfasyntax (reg_syntax_t, bool, unsigned char);
+/* dfasyntax() takes four arguments; the first is the dfa to operate on, the
+   second sets the syntax bits described earlier in this file, the third sets
+   the case-folding flag, and the fourth specifies the line terminator. */
+extern void dfasyntax (struct dfa *, reg_syntax_t, bool, unsigned char);
 
 /* Compile the given string of the given length into the given struct dfa.
    Final argument is a flag specifying whether to build a searching or an
diff --git a/src/dfasearch.c b/src/dfasearch.c
index 222232c..3dbf76b 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -123,10 +123,12 @@ GEAcompile (char const *pattern, size_t size, reg_syntax_t syntax_bits)
   size_t total = size;
   char *motif;
 
+  dfa = dfaalloc ();
+
   if (match_icase)
     syntax_bits |= RE_ICASE;
   re_set_syntax (syntax_bits);
-  dfasyntax (syntax_bits, match_icase, eolbyte);
+  dfasyntax (dfa, syntax_bits, match_icase, eolbyte);
 
   /* For GNU regex, pass the patterns separately to detect errors like
      "[\nallo\n]\n", where the patterns are "[", "allo" and "]", and
@@ -206,7 +208,6 @@ GEAcompile (char const *pattern, size_t size, reg_syntax_t syntax_bits)
   else
     motif = NULL;
 
-  dfa = dfaalloc ();
   dfacomp (pattern, size, dfa, 1);
   kwsmusts ();
 
diff --git a/tests/dfa-match-aux.c b/tests/dfa-match-aux.c
index af933ff..25b0535 100644
--- a/tests/dfa-match-aux.c
+++ b/tests/dfa-match-aux.c
@@ -54,8 +54,8 @@ main (int argc, char **argv)
 
   setlocale (LC_ALL, "");
 
-  dfasyntax (RE_SYNTAX_GREP | RE_NO_EMPTY_RANGES, 0, '\n');
   dfa = dfaalloc ();
+  dfasyntax (dfa, RE_SYNTAX_GREP | RE_NO_EMPTY_RANGES, 0, '\n');
   dfacomp (argv[1], strlen (argv[1]), dfa, 0);
 
   beg = argv[2];
-- 
2.8.1





Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Sun, 21 Aug 2016 00:53:01 GMT) Full text and rfc822 format available.

Notification sent to Zev Weiss <zev <at> bewilderbeest.net>:
bug acknowledged by developer. (Sun, 21 Aug 2016 00:53:01 GMT) Full text and rfc822 format available.

Message #25 received at 24259-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Zev Weiss <zev <at> bewilderbeest.net>
Cc: 24259-done <at> debbugs.gnu.org
Subject: Re: bug#24259: [PATCH 0/6] dfa: thread safety
Date: Sat, 20 Aug 2016 17:52:09 -0700
On Thu, Aug 18, 2016 at 3:50 AM, Zev Weiss <zev <at> bewilderbeest.net> wrote:
> Hello,
>
> Re: Jim Meyering's proposal to move grep's dfa.c into gnulib, I was
> hoping I might be able to sneak in the dfa-related parts of my
> multithreading patch set before it happens.
>
> These are all the patches from that set that touch src/dfa.[ch], and
> are all oriented toward making dfa.c thread-safe (which seems like it
> might be a desirable thing for librarification anyway).  Starting with
> the fourth patch there are some slight changes to the external API, so
> other projects using dfa.c may need some minor corresponding
> adjustments as a result (but nothing, I think, too onerous).  The
> first three should have no impact on any users, however.
>
> (The bulk of the changes in these patches are fairly mechanical; for
> some of them something like `git show --color-words' on a local branch
> might make the review process slightly less tedious.)
>
> Combined diffstat for these six follows.
>
> Thanks,
> Zev Weiss
>
>  src/dfa.c             | 854 ++++++++++++++++++++++--------------------
>  src/dfa.h             |  13 +-
>  src/dfasearch.c       |   7 +-
>  src/grep.c            |   2 +
>  src/kwsearch.c        |   2 +-
>  src/pcresearch.c      |   2 +-
>  tests/dfa-match-aux.c |   4 +-

Thank you for separating those so nicely.
I have pushed your patches 2..6, now, so will close this.
The only changes I made were to the formatting and member ordering of
struct lexer_state. At first glance, I thought reordering would save
some space, but ended up with a size of 56 bytes on x86_64 regardless.
However, I did move both "int cur_mb_len" and "bool laststart" to the
end, and note that changing the type of cur_mb_len to short int
(perhaps in a later patch) will reduce the size of that struct to 48.




Information forwarded to bug-grep <at> gnu.org:
bug#24259; Package grep. (Sun, 21 Aug 2016 01:18:01 GMT) Full text and rfc822 format available.

Message #28 received at 24259-done <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: Jim Meyering <jim <at> meyering.net>
Cc: 24259-done <at> debbugs.gnu.org
Subject: Re: bug#24259: [PATCH 0/6] dfa: thread safety
Date: Sat, 20 Aug 2016 20:17:19 -0500
[Message part 1 (text/plain, inline)]
On Sat, Aug 20, 2016 at 05:52:09PM -0700, Jim Meyering wrote:
>On Thu, Aug 18, 2016 at 3:50 AM, Zev Weiss <zev <at> bewilderbeest.net> wrote:
>> Hello,
<snip>
>
>Thank you for separating those so nicely.
>I have pushed your patches 2..6, now, so will close this.
>The only changes I made were to the formatting and member ordering of
>struct lexer_state. At first glance, I thought reordering would save
>some space, but ended up with a size of 56 bytes on x86_64 regardless.
>However, I did move both "int cur_mb_len" and "bool laststart" to the
>end, and note that changing the type of cur_mb_len to short int
>(perhaps in a later patch) will reduce the size of that struct to 48.

Thanks, Jim.

Also, after your comments on the first patch I noticed there were a 
couple more function parameters that could have been const in the 
subsequent patches as well; the attached patch marks them as such.


Zev

[0001-dfa-constify-some-function-parameters.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#24259; Package grep. (Sun, 21 Aug 2016 01:22:02 GMT) Full text and rfc822 format available.

Message #31 received at 24259-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Zev Weiss <zev <at> bewilderbeest.net>
Cc: 24259-done <at> debbugs.gnu.org
Subject: Re: bug#24259: [PATCH 0/6] dfa: thread safety
Date: Sat, 20 Aug 2016 18:20:44 -0700
On Sat, Aug 20, 2016 at 6:17 PM, Zev Weiss <zev <at> bewilderbeest.net> wrote:
> On Sat, Aug 20, 2016 at 05:52:09PM -0700, Jim Meyering wrote:
>>
>> On Thu, Aug 18, 2016 at 3:50 AM, Zev Weiss <zev <at> bewilderbeest.net> wrote:
>>>
>>> Hello,
>
> <snip>
>>
>>
>> Thank you for separating those so nicely.
>> I have pushed your patches 2..6, now, so will close this.
>> The only changes I made were to the formatting and member ordering of
>> struct lexer_state. At first glance, I thought reordering would save
>> some space, but ended up with a size of 56 bytes on x86_64 regardless.
>> However, I did move both "int cur_mb_len" and "bool laststart" to the
>> end, and note that changing the type of cur_mb_len to short int
>> (perhaps in a later patch) will reduce the size of that struct to 48.
>
>
> Thanks, Jim.
>
> Also, after your comments on the first patch I noticed there were a couple
> more function parameters that could have been const in the subsequent
> patches as well; the attached patch marks them as such.

Thanks!  Pushed.




Information forwarded to bug-grep <at> gnu.org:
bug#24259; Package grep. (Tue, 23 Aug 2016 08:22:02 GMT) Full text and rfc822 format available.

Message #34 received at 24259 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zev Weiss <zev <at> bewilderbeest.net>, 24259 <at> debbugs.gnu.org
Subject: Re: bug#24259: [PATCH 0/6] dfa: thread safety
Date: Tue, 23 Aug 2016 01:21:42 -0700
[Message part 1 (text/plain, inline)]
Thanks for doing all that; the changes all look good. I found a few minor issues 
with the changes and installed the attached cleanup patch. Among other things, 
this cleanup abbreviates some member names from FOOstate to FOO, because they 
are used so often in the code (and are private to this module anyway) that the 
repeated "state"s made it a bit harder to read.
[0001-dfa-minor-thread-safety-cleanups.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#24259; Package grep. (Thu, 01 Sep 2016 03:23:02 GMT) Full text and rfc822 format available.

Message #37 received at 24259 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zev Weiss <zev <at> bewilderbeest.net>, 24259 <at> debbugs.gnu.org
Subject: Re: bug#24259: [PATCH 0/6] dfa: thread safety
Date: Wed, 31 Aug 2016 20:22:36 -0700
[Message part 1 (text/plain, inline)]
Following up further on this, I installed the attached patch into the grep 
master on Savannah. This patch shouldn't affect grep's behavior, or 
significantly affect its efficiency. The idea is to make the DFA code usable in 
multilocale apps, plus it should make the code a bit cleaner even in a 
single-locale environment.
[0001-dfa-make-dfa.c-fully-thread-safe.txt (text/plain, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 29 Sep 2016 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 7 years and 183 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.