GNU bug report logs - #17328
dfa.c will fail if used on more than one DFA

Previous Next

Package: grep;

Reported by: Aharon Robbins <arnold <at> skeeve.com>

Date: Wed, 23 Apr 2014 18:37:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 17328 in the body.
You can then email your comments to 17328 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#17328; Package grep. (Wed, 23 Apr 2014 18:37:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Aharon Robbins <arnold <at> skeeve.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 23 Apr 2014 18:37:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Aharon Robbins <arnold <at> skeeve.com>
To: bug-grep <at> gnu.org
Subject: dfa.c will fail if used on more than one DFA
Date: Wed, 23 Apr 2014 21:36:11 +0300
Hello.

There is a built-in assumption that the routines in dfa.c will only
be used on one struct dfa.  This is true for grep but not true for gawk.
I found this when trying to update gawk's dfa with all the changes that
have been coming through.  The patch is below.

Thanks,

Arnold
-----------------------
diff --git a/src/dfa.c b/src/dfa.c
index 65fc03d..85ab9bd 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -3244,15 +3244,10 @@ dfaexec (struct dfa *d, char const *begin, char *end,
 
   if (d->mb_cur_max > 1)
     {
-      static bool mb_alloc = false;
       memset (&mbs, 0, sizeof (mbstate_t));
-      if (!mb_alloc)
-        {
-          d->mb_match_lens = xnmalloc (d->nleaves, sizeof *d->mb_match_lens);
-          d->mb_follows = xmalloc (sizeof *d->mb_follows);
-          alloc_position_set (d->mb_follows, d->nleaves);
-          mb_alloc = true;
-        }
+      d->mb_match_lens = xnmalloc (d->nleaves, sizeof *d->mb_match_lens);
+      d->mb_follows = xmalloc (sizeof *d->mb_follows);
+      alloc_position_set (d->mb_follows, d->nleaves);
     }
 
   for (;;)
@@ -3434,8 +3429,13 @@ free_mbdata (struct dfa *d)
     {
       free (d->mb_follows->elems);
       free (d->mb_follows);
+      d->mb_follows = NULL;
+    }
+  if (d->mb_match_lens)
+    {
+      free (d->mb_match_lens);
+      d->mb_match_lens = NULL;
     }
-  free (d->mb_match_lens);
 }
 
 /* Initialize the components of a dfa that the other routines don't




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Wed, 23 Apr 2014 19:50:03 GMT) Full text and rfc822 format available.

Notification sent to Aharon Robbins <arnold <at> skeeve.com>:
bug acknowledged by developer. (Wed, 23 Apr 2014 19:50:04 GMT) Full text and rfc822 format available.

Message #10 received at 17328-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Aharon Robbins <arnold <at> skeeve.com>, 17328-done <at> debbugs.gnu.org
Subject: Re: bug#17328: dfa.c will fail if used on more than one DFA
Date: Wed, 23 Apr 2014 12:49:29 -0700
[Message part 1 (text/plain, inline)]
Thanks for reporting that.  That static variable was the result of a 
recent optimization.  I guess we'll need to optimize in a different 
way.  I noticed another static variable that dfaexec also uses, namely 
'mbs'; this needs to be moved into the struct dfa, for the benefit of 
any applications that really need stateful encodings.  A bonus is that I 
expect this'll make the dfa code run a bit faster. I installed the 
attached patch, which I hope addresses the issues you raised along with 
the mbs issue.

The code still needs more work in this area.  There shouldn't be any 
static variables at all, even when parsing, though this would change the 
API.  And the code isn't consistent about referring to dfa->mb_cur_max 
versus MB_CUR_MAX; not sure why that is.
[0001-dfa-omit-static-variables-that-limited-dfaexec-to-on.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17328; Package grep. (Wed, 23 Apr 2014 22:48:02 GMT) Full text and rfc822 format available.

Message #13 received at 17328-done <at> debbugs.gnu.org (full text, mbox):

From: behoffski <behoffski <at> grouse.com.au>
To: 17328-done <at> debbugs.gnu.org
Subject: Re: bug#17328: dfa.c will fail if used on more than one DFA
Date: Thu, 24 Apr 2014 08:17:12 +0930
On 04/24/14 05:19, Paul Eggert wrote:
> Thanks for reporting that.  That static variable was the result of a recent optimization.  I guess we'll need to optimize in a different way.  I noticed another static variable that dfaexec also uses, namely 'mbs'; this needs to be moved into the struct dfa, for the benefit of any applications that really need stateful encodings.  A bonus is that I expect this'll make the dfa code run a bit faster. I installed the attached patch, which I hope addresses the issues you raised along with the mbs issue.
>
> The code still needs more work in this area.  There shouldn't be any static variables at all, even when parsing, though this would change the API.  And the code isn't consistent about referring to dfa->mb_cur_max versus MB_CUR_MAX; not sure why that is.

The modules created by my "untangle" script are working towards this goal:
Almost all code in fsalex, fsaparse, fsatoken and fsamusts have *no*
static variables (I found a couple of exceptions, e.g. fsalex has an
initialise-once-only static variable in function using_simple_locale,
but this is easily fixed).  The context is explicit in the API:  An
opaque pointer to outsiders, but expanding to a struct pointer internally.

Charclass has one hidden static pointer (a list of pools of classes), but
is explicitly set up to allow multiple lexers/parsers etc to exist in
parallel.  This list is not protected by mutexes, and so is not
thread-safe, but otherwise is capable of handling multiple clients in
parallel (including reusing charclasses across clients).

Fsalex goes even further, and tries to set itself up to be correct even
if the locale is changed later:  It explicitly mines (part of) the
current locale for information when fsalex_syntax () is called, and uses
that thereafter.  Clients dealing with token streams from fsalex, such
as fsaparse, need to depend on the locale snapshot of the lexer in order
to make correct decisions:  An explicit API that I omitted to mention
earlier, "proto-lexparse.h", lays out the protocol for information
exchange between any lexer and any parser in a module-agnostic fashion.
The interface is incomplete, but currently includes the following
opcodes and a generic function to exchange information:

typedef enum proto_lexparse_opcode_enum
{
  PROTO_LEXPARSE_OP_GET_LOCALE,
  PROTO_LEXPARSE_OP_GET_IS_MULTIBYTE_ENV,
  PROTO_LEXPARSE_OP_GET_REPMN_MIN,
  PROTO_LEXPARSE_OP_GET_REPMN_MAX,
  PROTO_LEXPARSE_OP_GET_WIDE_CHAR,
  PROTO_LEXPARSE_OP_GET_DOTCLASS,
} proto_lexparse_opcode_t;

typedef int proto_lexparse_exchange_fn_t (void *lexer_context,
                                          proto_lexparse_opcode_t opcode,
                                          void *parameter);

My long-term hope is that the need for this "exchange" function
would dwindle when the token stream delivered by the lexer is reworked
to deliver information more directly that it does at present, but this
is a non-trivial job, and can wait until later.

At present, the untangle code focuses on the tokens/class/lex/parse/musts
code, and does not try to break out the remaining high-level dfa code
at present.  Having a "dfa->" context reference for the remaining code
should be easy, as this code is already set up for multiple instances in
the original.

--------

I will rebase the untangle script sometime in the next week; I'm
waiting for things to settle down a bit before undertaking this work,
as rebasing the script, while automated to a fair extent, is still a
considerable effort.  As the gap between my reworking of the code and
the current dfa.c code grows, the effort needed to analyse and possibly
reinterpret modified code in a different form to fit into the
"untangle"d layout grows.

Finally, of course, I've been laying low, trying to let people sort
out the next release without having the untangle code as an immediate
distraction.  I've broken my silence here, as the comments regarding
both performance and the desire to eliminate static variables align
very closely with the effort I've already expended writing "untangle",

cheers,

behoffski (Brenton Hoff)
Programmer, Grouse Software





Information forwarded to bug-grep <at> gnu.org:
bug#17328; Package grep. (Thu, 24 Apr 2014 00:17:02 GMT) Full text and rfc822 format available.

Message #16 received at 17328 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 17328 <at> debbugs.gnu.org,
 eggert <at> cs.ucla.edu,
 arnold <at> skeeve.com
Subject: bug#17328: dfa.c will fail if used on more than one DFA
Date: Thu, 24 Apr 2014 09:16:19 +0900
[Message part 1 (text/plain, inline)]
I confirmed memory leak by below after omit-static-variables.

$ yes abcdabc | head -50000000 >k
$ env LC_ALL=ja_JP.eucJP src/grep -v abcd.bc k

I send the patch to fix it.
[patch.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#17328; Package grep. (Thu, 24 Apr 2014 06:24:02 GMT) Full text and rfc822 format available.

Message #19 received at 17328 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 17328 <at> debbugs.gnu.org, 
 arnold <at> skeeve.com
Subject: Re: bug#17328: dfa.c will fail if used on more than one DFA
Date: Wed, 23 Apr 2014 23:23:20 -0700
[Message part 1 (text/plain, inline)]
Thanks.  We can improve on that a bit, by using one of the pointers to 
indicate whether the storage is allocated, instead of requiring an extra 
boolean for this information.  I pushed the attached patch.
[0001-dfa-fix-memory-leak-reintroduced-by-previous-patch.patch (text/plain, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 22 May 2014 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 9 years and 344 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.