GNU bug report logs -
#17328
dfa.c will fail if used on more than one DFA
Previous Next
Reported by: Aharon Robbins <arnold <at> skeeve.com>
Date: Wed, 23 Apr 2014 18:37:02 UTC
Severity: normal
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 17328 in the body.
You can then email your comments to 17328 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#17328
; Package
grep
.
(Wed, 23 Apr 2014 18:37:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Aharon Robbins <arnold <at> skeeve.com>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Wed, 23 Apr 2014 18:37:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello.
There is a built-in assumption that the routines in dfa.c will only
be used on one struct dfa. This is true for grep but not true for gawk.
I found this when trying to update gawk's dfa with all the changes that
have been coming through. The patch is below.
Thanks,
Arnold
-----------------------
diff --git a/src/dfa.c b/src/dfa.c
index 65fc03d..85ab9bd 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -3244,15 +3244,10 @@ dfaexec (struct dfa *d, char const *begin, char *end,
if (d->mb_cur_max > 1)
{
- static bool mb_alloc = false;
memset (&mbs, 0, sizeof (mbstate_t));
- if (!mb_alloc)
- {
- d->mb_match_lens = xnmalloc (d->nleaves, sizeof *d->mb_match_lens);
- d->mb_follows = xmalloc (sizeof *d->mb_follows);
- alloc_position_set (d->mb_follows, d->nleaves);
- mb_alloc = true;
- }
+ d->mb_match_lens = xnmalloc (d->nleaves, sizeof *d->mb_match_lens);
+ d->mb_follows = xmalloc (sizeof *d->mb_follows);
+ alloc_position_set (d->mb_follows, d->nleaves);
}
for (;;)
@@ -3434,8 +3429,13 @@ free_mbdata (struct dfa *d)
{
free (d->mb_follows->elems);
free (d->mb_follows);
+ d->mb_follows = NULL;
+ }
+ if (d->mb_match_lens)
+ {
+ free (d->mb_match_lens);
+ d->mb_match_lens = NULL;
}
- free (d->mb_match_lens);
}
/* Initialize the components of a dfa that the other routines don't
Reply sent
to
Paul Eggert <eggert <at> cs.ucla.edu>
:
You have taken responsibility.
(Wed, 23 Apr 2014 19:50:03 GMT)
Full text and
rfc822 format available.
Notification sent
to
Aharon Robbins <arnold <at> skeeve.com>
:
bug acknowledged by developer.
(Wed, 23 Apr 2014 19:50:04 GMT)
Full text and
rfc822 format available.
Message #10 received at 17328-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Thanks for reporting that. That static variable was the result of a
recent optimization. I guess we'll need to optimize in a different
way. I noticed another static variable that dfaexec also uses, namely
'mbs'; this needs to be moved into the struct dfa, for the benefit of
any applications that really need stateful encodings. A bonus is that I
expect this'll make the dfa code run a bit faster. I installed the
attached patch, which I hope addresses the issues you raised along with
the mbs issue.
The code still needs more work in this area. There shouldn't be any
static variables at all, even when parsing, though this would change the
API. And the code isn't consistent about referring to dfa->mb_cur_max
versus MB_CUR_MAX; not sure why that is.
[0001-dfa-omit-static-variables-that-limited-dfaexec-to-on.patch (text/x-patch, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#17328
; Package
grep
.
(Wed, 23 Apr 2014 22:48:02 GMT)
Full text and
rfc822 format available.
Message #13 received at 17328-done <at> debbugs.gnu.org (full text, mbox):
On 04/24/14 05:19, Paul Eggert wrote:
> Thanks for reporting that. That static variable was the result of a recent optimization. I guess we'll need to optimize in a different way. I noticed another static variable that dfaexec also uses, namely 'mbs'; this needs to be moved into the struct dfa, for the benefit of any applications that really need stateful encodings. A bonus is that I expect this'll make the dfa code run a bit faster. I installed the attached patch, which I hope addresses the issues you raised along with the mbs issue.
>
> The code still needs more work in this area. There shouldn't be any static variables at all, even when parsing, though this would change the API. And the code isn't consistent about referring to dfa->mb_cur_max versus MB_CUR_MAX; not sure why that is.
The modules created by my "untangle" script are working towards this goal:
Almost all code in fsalex, fsaparse, fsatoken and fsamusts have *no*
static variables (I found a couple of exceptions, e.g. fsalex has an
initialise-once-only static variable in function using_simple_locale,
but this is easily fixed). The context is explicit in the API: An
opaque pointer to outsiders, but expanding to a struct pointer internally.
Charclass has one hidden static pointer (a list of pools of classes), but
is explicitly set up to allow multiple lexers/parsers etc to exist in
parallel. This list is not protected by mutexes, and so is not
thread-safe, but otherwise is capable of handling multiple clients in
parallel (including reusing charclasses across clients).
Fsalex goes even further, and tries to set itself up to be correct even
if the locale is changed later: It explicitly mines (part of) the
current locale for information when fsalex_syntax () is called, and uses
that thereafter. Clients dealing with token streams from fsalex, such
as fsaparse, need to depend on the locale snapshot of the lexer in order
to make correct decisions: An explicit API that I omitted to mention
earlier, "proto-lexparse.h", lays out the protocol for information
exchange between any lexer and any parser in a module-agnostic fashion.
The interface is incomplete, but currently includes the following
opcodes and a generic function to exchange information:
typedef enum proto_lexparse_opcode_enum
{
PROTO_LEXPARSE_OP_GET_LOCALE,
PROTO_LEXPARSE_OP_GET_IS_MULTIBYTE_ENV,
PROTO_LEXPARSE_OP_GET_REPMN_MIN,
PROTO_LEXPARSE_OP_GET_REPMN_MAX,
PROTO_LEXPARSE_OP_GET_WIDE_CHAR,
PROTO_LEXPARSE_OP_GET_DOTCLASS,
} proto_lexparse_opcode_t;
typedef int proto_lexparse_exchange_fn_t (void *lexer_context,
proto_lexparse_opcode_t opcode,
void *parameter);
My long-term hope is that the need for this "exchange" function
would dwindle when the token stream delivered by the lexer is reworked
to deliver information more directly that it does at present, but this
is a non-trivial job, and can wait until later.
At present, the untangle code focuses on the tokens/class/lex/parse/musts
code, and does not try to break out the remaining high-level dfa code
at present. Having a "dfa->" context reference for the remaining code
should be easy, as this code is already set up for multiple instances in
the original.
--------
I will rebase the untangle script sometime in the next week; I'm
waiting for things to settle down a bit before undertaking this work,
as rebasing the script, while automated to a fair extent, is still a
considerable effort. As the gap between my reworking of the code and
the current dfa.c code grows, the effort needed to analyse and possibly
reinterpret modified code in a different form to fit into the
"untangle"d layout grows.
Finally, of course, I've been laying low, trying to let people sort
out the next release without having the untangle code as an immediate
distraction. I've broken my silence here, as the comments regarding
both performance and the desire to eliminate static variables align
very closely with the effort I've already expended writing "untangle",
cheers,
behoffski (Brenton Hoff)
Programmer, Grouse Software
Information forwarded
to
bug-grep <at> gnu.org
:
bug#17328
; Package
grep
.
(Thu, 24 Apr 2014 00:17:02 GMT)
Full text and
rfc822 format available.
Message #16 received at 17328 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
I confirmed memory leak by below after omit-static-variables.
$ yes abcdabc | head -50000000 >k
$ env LC_ALL=ja_JP.eucJP src/grep -v abcd.bc k
I send the patch to fix it.
[patch.txt (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#17328
; Package
grep
.
(Thu, 24 Apr 2014 06:24:02 GMT)
Full text and
rfc822 format available.
Message #19 received at 17328 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Thanks. We can improve on that a bit, by using one of the pointers to
indicate whether the storage is allocated, instead of requiring an extra
boolean for this information. I pushed the attached patch.
[0001-dfa-fix-memory-leak-reintroduced-by-previous-patch.patch (text/plain, attachment)]
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Thu, 22 May 2014 11:24:03 GMT)
Full text and
rfc822 format available.
This bug report was last modified 9 years and 344 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.