GNU bug report logs -
#30520
sed: silently fail substitution for 2GB+ single-line file
Previous Next
Reported by: YushiOMOTE <yushiomote <at> gmail.com>
Date: Mon, 19 Feb 2018 07:26:02 UTC
Severity: normal
Done: Jim Meyering <jim <at> meyering.net>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 30520 in the body.
You can then email your comments to 30520 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-sed <at> gnu.org
:
bug#30520
; Package
sed
.
(Mon, 19 Feb 2018 07:26:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
YushiOMOTE <yushiomote <at> gmail.com>
:
New bug report received and forwarded. Copy sent to
bug-sed <at> gnu.org
.
(Mon, 19 Feb 2018 07:26:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
The following behavior is observed with sed version 4.2.2 (x86_64 / CentOS
7.2).
(It's also observed with the current head of the git repo:
41e506e12b363628af29b089f55585b17e1389e6)
When I pass a 2GB+-single-line to sed for substitution, sed exits with
success status code but it silently fails substitution.
```
$ sed `s/aaa/bbb/g` 20GB+-single-line-file.txt > out.txt && echo OK
OK
(...but nothing gets substituted even though the target string exists)
```
In summary, the cause seems that sed isn't checking error return values
from `re_search <at> glibc`.
What is happening in detail seems as follows.
At the beginning of `do_subst` function (sed/execute.c),
```
/* The first part of the loop optimizes s/xxx// when xxx is at the
start, and s/xxx$// */
if (!match_regex(sub->regx, line.active, line.length, start,
®s, sub->max_id + 1))
return;
```
`line.length` (unsigned long), is set to the value larger than INT_MAX. And
when `match_regex` (sed/regexp.c) calls `re_search`,
```
ret = re_search (®ex->pattern, buf, buflen, buf_start_offset,
buflen - buf_start_offset,
regsize ? regarray : NULL);
```
`buflen` here is equal to `line.length` (the value larger than INT_MAX) but
the corresponding argument of `re_search` is `int`. The value is converted
to a negative number. Then, `re_search` returns -1 due to its error
checking. But the return value is not treated by sed and no errors are
raised.
A possible fix could be adding an error checking for `re_search` (like this
patch).
Or, we can check if the value of line length exceeds INT_MAX before
reaching `re_search`. (`grep` checks it and returns with error status code)
Best regards,
Yushi
[Message part 2 (text/html, inline)]
[0001-sed-check-errors-of-re_search.patch (application/x-patch, attachment)]
Information forwarded
to
bug-sed <at> gnu.org
:
bug#30520
; Package
sed
.
(Mon, 19 Feb 2018 20:33:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 30520 <at> debbugs.gnu.org (full text, mbox):
Hello,
On Mon, Feb 19, 2018 at 06:51:04AM +0000, YushiOMOTE wrote:
> When I pass a 2GB+-single-line to sed for substitution, sed exits with
> success status code but it silently fails substitution.
>
> ```
> $ sed `s/aaa/bbb/g` 20GB+-single-line-file.txt > out.txt && echo OK
> OK
> (...but nothing gets substituted even though the target string exists)
Thank you for reporting this issue, and providing such detailed
analysis. I am able to reproduce this issue.
Few minor comments:
> In summary, the cause seems that sed isn't checking error return values
> from `re_search <at> glibc`.
Technically, it's from gnulib, not glibc.
The code is in https://opengrok.housegordon.com/source/xref/gnulib/lib/regexec.c#369
the function 're_search_stub'.
> What is happening in detail seems as follows.
>
> At the beginning of `do_subst` function (sed/execute.c),
> [...]
> `line.length` (unsigned long), is set to the value larger than INT_MAX. And
> when `match_regex` (sed/regexp.c) calls `re_search`,
>
> ```
> ret = re_search (®ex->pattern, buf, buflen, buf_start_offset,
> buflen - buf_start_offset,
> regsize ? regarray : NULL);
> ```
>
> `buflen` here is equal to `line.length` (the value larger than INT_MAX) but
> the corresponding argument of `re_search` is `int`. The value is converted
> to a negative number. Then, `re_search` returns -1 due to its error
> checking. But the return value is not treated by sed and no errors are
> raised.
From a cursory look, it seems "re_search_stub" does not have error
checking for this specific case (length<0), and it continues with its
string search, but finds nothing - thus returning -1.
It returns "-1" both when the pattern is not found, and when there
are such value errors.
Later code in 'regexp.c:match_regex()' does check the returned value
like so:
return (ret > -1);
Therefore, "match_regex" returns FALSE in this case,
and "do_subst" simply concludes that the string was not found.
Sed then terminates without performing any replacement,
and without reporting any errors.
> A possible fix could be adding an error checking for `re_search` (like this
> patch).
I'm not sure this would work.
The code already checks for -1, which is used to indicate "pattern not found".
The gnulib implementation of "re_search" is described here:
https://www.gnu.org/software/gnulib/manual/html_node/GNU-Searching.html
It says:
"If no match is found, re_search returns -1.
If a match is found, it returns the index where the match began.
If an internal error happens, it returns -2."
However it seems that "internal errors" only relates to failed memory allocations.
Out of range values are treated by "re_search" as simple "pattern not found",
as evident by this check:
/* Check for out-of-range. */
if (BE (start < 0 || start > length, 0))
return -1;
In https://opengrok.housegordon.com/source/xref/gnulib/lib/regexec.c#381
> Or, we can check if the value of line length exceeds INT_MAX before
> reaching `re_search`. (`grep` checks it and returns with error status code)
This is likely the simplest solution for now, without changing gnulib itself.
regards,
- assaf
Information forwarded
to
bug-sed <at> gnu.org
:
bug#30520
; Package
sed
.
(Fri, 23 Mar 2018 00:13:01 GMT)
Full text and
rfc822 format available.
Message #11 received at 30520 <at> debbugs.gnu.org (full text, mbox):
On Mon, Feb 19, 2018 at 12:32 PM, Assaf Gordon <assafgordon <at> gmail.com> wrote:
> Hello,
>
> On Mon, Feb 19, 2018 at 06:51:04AM +0000, YushiOMOTE wrote:
>> When I pass a 2GB+-single-line to sed for substitution, sed exits with
>> success status code but it silently fails substitution.
...
>> Or, we can check if the value of line length exceeds INT_MAX before
>> reaching `re_search`. (`grep` checks it and returns with error status code)
>
> This is likely the simplest solution for now, without changing gnulib itself.
Thank you both. I too prefer this approach. Would you like to do that, Assaf?
Information forwarded
to
bug-sed <at> gnu.org
:
bug#30520
; Package
sed
.
(Fri, 23 Mar 2018 01:43:01 GMT)
Full text and
rfc822 format available.
Message #14 received at 30520 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Thu, Mar 22, 2018 at 05:12:11PM -0700, Jim Meyering wrote:
> On Mon, Feb 19, 2018 at 12:32 PM, Assaf Gordon <assafgordon <at> gmail.com> wrote:
> > On Mon, Feb 19, 2018 at 06:51:04AM +0000, YushiOMOTE wrote:
> >> When I pass a 2GB+-single-line to sed for substitution, sed exits with
> >> success status code but it silently fails substitution.
> ...
> >> Or, we can check if the value of line length exceeds INT_MAX before
> >> reaching `re_search`. (`grep` checks it and returns with error status code)
> >
> > This is likely the simplest solution for now, without changing gnulib itself.
>
> Thank you both. I too prefer this approach. Would you like to do that, Assaf?
Yes, attached a suggested patch.
Comments welcomed,
- assaf
[0001-sed-reject-RE-searches-on-buffers-larger-than-INT_MA.patch (text/x-diff, attachment)]
Information forwarded
to
bug-sed <at> gnu.org
:
bug#30520
; Package
sed
.
(Fri, 23 Mar 2018 05:18:02 GMT)
Full text and
rfc822 format available.
Message #17 received at 30520 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Thu, Mar 22, 2018 at 6:41 PM, Assaf Gordon <assafgordon <at> gmail.com> wrote:
> On Thu, Mar 22, 2018 at 05:12:11PM -0700, Jim Meyering wrote:
>> On Mon, Feb 19, 2018 at 12:32 PM, Assaf Gordon <assafgordon <at> gmail.com> wrote:
>> > On Mon, Feb 19, 2018 at 06:51:04AM +0000, YushiOMOTE wrote:
>> >> When I pass a 2GB+-single-line to sed for substitution, sed exits with
>> >> success status code but it silently fails substitution.
>> ...
>> >> Or, we can check if the value of line length exceeds INT_MAX before
>> >> reaching `re_search`. (`grep` checks it and returns with error status code)
>> >
>> > This is likely the simplest solution for now, without changing gnulib itself.
>>
>> Thank you both. I too prefer this approach. Would you like to do that, Assaf?
>
> Yes, attached a suggested patch.
Great! Thank you for the quick work!
Here are some small suggested adjustments:
- add targets so the advice on running the very expensive tests works
- minor wording changes
- remove apparently unnecessary "< /dev/null" from new test
[sed-2G-input-nits.diff (application/octet-stream, attachment)]
Information forwarded
to
bug-sed <at> gnu.org
:
bug#30520
; Package
sed
.
(Fri, 23 Mar 2018 05:49:02 GMT)
Full text and
rfc822 format available.
Message #20 received at 30520 <at> debbugs.gnu.org (full text, mbox):
On Thu, Mar 22, 2018 at 10:17 PM, Jim Meyering <jim <at> meyering.net> wrote:
> On Thu, Mar 22, 2018 at 6:41 PM, Assaf Gordon <assafgordon <at> gmail.com> wrote:
>> On Thu, Mar 22, 2018 at 05:12:11PM -0700, Jim Meyering wrote:
>>> On Mon, Feb 19, 2018 at 12:32 PM, Assaf Gordon <assafgordon <at> gmail.com> wrote:
>>> > On Mon, Feb 19, 2018 at 06:51:04AM +0000, YushiOMOTE wrote:
>>> >> When I pass a 2GB+-single-line to sed for substitution, sed exits with
>>> >> success status code but it silently fails substitution.
>>> ...
>>> >> Or, we can check if the value of line length exceeds INT_MAX before
>>> >> reaching `re_search`. (`grep` checks it and returns with error status code)
>>> >
>>> > This is likely the simplest solution for now, without changing gnulib itself.
>>>
>>> Thank you both. I too prefer this approach. Would you like to do that, Assaf?
>>
>> Yes, attached a suggested patch.
>
> Great! Thank you for the quick work!
> Here are some small suggested adjustments:
> - add targets so the advice on running the very expensive tests works
> - minor wording changes
> - remove apparently unnecessary "< /dev/null" from new test
I'll defer the first snapshot until after you push this fix.
Thanks again.
Information forwarded
to
bug-sed <at> gnu.org
:
bug#30520
; Package
sed
.
(Fri, 23 Mar 2018 15:05:02 GMT)
Full text and
rfc822 format available.
Message #23 received at 30520 <at> debbugs.gnu.org (full text, mbox):
On Thu, Mar 22, 2018 at 10:48:20PM -0700, Jim Meyering wrote:
> I'll defer the first snapshot until after you push this fix.
Thanks for the fixes, pushed with your edits here:
https://git.savannah.gnu.org/cgit/sed.git/commit/?id=5433dc245b222f6c98ab1436e170fd5e3e6e3907
Reply sent
to
Jim Meyering <jim <at> meyering.net>
:
You have taken responsibility.
(Mon, 26 Mar 2018 15:21:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
YushiOMOTE <yushiomote <at> gmail.com>
:
bug acknowledged by developer.
(Mon, 26 Mar 2018 15:21:02 GMT)
Full text and
rfc822 format available.
Message #28 received at 30520-done <at> debbugs.gnu.org (full text, mbox):
marking as "done"
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Tue, 24 Apr 2018 11:24:04 GMT)
Full text and
rfc822 format available.
This bug report was last modified 6 years and 8 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.