GNU bug report logs - #23269
new snapshot available: grep-2.24.13-bed6

Package: grep

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Mon, 11 Apr 2016 15:54:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 23269 in the body.
You can then email your comments to 23269 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 11 Apr 2016 15:54:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Jim Meyering <jim <at> meyering.net>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 11 Apr 2016 15:54:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: bug-grep <at> gnu.org
Cc: TP coordinator <coordinator <at> translationproject.org>,
 platform-testers <at> gnu.org
Subject: new snapshot available: grep-2.24.13-bed6
Date: Mon, 11 Apr 2016 08:53:17 -0700

I plan to release grep-2.25 this week, so here's a snapshot of
the latest. Please beat it up and report success and failure here.

Thanks to Paul Eggert for fixing so many bugs (and especially
for the mbrtowc workaround in gnulib), and to Assaf Gordon for the
initial patch to make many of grep's diagnostics more informative.

[In case you're wondering why the mbrtowc work-around matters,
here's the story: I was dismayed to learn that even with the very
latest Fedora, glibc and grep-2.23 or grep-2.24, this

  printf '\344' | LC_ALL=C grep .

would print "Binary file (standard input) matches".
We should never get that "Binary file matches" diagnostic
when using the LC_ALL=C locale. Thanks to Björn JACKE
for noticing and reporting that. See http://bugs.gnu.org/23234
for full details. ]

grep snapshot:
  http://meyering.net/grep/grep-ss.tar.xz      1.3 MB
  http://meyering.net/grep/grep-ss.tar.xz.sig
  http://meyering.net/grep/grep-2.24.13-bed6.tar.xz

Changes in grep since v2.24:

Jim Meyering (5):
      maint: post-release administrivia
      maint: avoid spurious "binary file ... matches" in generated THANKS
      maint: move new 'Improvements' blurb into proper section
      tests: remove spurious test of egrep
      maint: remove unused mbtoupper function

Paul Eggert (8):
      grep: use errno consistently in write diagnostics
      grep: -oz now outputs null bytes, not newlines
      grep: -Pz no longer misdiagnoses [^a]
      tests: test egrep/fgrep help only if our grep
      Give another example of binary file processing
      build: update gnulib submodule to latest
      grep: in C locale, all bytes are valid characters
      grep: minor doc tweaks inspired by Debian


Changes in gnulib since v2.24:

* gnulib cd6a452...b7bc3c1 (55):
  > mbrtowc: work around glibc bug#19932
  > update from texinfo
  > autoupdate
  > stdint: detect good enough pre-C++11 stdint.h in C++ mode
  > argp: merge changes from glibc
  > Prefer American spelling for "initialize"
  > autoupdate
  > stddef: support configuring with g++
  > autoupdate
  > autoupdate
  > update from texinfo
  > test-framework-sh: minor cleanups
  > test-framework-sh: revert port to NetBSD 7.0
  > autoupdate
  > Port better to Alpine Linux
  > test-framework-sh: port to NetBSD 7.0
  > update from texinfo
  > gitlog-to-changelog: suppress ignored chatter
  > update from texinfo
  > update from texinfo
  > setlocale: add "sv" to Windows language table
  > update from texinfo
  > sys_select: port to new Cygwin
  > test-userspec.c: do not trigger gcc's new -Wmisleading-indentation
  > time_rz: port to clang -Wunused-const-variable
  > std-gnu11: improve clang support
  > select: port more to Intel 2016.1.150 compiler
  > select: try to port to 2016.1.150 compiler
  > localename-tests: memory allocation fixes
  > intprops: make .h file license match module
  > acl: fix missing return on Cygwin
  > update from texinfo
  > update from texinfo
  > extern-inline: port to PGI CC
  > update from texinfo
  > update from texinfo
  > signbit: port back to pre-C++11 GCC
  > mountlist: recognize autofs-mounted remote file systems, too
  > signbit: port to C++ with GCC 6
  > regex: make it closer to libc
  > regex: merge patches from libc
  > update from texinfo
  > update from texinfo
  > autoupdate
  > autoupdate
  > stdalign: port to older HP and IBM cc
  > stdalign: port to clang 3.7.0
  > update from texinfo
  > readdir_r: now obsolescent
  > Use modern texinfo when syncing install.texi output from autoconf
  > update from texinfo
  > sync install.texi from autoconf
  > misc: port better to gcc -fsanitize=address
  > update from texinfo
  > autoupdate

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 11 Apr 2016 16:14:01 GMT) Full text and rfc822 format available.

Message #8 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: 23269 <at> debbugs.gnu.org
Cc: TP coordinator <coordinator <at> translationproject.org>,
 platform-testers <at> gnu.org
Subject: Re: bug#23269: new snapshot available: grep-2.24.13-bed6
Date: Mon, 11 Apr 2016 09:13:08 -0700

On Mon, Apr 11, 2016 at 8:53 AM, Jim Meyering <jim <at> meyering.net> wrote:
> [In case you're wondering why the mbrtowc work-around matters,
> here's the story: I was dismayed to learn that even with the very
> latest Fedora, glibc and grep-2.23 or grep-2.24, this
>
>   printf '\344' | LC_ALL=C grep .
>
> would print "Binary file (standard input) matches".
> We should never get that "Binary file matches" diagnostic
> when using the LC_ALL=C locale. Thanks to Björn JACKE
> for noticing and reporting that. See http://bugs.gnu.org/23234
> for full details. ]

To summarize, that problem was due to the way mbrtowc works
in the C/POSIX locale with certain C library runtime releases.
There, mbrtowc would report that bytes 128..255 were not valid
characters, thus evoking grep's "Binary file matches" diagnostic.
Paul's fix was to add configure-time tests to detect the problem
and (when detected) to enable a replacement mbrtowc function
that calls the underlying one, and corrects for any offending case.
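
A minimal sketch of that kind of wrapper follows. It is not the gnulib
replacement itself: the names fixed_mbrtowc and in_c_locale are
hypothetical, and the sketch assumes that checking the LC_CTYPE locale
name at run time is an acceptable stand-in for the configure-time test.

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: true if the current LC_CTYPE locale is "C" or
   "POSIX".  */
static bool in_c_locale(void) {
  const char *name = setlocale(LC_CTYPE, NULL);
  return name != NULL
         && (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0);
}

/* Hypothetical wrapper: call the underlying mbrtowc and, in the C
   locale only, turn an "invalid" or "incomplete" result into a
   one-byte character.  */
size_t fixed_mbrtowc(wchar_t *pwc, const char *s, size_t n, mbstate_t *ps) {
  size_t ret = mbrtowc(pwc, s, n, ps);
  if (ret >= (size_t)-2 && s != NULL && n > 0 && in_c_locale()) {
    if (pwc != NULL)
      *pwc = (unsigned char)*s;      /* the byte stands for itself */
    if (ps != NULL)
      memset(ps, 0, sizeof *ps);     /* back to the initial state */
    return 1;
  }
  return ret;
}
```

With such a wrapper, a caller that decodes the byte \344 in the C locale
gets a one-byte character rather than an encoding error, which is what
keeps grep from reporting "Binary file matches".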

This problem is likely to affect many more programs
than just grep, so we presume it will be fixed promptly, but
don't want to make grep's proper functioning depend on
an as-yet-unreleased C library.

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 11 Apr 2016 16:31:01 GMT) Full text and rfc822 format available.

Message #11 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, 23269 <at> debbugs.gnu.org
Cc: platform-testers <at> gnu.org
Subject: Re: bug#23269: new snapshot available: grep-2.24.13-bed6
Date: Mon, 11 Apr 2016 09:29:59 -0700

On 04/11/2016 09:13 AM, Jim Meyering wrote:
> This problem is likely to affect many more programs
> than just grep, so we presume it will be fixed promptly

I am not sure about how promptly it'll be fixed in glibc, as this may 
require more developer oomph in the localedata area. Although Bruno 
Haible did a nice analysis of the issue 
<https://sourceware.org/bugzilla/show_bug.cgi?id=19932> he had some 
qualms about changing this part of glibc, and anyway I expect he has few 
free cycles to think about this. And to be honest, fiddling with 
localedata is not my fave....

Since the problem has apparently been in glibc for a decade and a half, 
I'm a bit surprised nobody filed a bug report about this until now. 
Perhaps it's because apps that care about i18n and text processing 
(e.g., Emacs, Firefox) largely bypass mbrtowc and do all the decoding 
themselves?

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 18 Apr 2016 05:46:02 GMT) Full text and rfc822 format available.

Message #14 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: "Nelson H. F. Beebe" <beebe <at> math.utah.edu>, 23269 <at> debbugs.gnu.org
Cc: Paul Eggert <eggert <at> twinsun.com>
Subject: MirBSD 10 i386 test failures [Re: grep-2.24.13-bed6 feedback
Date: Sun, 17 Apr 2016 22:45:13 -0700

[Nelson H. F. Beebe ran many tests and reported
the results privately ]

Thank you for the testing and reporting the results.
I was about to make the release when I saw your email.

Here's the first failure I have investigated:

+ tr2='\200'
+ echo X
+ tr X '\200'
+ LC_ALL=C
+ env -- tr X '\200'
++ wc -l
+ test 1 -eq 1
+ grep . in
+ fail=1
+ compare in out
+ compare_dev_null_ in out
+ test 2 = 2
+ test xin = x/dev/null
+ test xout = x/dev/null
+ return 2
+ case $? in
+ compare_ in out
+ diff -u in out
--- in  2016-04-15 13:19:50.797357000 -0600
+++ out 2016-04-15 13:19:50.797357000 -0600
@@ -1 +0,0 @@
-<80>
+ fail=1
...
FAIL c-locale (exit status: 1)

The failure is due to mirbsd's btowc, which is used in dfa.c for these:

/* Add this to the test for whether a byte is word-constituent, since on
   BSD-based systems, many values in the 128..255 range are classified as
   alphabetic, while on glibc-based systems, they are not.  */
#ifdef __GLIBC__
# define is_valid_unibyte_character(c) 1
#else
# define is_valid_unibyte_character(c) (btowc (c) != WEOF)
#endif

/* C is a "word-constituent" byte.  */
#define IS_WORD_CONSTITUENT(C) \
  (is_valid_unibyte_character (C) && (isalnum (C) || (C) == '_'))

The following two tables show the values of I for which btowc(I) == WEOF
and for which gnulib's btowc.c meets that same condition on mirbsd:

mirbsd$ LC_CTYPE=C LC_ALL=C ./a.out|fmt
194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211
212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229
230 231 232 233 234 235 236 237 238 239
mirbsd$ gcc btowc-test.c
mirbsd$ LC_CTYPE=C LC_ALL=C ./a.out|fmt
128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163
164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181
182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199
200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217
218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235
236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253
254 255
mirbsd$ uname -a
MirBSD mirbsd.vm.math.utah.edu 10 GENERIC#1359 i386 i386 AMD
Opteron(tm) Processor 6136 ("AuthenticAMD" 686-class, 512KB L2 cache)
MirBSD
mirbsd$ LC_ALL=C locale
LANG=
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=C

==========================
Normally, I would specify only LC_ALL=C, but the above invocation of
locale showed that doing so failed to set LC_CTYPE to "C". Explicitly
setting LC_CTYPE as well didn't make a difference.

Those differences lead to different "trans" (transition) tables in
dfa.c, and make dfaexec declare that "." does not match \200.

Conclusion: we'll have to make btowc work properly in the C locale, too.
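
The source of btowc-test.c is not included in this message. A probe of
that kind could be as small as the following sketch, which assumes the
test simply walks the byte values and prints those that btowc rejects;
the function name print_weof_bytes is hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/* Print each byte value in [LO, HI] for which btowc reports WEOF in
   the current locale, one per line, and return how many there were.  */
int print_weof_bytes(int lo, int hi, FILE *out) {
  int count = 0;
  for (int i = lo; i <= hi; i++)
    if (btowc(i) == WEOF) {
      fprintf(out, "%d\n", i);
      count++;
    }
  return count;
}
```

A main function would call setlocale (LC_ALL, "") and then
print_weof_bytes (0, 255, stdout), with the output piped through fmt as
in the transcript above.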

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 18 Apr 2016 06:41:02 GMT) Full text and rfc822 format available.

Message #17 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, "Nelson H. F. Beebe"
 <beebe <at> math.utah.edu>, 23269 <at> debbugs.gnu.org
Cc: Paul Eggert <eggert <at> twinsun.com>
Subject: Re: bug#23269: MirBSD 10 i386 test failures [Re: grep-2.24.13-bed6
 feedback
Date: Sun, 17 Apr 2016 23:40:03 -0700

[Message part 1 (text/plain, inline)]

Jim Meyering wrote:
> Conclusion: we'll have to make btowc work properly in the C locale, too.

Perhaps something like the attached (untested) patch? The basic idea is to have 
btowc and mbtowc use a fixed mbrtowc if the latter has the C-locale problem in 
question. While we're at it, btowc should invoke mbrtowc not mbtowc, as btowc is 
thread-safe but mbtowc is not.

[mirbsd.diff (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 18 Apr 2016 06:50:02 GMT) Full text and rfc822 format available.

Message #20 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: jim <at> meyering.net, beebe <at> math.utah.edu, 23269 <at> debbugs.gnu.org
Cc: eggert <at> twinsun.com
Subject: Re: bug#23269: MirBSD 10 i386 test failures [Re: grep-2.24.13-bed6
 feedback
Date: Mon, 18 Apr 2016 00:49:10 -0600

Hi All.

Note that MirBSD's libc is badly broken. Even when LC_ALL=C MB_CUR_MAX
can be > 1. And perhaps other severe departures from reality.

There is code in gawk to deal with it - you can look at the gawk 4.1.3
tarball and various bits in the C code for LIBC_IS_BORKED (or some such).

For the next major release (gawk's master branch, no timeframe yet) I
removed all that code because it was exceedingly ugly and I think that
Nelson is the only one in the world who attempts to build gawk on MirBSD.

While this is admirable on his part, I finally decided that I didn't want
the headache of maintaining those changes.

So - Caveat Emptor; you may be twisting your code base for the benefit
of just a single system that's WAAAY out in left field.

My two cents worth.

Thanks,

Arnold

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 18 Apr 2016 14:56:02 GMT) Full text and rfc822 format available.

Message #23 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Paul Eggert <eggert <at> twinsun.com>, 23269 <at> debbugs.gnu.org,
 "Nelson H. F. Beebe" <beebe <at> math.utah.edu>
Subject: Re: bug#23269: MirBSD 10 i386 test failures [Re: grep-2.24.13-bed6
 feedback
Date: Mon, 18 Apr 2016 07:54:51 -0700

On Sun, Apr 17, 2016 at 11:40 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Jim Meyering wrote:
>>
>> Conclusion: we'll have to make btowc work properly in the C locale, too.
>
>
> Perhaps something like the attached (untested) patch? The basic idea is to
> have btowc and mbtowc use a fixed mbrtowc if the latter has the C-locale
> problem in question. While we're at it, btowc should invoke mbrtowc not
> mbtowc, as btowc is thread-safe but mbtowc is not.

Thanks for the quick patch.
I'm sure you intended this additional change, so
that the if-expression can sometimes be false:

-      if (mbrtowc (&wc, buf, 1, &mbs) >= 0)
+       if (mbrtowc (&wc, buf, 1, &mbs) < (size_t)-2)

with that, the btowc replacement function still
declares bytes 128..255 to be invalid in the C
locale.
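
The test needs care because mbrtowc returns a size_t, which is unsigned:
errors come back as (size_t)-1 and (size_t)-2, so a ">= 0" comparison is
always true. The sketch below shows the two forms of the check that do
work; the function names are hypothetical and this is not the code from
the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* True if R, a value returned by mbrtowc, denotes a decoded character
   (possibly the null character) rather than (size_t)-1 or (size_t)-2.  */
bool mbrtowc_succeeded(size_t r) {
  return r < (size_t)-2;
}

/* True if the single byte B is a complete, valid character in the
   current locale.  With a one-byte input the only successful results
   are 0 and 1, so "<= 1" is an equivalent test.  */
bool byte_is_character(unsigned char b) {
  char buf[1] = { (char)b };
  mbstate_t mbs;
  wchar_t wc;
  memset(&mbs, 0, sizeof mbs);
  return mbrtowc(&wc, buf, 1, &mbs) <= 1;
}
```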

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 18 Apr 2016 15:01:02 GMT) Full text and rfc822 format available.

Message #26 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Aharon Robbins <arnold <at> skeeve.com>
Cc: Paul Eggert <eggert <at> twinsun.com>, 23269 <at> debbugs.gnu.org,
 "Nelson H. F. Beebe" <beebe <at> math.utah.edu>
Subject: Re: bug#23269: MirBSD 10 i386 test failures [Re: grep-2.24.13-bed6
 feedback
Date: Mon, 18 Apr 2016 08:00:08 -0700

On Sun, Apr 17, 2016 at 11:49 PM,  <arnold <at> skeeve.com> wrote:
> Hi All.
>
> Note that MirBSD's libc is badly broken. Even when LC_ALL=C MB_CUR_MAX
> can be > 1. And perhaps other severe departures from reality.
>
> There is code in gawk to deal with it - you can look at the gawk 4.1.3
> tarball and various bits in the C code for LIBC_IS_BORKED (or some such).
>
> For the next major release (gawk's master branch, no timeframe yet) I
> removed all that code because it was exceedingly ugly and I think that
> Nelson is the only one in the world who attempts to build gawk on MirBSD.
>
> While this is admirable on his part, I finally decided that I didn't want
> the headache of maintaining those changes.
>
> So - Caveat Emptor; you may be twisting your code base for the benefit
> of just a single system that's WAAAY out in left field.

Thanks for the heads up, Arnold.
Note that so far, none of the changes we're considering
are to the core parts of grep. Rather, they affect only
the portability layers provided by gnulib. As such,
any change we go with is likely to have no impact
on any system other than MirBSD or some
other system that has the same type of defect.

However, given that its mbrtowc function exhibits
the same problem, I'm inclined to write it off.

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 18 Apr 2016 15:06:02 GMT) Full text and rfc822 format available.

Message #29 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: Paul Eggert <eggert <at> twinsun.com>, 23269 <at> debbugs.gnu.org,
 "Nelson H. F. Beebe" <beebe <at> math.utah.edu>
Subject: Re: bug#23269: MirBSD 10 i386 test failures [Re: grep-2.24.13-bed6
 feedback
Date: Mon, 18 Apr 2016 08:05:34 -0700

On 04/18/2016 07:54 AM, Jim Meyering wrote:
> I'm sure you intended this additional change, so
> that the if-expression can sometimes be false:
>
> -      if (mbrtowc (&wc, buf, 1, &mbs) >= 0)
> +       if (mbrtowc (&wc, buf, 1, &mbs) < (size_t)-2)

Oh yes.  (Blush.)  Or it could be <= 1.

>
> with that, the btowc replacement function still
> declares bytes 128..255 to be invalid in the C
> locale.
>

Too bad. I'm afraid someone with access to MirBSD will need to debug it.

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Tue, 19 Apr 2016 16:05:01 GMT) Full text and rfc822 format available.

Message #32 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: Paul Eggert <eggert <at> twinsun.com>, arnold <at> skeeve.com, 23269 <at> debbugs.gnu.org,
 "Nelson H. F. Beebe" <beebe <at> math.utah.edu>
Subject: Re: bug#23269: MirBSD 10 i386 test failures [Re: grep-2.24.13-bed6
 feedback
Date: Tue, 19 Apr 2016 09:04:36 -0700

[Message part 1 (text/plain, inline)]

On 04/18/2016 08:05 AM, Paul Eggert wrote:
> I'm afraid someone with access to MirBSD will need to debug it.

On second thought there is a simpler fix: stop using btowc. I installed 
the attached patch, which is a good idea anyway. By using only mbrtowc 
(which we need to use anyway), it avoids problems on misconfigured 
systems like MirOS BSD where btowc disagrees with mbrtowc.

After writing and debugging this patch I looked at Gawk and noticed that 
it already has its own equivalent of this patch's new mbrtowc_cache 
variable. Gawk obtains its cache via btowc; although this doesn't work 
on MirOS BSD due to its buggy btowc, Arnold says he's not worried about 
MirOS BSD any more which is quite understandable. Still, it's a bit odd 
to have two caches in Gawk that do the same thing; perhaps we can unify 
them at some point.

[0001-dfa-remove-dependency-on-btowc.patch (application/x-patch, attachment)]
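
The attachment is not reproduced in this log. As a rough sketch of the
idea only, a per-byte cache filled through mbrtowc instead of btowc
might look like the following; byte_wc_cache, init_byte_wc_cache and
byte_to_wc are hypothetical names, not necessarily those in the patch.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* For each byte value, the wide character it decodes to on its own,
   or WEOF if it is not a complete character by itself.  */
static wint_t byte_wc_cache[UCHAR_MAX + 1];

/* Fill the cache using mbrtowc only, so the table cannot disagree
   with later mbrtowc calls made on the same data.  */
void init_byte_wc_cache(void) {
  for (int i = 0; i <= UCHAR_MAX; i++) {
    char c = (char)i;
    mbstate_t s;
    wchar_t wc;
    memset(&s, 0, sizeof s);
    byte_wc_cache[i] = mbrtowc(&wc, &c, 1, &s) <= 1 ? (wint_t)wc : WEOF;
  }
}

/* Constant-time lookup that takes the place of a btowc call.  */
wint_t byte_to_wc(unsigned char b) {
  return byte_wc_cache[b];
}
```

The table has to be rebuilt whenever the locale changes, since its
contents depend on the LC_CTYPE setting in effect when it was filled.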

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Tue, 19 Apr 2016 21:10:02 GMT) Full text and rfc822 format available.

Message #35 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Paul Eggert <eggert <at> twinsun.com>, Aharon Robbins <arnold <at> skeeve.com>,
 23269 <at> debbugs.gnu.org, "Nelson H. F. Beebe" <beebe <at> math.utah.edu>
Subject: Re: bug#23269: MirBSD 10 i386 test failures [Re: grep-2.24.13-bed6
 feedback
Date: Tue, 19 Apr 2016 14:08:57 -0700

On Tue, Apr 19, 2016 at 9:04 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 04/18/2016 08:05 AM, Paul Eggert wrote:
>>
>> I'm afraid someone with access to MirBSD will need to debug it.
>
> On second thought there is a simpler fix: stop using btowc. I installed the
> attached patch, which is a good idea anyway. By using only mbrtowc (which we
> need to use anyway), it avoids problems on misconfigured systems like MirOS
> BSD where btowc disagrees with mbrtowc.
>
> After writing and debugging this patch I looked at Gawk and noticed that it
> already has its own equivalent of this patch's new mbrtowc_cache variable.
> Gawk obtains its cache via btowc; although this doesn't work on MirOS BSD
> due to its buggy btowc, Arnold says he's not worried about MirOS BSD any
> more which is quite understandable. Still, it's a bit odd to have two caches
> in Gawk that do the same thing; perhaps we can unify them at some point.

Oh! Very nice. Thanks yet again, Paul :-)

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Wed, 20 Apr 2016 09:01:01 GMT) Full text and rfc822 format available.

Message #38 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: jim <at> meyering.net, eggert <at> cs.ucla.edu
Cc: eggert <at> twinsun.com, arnold <at> skeeve.com, 23269 <at> debbugs.gnu.org,
 beebe <at> math.utah.edu
Subject: Re: bug#23269: MirBSD 10 i386 test failures [Re: grep-2.24.13-bed6
 feedback
Date: Wed, 20 Apr 2016 02:49:07 -0600

Jim Meyering <jim <at> meyering.net> wrote:

> On Tue, Apr 19, 2016 at 9:04 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > On 04/18/2016 08:05 AM, Paul Eggert wrote:
> >>
> >> I'm afraid someone with access to MirBSD will need to debug it.
> >
> > On second thought there is a simpler fix: stop using btowc. I installed the
> > attached patch, which is a good idea anyway. By using only mbrtowc (which we
> > need to use anyway), it avoids problems on misconfigured systems like MirOS
> > BSD where btowc disagrees with mbrtowc.
> >
> > After writing and debugging this patch I looked at Gawk and noticed that it
> > already has its own equivalent of this patch's new mbrtowc_cache variable.
> > Gawk obtains its cache via btowc; although this doesn't work on MirOS BSD
> > due to its buggy btowc, Arnold says he's not worried about MirOS BSD any
> > more which is quite understandable. Still, it's a bit odd to have two caches
> > in Gawk that do the same thing; perhaps we can unify them at some point.
>
> Oh! Very nice. Thanks yet again, Paul :-)

Thanks Paul. I will merge that change into gawk.

I will then look into unifying the two single-byte-to-multibyte caches.
This will likely mean interface additions in dfa.h and some minor
code changes in dfa.c. I will submit a patch for review here before
committing in gawk.

Just to clarify, MirBSD is still supported in the "stable" code base
(gawk-4.1-stable branch in git), and I'm working on another release
from that branch that I hope will happen in the near future.  But for
the long term, yes, I don't care about MirBSD.  It's just too weird.
:-(

Thanks,

Arnold

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Wed, 20 Apr 2016 16:41:02 GMT) Full text and rfc822 format available.

Message #41 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: arnold <at> skeeve.com, jim <at> meyering.net
Cc: 23269 <at> debbugs.gnu.org
Subject: Re: bug#23269: MirBSD 10 i386 test failures [Re: grep-2.24.13-bed6
 feedback
Date: Wed, 20 Apr 2016 09:40:05 -0700

On 04/20/2016 01:49 AM, arnold <at> skeeve.com wrote:
> This will likely mean interface additions in dfa.h and some minor
> code changes in dfa.c.

One thing that bugged me about dfa.c (when I was looking at this 
yesterday) is that it maintains some state in static variables, which 
means it can't be used in multiple threads using different locales. 
That's not an issue with grep or gawk now, but might be for other apps 
and might conceivably be a problem even in grep, which has a 
multithreaded patch pending and might conceivably want to use per-file 
encodings. So perhaps, while we're thinking about exposing the 
uni-to-multibyte cache anyway, we might want to look into fixing these 
other interface issues as well.

PS. I'm dropping eggert <at> twinsun.com from the CC: list, as that email 
address hasn't worked for many years....

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Thu, 21 Apr 2016 00:29:02 GMT) Full text and rfc822 format available.

Message #44 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: sur-behoffski <sur_behoffski <at> grouse.com.au>
To: 23269 <at> debbugs.gnu.org
Subject: Multi-threaded operation, mbrtowc, and "untangle" script [was Re:
 bug#23269...]
Date: Thu, 21 Apr 2016 09:58:37 +0930

On 04/21/16 02:10, Paul Eggert wrote:
> On 04/20/2016 01:49 AM, arnold <at> skeeve.com wrote:
>> This will likely mean interface additions in dfa.h and some minor
>> code changes in dfa.c.
>
> One thing that bugged me about dfa.c (when I was looking at this
> yesterday) is that it maintains some state in static variables, which
> means it can't be used in multiple threads using different locales.
> That's not an issue with grep or gawk now, but might be for other apps
> and might conceivably be a problem even in grep, which has a
> multithreaded patch pending and might conceivably want to use per-file
> encodings. So perhaps, while we're thinking about exposing the
> uni-to-multibyte cache anyway, we might want to look into fixing these
> other interface issues as well.
>
> PS. I'm dropping eggert <at> twinsun.com from the CC: list, as that
> email address hasn't worked for many years....
>
>

G'day,

(Sobs quietly to self:)  One of the explicit design goals that I had
behind writing the "untangle" Lua script was to reduce or eliminate
static variables:  If I recall correctly (it's been 18 months since I
looked at this), I split earlier parts of dfa.c into:
     * charclass;
     * lexer; and
     * parser;

with the remaining dfa.c code (especially the search algorithm)
untouched as being in the "too hard" (for a first pass) basket.

Each of these had an explicit instance/context pointer, e.g. "class",
"lexer" or "parser", as appropriate, eliminating any static variables.
I believe the only exception to this, for a long time, was the handover
of {m,n} counts by static variables -- I ended up inventing a clumsy
"fence" interface so that the parser could explicitly fetch these
values from the opaque lexer context.
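
To illustrate the general pattern, here is a small sketch of repeat
counts kept in a per-instance context rather than in static variables.
It is not taken from the "untangle" script; the type lexer_ctx and its
functions are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical lexer context: the {m,n} counts live here, so two
   lexers can run side by side without sharing state.  */
struct lexer_ctx {
  int minrep;
  int maxrep;
};

struct lexer_ctx *lexer_new(void) {
  return calloc(1, sizeof(struct lexer_ctx));
}

void lexer_free(struct lexer_ctx *lx) {
  free(lx);
}

/* Called by the lexer when it has scanned an interval such as {2,5}.  */
void lexer_set_repeat(struct lexer_ctx *lx, int minrep, int maxrep) {
  lx->minrep = minrep;
  lx->maxrep = maxrep;
}

/* Accessor for the parser, which cannot see inside the lexer's
   context in a design where that type is opaque.  */
void lexer_get_repeat(const struct lexer_ctx *lx, int *minrep, int *maxrep) {
  *minrep = lx->minrep;
  *maxrep = lx->maxrep;
}
```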

I kept updating the script after releases, but stopped when asked to,
as people felt that the signal/noise ratio in the list, resulting from
the regular releases of the script, was being reduced.  Since that
time, a few minor, obvious changes that I wrote in the untangle script
have appeared in patches by others.  A number of static variables have
been changed to being per-instance variables during this time, when
the code has been touched for other reasons, and the instance change
is easy to include.

(At the same time, there has been considerable activity in dfa.c
itself, so updating "untangle" would be a significant undertaking.)

As I was writing this at the time, I was thinking about having different
instances running in parallel, and I recall looking at mbrtowc in this
light.  There is a potential problem if multiple locales are desired:
Some locale-specific processing is done when the modules are first
initialised (e.g. setting up some tables), and mbrtowc itself is not
thread-safe, as it assumes a "current" locale.

So, I'm not sure if a thread-safe (i.e. locale-safe) version of mbrtowc
exists; if not, this needs to be addressed before a split-locale,
multi-threaded version is feasible.  (LC_CTYPE race conditions?)

cheers,

sur-behoffski (Brenton Hoff)
Programmer, Grouse Software

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Thu, 21 Apr 2016 09:57:01 GMT) Full text and rfc822 format available.

Message #47 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: sur_behoffski <at> grouse.com.au, 23269 <at> debbugs.gnu.org
Subject: Re: bug#23269: Multi-threaded operation, mbrtowc, and "untangle"
 script [was Re: bug#23269...]
Date: Thu, 21 Apr 2016 03:55:59 -0600

sur-behoffski <sur_behoffski <at> grouse.com.au> wrote:

> So, I'm not sure if a thread-safe (i.e. locale-safe) version of mbrtowc
> exists; if not, this needs to be addressed before a split-locale,
> multi-threaded version is feasible.  (LC_CTYPE race conditions?)

By definition, mbrtowc is thread safe.  The question relates better
to setlocale(), or rather to the underlying internal locale data. I don't
think the current POSIX model lends itself to multiple locales within
the same process.

I have to wonder if a multi-threaded grep makes sense in any case; it
would be a very surprising change in behavior if output from multiple
files comes out interleaved, instead of in the order the files were
specified on the command line.

My two cents, of course.

Thanks,

Arnold

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Thu, 21 Apr 2016 14:44:02 GMT) Full text and rfc822 format available.

Message #50 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: arnold <at> skeeve.com, sur_behoffski <at> grouse.com.au, 23269 <at> debbugs.gnu.org
Subject: Re: bug#23269: Multi-threaded operation, mbrtowc, and "untangle"
 script [was Re: bug#23269...]
Date: Thu, 21 Apr 2016 07:43:29 -0700

On 04/21/2016 02:55 AM, arnold <at> skeeve.com wrote:
> I don't
> think the current POSIX model lends itself to multiple locales within
> the same process.

Although that was an issue years ago, is it still a problem with 
uselocale and the like?
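
For reference, a minimal sketch of the POSIX.1-2008 per-thread locale
calls in question. The helper name mb_cur_max_in is hypothetical; the
sketch only shows that a thread can switch its own locale with
uselocale without touching the process-wide one set by setlocale.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdlib.h>

/* Return MB_CUR_MAX as seen under the locale NAME, installing that
   locale for the calling thread only.  Returns 0 if the locale cannot
   be created.  */
size_t mb_cur_max_in(const char *name) {
  locale_t loc = newlocale(LC_ALL_MASK, name, (locale_t)0);
  if (loc == (locale_t)0)
    return 0;
  locale_t old = uselocale(loc);   /* affects this thread only */
  size_t n = MB_CUR_MAX;
  uselocale(old);                  /* restore the previous setting */
  freelocale(loc);
  return n;
}
```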

> it
> would be a very surprising change in behavior if output from multiple
> files comes out interleaved, instead of in the order the files were
> specified on the command line.

I presume that computation is interleaved but the output order is the 
same as before, unless the user specifies an option saying speed trumps 
predictability.
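
One way to get that behavior is to give each file its own output buffer
and to emit the buffers in command-line order once the searches finish.
The sketch below assumes POSIX threads and uses in-memory strings as
stand-ins for files; it is not taken from the pending multithreading
patch, and all names in it are hypothetical.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

enum { MAX_JOBS = 16, OUT_SIZE = 128 };

/* One unit of work; TEXT stands in for a file's contents.  */
struct job {
  const char *name;
  const char *text;
  const char *needle;
  char out[OUT_SIZE];   /* private buffer, written only by this job */
};

static void *search_one(void *arg) {
  struct job *j = arg;
  j->out[0] = '\0';
  if (strstr(j->text, j->needle) != NULL)
    snprintf(j->out, sizeof j->out, "%s:%s\n", j->name, j->text);
  return NULL;
}

/* Search all jobs concurrently, then append their buffers to RESULT
   in argument order.  Returns 0 on success, -1 on bad arguments.  */
int search_all(struct job *jobs, int n, char *result, size_t size) {
  pthread_t tid[MAX_JOBS];
  int started[MAX_JOBS];
  if (n < 0 || n > MAX_JOBS || size == 0)
    return -1;
  for (int i = 0; i < n; i++) {
    started[i] = pthread_create(&tid[i], NULL, search_one, &jobs[i]) == 0;
    if (!started[i])
      search_one(&jobs[i]);   /* fall back to searching inline */
  }
  result[0] = '\0';
  for (int i = 0; i < n; i++) {
    if (started[i])
      pthread_join(tid[i], NULL);
    strncat(result, jobs[i].out, size - strlen(result) - 1);
  }
  return 0;
}
```

The searches may finish in any order, but the joins and the copying
happen in index order, so the combined output is the same as in a
single-threaded run.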

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 25 Apr 2016 00:19:02 GMT) Full text and rfc822 format available.

Message #53 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: sur_behoffski <at> grouse.com.au, arnold <at> skeeve.com, 23269 <at> debbugs.gnu.org
Subject: Re: bug#23269: Multi-threaded operation, mbrtowc, and "untangle"
 script [was Re: bug#23269...]
Date: Sun, 24 Apr 2016 19:18:22 -0500

On Thu, Apr 21, 2016 at 07:43:29AM -0700, Paul Eggert wrote:
>On 04/21/2016 02:55 AM, arnold <at> skeeve.com wrote:
>>it
>>would be a very surprising change in behavior if output from multiple
>>files comes out interleaved, instead of in the order the files were
>>specified on the command line.
>
>I presume that computation is interleaved but the output order is the 
>same as before, unless the user specifies an option saying speed 
>trumps predictability.
>

For what it's worth, the command-line flag added by my multithreading 
patch series as it currently stands is pretty much that (speed over 
predictability).  In the interest of simplicity, it omits per-file 
output buffering and just outputs matching lines as they are found -- 
the non-determinism this introduces into its output is the reason it's 
left as an opt-in command-line flag and not on by default.

[Strictly speaking even in the default "single-threaded" mode it *is* in 
fact actually multi-threaded, but there's only one search thread, so 
output ordering is unaffected.  In theory even this could allow a slight 
performance improvement by overlapping pattern-matching with directory 
traversal and prefetching in the master thread, but I'd guess it's 
probably negligible in most cases, and isn't really the goal of the 
patches.]

Zev Weiss

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 25 Apr 2016 05:51:02 GMT) Full text and rfc822 format available.

Message #56 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zev Weiss <zev <at> bewilderbeest.net>
Cc: sur_behoffski <at> grouse.com.au, arnold <at> skeeve.com, 23269 <at> debbugs.gnu.org
Subject: Re: bug#23269: Multi-threaded operation, mbrtowc, and "untangle"
 script [was Re: bug#23269...]
Date: Sun, 24 Apr 2016 22:50:30 -0700

Zev Weiss wrote:
> [Strictly speaking even in the default "single-threaded" mode it *is* in fact
> actually multi-threaded, but there's only one search thread, so output ordering
> is unaffected.  In theory even this could allow a slight performance improvement
> by overlapping pattern-matching with directory traversal and prefetching in the
> master thread, but I'd guess it's probably negligible in most cases, and isn't
> really the goal of the patches.]

In the common case where a command like 'grep -r unusual' reads many files but
outputs few lines, I would think multiple search threads could work pretty well
even if the output is required to be deterministic.
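
A minimal sketch of how multiple search threads could still give
deterministic output: each thread fills a private buffer for its file, and
the main thread emits the buffers in command-line order.  This is not the
pending patch series.  struct job, search_one and search_all are
hypothetical names, files are stood in for by in-memory strings, matching
is a plain substring test, and the buffer sizes are arbitrary.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

enum { MAX_FILES = 16, BUF_SIZE = 256 };

/* One search job per "file" (hypothetical; a file is an in-memory string).  */
struct job
{
  const char *name;
  const char *contents;   /* newline-separated lines */
  const char *needle;
  char out[BUF_SIZE];     /* private output buffer, written by one thread */
};

/* Thread body: append "name:line\n" to the job's buffer for each match.  */
static void *
search_one (void *arg)
{
  struct job *j = arg;
  size_t used = 0;
  j->out[0] = '\0';
  for (const char *line = j->contents; *line; )
    {
      const char *eol = strchr (line, '\n');
      size_t len = eol ? (size_t) (eol - line) : strlen (line);
      /* The first occurrence at or after LINE must end inside this line.  */
      const char *hit = strstr (line, j->needle);
      if (hit && hit + strlen (j->needle) <= line + len)
        {
          int w = snprintf (j->out + used, BUF_SIZE - used, "%s:%.*s\n",
                            j->name, (int) len, line);
          if (w < 0 || (size_t) w >= BUF_SIZE - used)
            {
              j->out[used] = '\0';   /* buffer full: drop the partial line */
              break;
            }
          used += (size_t) w;
        }
      line = eol ? eol + 1 : line + len;
    }
  return NULL;
}

/* Search all jobs concurrently, then copy the per-job buffers to DEST in
   job (command-line) order.  Returns 0 on success, -1 on failure.  */
int
search_all (struct job *jobs, int njobs, char *dest, size_t destsize)
{
  pthread_t tid[MAX_FILES];
  if (njobs > MAX_FILES || destsize == 0)
    return -1;
  dest[0] = '\0';

  int started = 0;
  for (; started < njobs; started++)
    if (pthread_create (&tid[started], NULL, search_one, &jobs[started]) != 0)
      break;

  size_t used = 0;
  int status = started == njobs ? 0 : -1;
  for (int i = 0; i < started; i++)
    {
      pthread_join (tid[i], NULL);   /* wait in command-line order */
      int w = snprintf (dest + used, destsize - used, "%s", jobs[i].out);
      if (w < 0 || (size_t) w >= destsize - used)
        status = -1;                 /* DEST too small; keep joining */
      else
        used += (size_t) w;
    }
  return status;
}
```

Because the join loop walks the jobs in order, output for a later file
waits on earlier files, but when few lines match the buffers stay small and
the searching itself still overlaps.  On older systems this may need to be
built with -pthread.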

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Tue, 26 Apr 2016 04:13:01 GMT) Full text and rfc822 format available.

Message #59 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: sur-behoffski <sur_behoffski <at> grouse.com.au>
To: 23269 <at> debbugs.gnu.org
Subject: [Re:] bug#23269: Multi-threaded operation, mbrtowc [...]
Date: Tue, 26 Apr 2016 13:42:16 +0930

[The message below was originally sent only to Arnold, but I intended
it to go to 23269 <at> debbugs.gnu.org as well.  Seeing as the conversation
regarding multi-threaded grep operation is continuing, I've decided to
forward it to the bug list.  Apologies to Arnold (and others as
appropriate) if this is a duplicate.  -- sur-behoffski]


-------- Forwarded Message --------
Subject: Re: bug#23269: Multi-threaded operation, mbrtowc, and "untangle" script [was Re: bug#23269...]
Date: Thu, 21 Apr 2016 21:32:15 +0930
From: sur-behoffski <sur_behoffski <at> grouse.com.au>
To: arnold <at> skeeve.com

On 04/21/16 19:25, arnold <at> skeeve.com wrote:
> sur-behoffski <sur_behoffski <at> grouse.com.au> wrote:
>
>> So, I'm not sure if a thread-safe (i.e. locale-safe) version of mbrtowc
>> exists; if not, this needs to be addressed before a split-locale,
>> multi-threaded version is feasible.  (LC_CTYPE race conditions?)
>
> By definition, mbrtowc is thread safe.  The question relates better
> to setlocale(), or rather to the underlying internal locale data. I don't
> think the current POSIX model lends itself to multiple locales within
> the same process.
>

Thanks for the response.  As noted in the man pages, the thread safety
does not extend to multi-locale settings, and this is explicitly what Paul
was hoping for in the message that I replied to:

    On 04/21/16 02:10, Paul Eggert wrote:
    > [...]
    > One thing that bugged me about dfa.c (when I was looking at this
    > yesterday) is that it maintains some state in static variables, which
    > means it can't be used in multiple threads using different locales.
    > That's not an issue with grep or gawk now, but might be for other
    > apps and might conceivably be a problem even in grep, which has a
    > multithreaded patch pending and might conceivably want to use per-file
    > encodings. [...]


"man 3 mbrtowc" on my Gentoo system has the following text in the ATTRIBUTES,
CONFORMING TO, NOTES and COLOPHON sections:

------ (Start of excerpt) ------

ATTRIBUTES
       For an explanation of the terms used in this section, see attributes(7).

       +----------+---------------+----------------------------+
       |Interface | Attribute     | Value                      |
       +----------+---------------+----------------------------+
       |mbrtowc() | Thread safety | MT-Unsafe race:mbrtowc/!ps |
       +----------+---------------+----------------------------+

CONFORMING TO
       POSIX.1-2001, POSIX.1-2008, C99.

NOTES
       The behavior of mbrtowc() depends on the LC_CTYPE category of the current locale.

[...]

COLOPHON
       This  page  is  part of release 4.04 of the Linux man-pages project.  A description of the
       project, information about reporting bugs, and the latest version of  this  page,  can  be
       found at http://www.kernel.org/doc/man-pages/.

GNU                                         2015-08-08                                 MBRTOWC(3)

------ (End of excerpt) ------
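
Going by the glibc manual's conventions for safety annotations, the "/!ps"
suffix in the table above limits the race to calls that pass a null ps
argument, in which case mbrtowc uses a hidden static conversion state.  A
minimal sketch of the alternative, where the caller owns the mbstate_t,
follows.  count_chars is a hypothetical helper, not code from grep or
gnulib, and its result still depends on the LC_CTYPE category of whatever
locale is current.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in the first N bytes of S, decoding with the
   current locale's LC_CTYPE.  Returns (size_t) -1 if S contains an
   invalid or incomplete multibyte sequence.  The conversion state is a
   local variable, so mbrtowc's internal static state is never used.  */
size_t
count_chars (const char *s, size_t n)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);   /* initial conversion state */

  size_t count = 0;
  while (n > 0)
    {
      wchar_t wc;
      size_t len = mbrtowc (&wc, s, n, &state);
      if (len == (size_t) -1 || len == (size_t) -2)
        return (size_t) -1;
      if (len == 0)
        len = 1;   /* a null character was decoded; step over its byte */
      s += len;
      n -= len;
      count++;
    }
  return count;
}
```

For example, count_chars ("grep", 4) yields 4 in the default C locale.
This removes the shared conversion state only.  It does nothing about
different threads wanting different locales.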

cheers,

sur-behoffski (Brenton Hoff)
Programmer, Grouse Software

Information forwarded to bug-grep <at> gnu.org:
bug#23269; Package grep. (Mon, 02 May 2016 03:13:01 GMT) Full text and rfc822 format available.

Message #62 received at 23269 <at> debbugs.gnu.org (full text, mbox):

From: Aharon Robbins <arnold <at> skeeve.com>
To: eggert <at> cs.ucla.edu
Cc: sur_behoffski <at> grouse.com.au, arnold <at> skeeve.com, 23269 <at> debbugs.gnu.org
Subject: Re: bug#23269: Multi-threaded operation, mbrtowc,
 and "untangle" script [was Re: bug#23269...]
Date: Mon, 02 May 2016 06:12:13 +0300

> Subject: Re: bug#23269: Multi-threaded operation, mbrtowc, and "untangle"
>  script [was Re: bug#23269...]
> To: arnold <at> skeeve.com, sur_behoffski <at> grouse.com.au, 23269 <at> debbugs.gnu.org
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> 
> On 04/21/2016 02:55 AM, arnold <at> skeeve.com wrote:
> > I don't
> > think the current POSIX model lends itself to multiple locales within
> > the same process.
> 
> Although that was an issue years ago, is it still a problem with 
> uselocale and the like?

I wasn't aware of uselocale, newlocale, and duplocale until now. It
looks like those solve the problem. Interesting!

Thanks,

Arnold
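
A minimal sketch of the POSIX.1-2008 per-thread locale calls mentioned
above (newlocale, uselocale, freelocale), assuming a system such as glibc
that declares them in <locale.h>.  count_chars_in_locale is a hypothetical
helper, not code from grep, gawk or dfa.c.

```c
#define _POSIX_C_SOURCE 200809L   /* for newlocale, uselocale, freelocale */
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in TEXT as decoded under LOCALE_NAME's LC_CTYPE,
   leaving the process-wide locale untouched.  Returns -1 if the locale
   cannot be created or TEXT is not valid in that encoding.  */
long
count_chars_in_locale (const char *locale_name, const char *text)
{
  locale_t loc = newlocale (LC_CTYPE_MASK, locale_name, (locale_t) 0);
  if (loc == (locale_t) 0)
    return -1;

  /* uselocale changes the locale of the calling thread only.  */
  locale_t old = uselocale (loc);

  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t n = strlen (text);
  long count = 0;
  while (n > 0)
    {
      size_t len = mbrtowc (NULL, text, n, &state);
      if (len == (size_t) -1 || len == (size_t) -2)
        {
          count = -1;   /* invalid or incomplete sequence */
          break;
        }
      if (len == 0)
        len = 1;
      text += len;
      n -= len;
      count++;
    }

  uselocale (old);    /* switch back before freeing the locale in use */
  freelocale (loc);
  return count;
}
```

Each thread could call this with its own locale name without calling
setlocale.  Only "C" and "POSIX" are sure to exist.  Any other name depends
on which locales are installed.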

bug closed, send any further explanations to 23269 <at> debbugs.gnu.org and Jim Meyering <jim <at> meyering.net> Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Thu, 08 Sep 2016 08:25:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 06 Oct 2016 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 7 years and 197 days ago.


