GNU bug report logs - #26779
expr length

Previous Next

Package: coreutils;

Reported by: Андрей Воронов <voronov <at> factor-ts.ru>

Date: Thu, 4 May 2017 15:24:02 UTC

Severity: normal

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 26779 in the body.
You can then email your comments to 26779 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-coreutils <at> gnu.org:
bug#26779; Package coreutils. (Thu, 04 May 2017 15:24:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Андрей Воронов <voronov <at> factor-ts.ru>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Thu, 04 May 2017 15:24:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Андрей Воронов <voronov <at> factor-ts.ru>
To: bug-coreutils <at> gnu.org
Subject: expr length
Date: Thu, 4 May 2017 12:59:31 +0300
Hello,

I have the bug in expr utility when it perform operation of the 
calculating length of the string in my multi-byte encoding ru_RU.UTF-8.
When I run:  "expr length Привет"
it return me the number 12.
But in the russian word "Hello" the length is 2 times shorter: exactly 6 
characters.
May you register and fix this problem?

expr --version
expr (GNU coreutils) 8.25

God bless you






Information forwarded to bug-coreutils <at> gnu.org:
bug#26779; Package coreutils. (Thu, 04 May 2017 15:44:02 GMT) Full text and rfc822 format available.

Message #8 received at 26779 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Андрей Воронов <voronov <at> factor-ts.ru>,
 26779 <at> debbugs.gnu.org
Subject: Re: bug#26779: expr length
Date: Thu, 4 May 2017 08:43:19 -0700
On 04/05/17 02:59, Андрей Воронов wrote:
> Hello,
> 
> I have the bug in expr utility when it perform operation of the 
> calculating length of the string in my multi-byte encoding ru_RU.UTF-8.
> When I run:  "expr length Привет"
> it return me the number 12.
> But in the russian word "Hello" the length is 2 times shorter: exactly 6 
> characters.
> May you register and fix this problem?

expr is listed in the plan here:
http://www.pixelbeat.org/docs/coreutils_i18n/

thanks
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#26779; Package coreutils. (Tue, 09 May 2017 07:06:02 GMT) Full text and rfc822 format available.

Message #11 received at 26779 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: 26779 <at> debbugs.gnu.org,
 Андрей Воронов <voronov <at> factor-ts.ru>
Subject: Re: bug#26779: expr length
Date: Tue, 9 May 2017 03:04:56 -0400
[Message part 1 (text/plain, inline)]
Hello,

> On May 4, 2017, at 11:43, Pádraig Brady <P <at> draigBrady.com> wrote:
> 
> On 04/05/17 02:59, Андрей Воронов wrote:
>> I have the bug in expr utility when it perform operation of the 
>> calculating length of the string in my multi-byte encoding ru_RU.UTF-8.
> 
> expr is listed in the plan here:
> http://www.pixelbeat.org/docs/coreutils_i18n/

Attached a draft patch implementing multibyte support for 'expr'
(it doesn't need any code from my previous multibyte stuff, so I'm sending it separately).

Specifically, the length/index/substr operators are adjusted.
The regex engine for the 'match' operator already supported multibyte characters (only minor adjustment needed to return matched character count instead of matched byte count).
The string comparison already used 'strcoll' so I assumed they work with multibyte strings.

Comments welcomed,
 - assaf


[0001-expr-add-multibyte-support.patch.gz (application/x-gzip, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#26779; Package coreutils. (Tue, 09 May 2017 07:27:01 GMT) Full text and rfc822 format available.

Message #14 received at 26779 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: 26779 <at> debbugs.gnu.org,
 Андрей Воронов <voronov <at> factor-ts.ru>
Subject: Re: bug#26779: expr length
Date: Tue, 9 May 2017 03:26:12 -0400
> On May 9, 2017, at 03:04, Assaf Gordon <assafgordon <at> gmail.com> wrote:
> 
> Attached a draft patch implementing multibyte support for 'expr'
> (it doesn't need any code from my previous multibyte stuff, so I'm sending it separately).
> [...]
> 
> <0001-expr-add-multibyte-support.patch.gz>

Re-reading the patch, I see there's a memory leak when 'mbs_logical_substr' is called.
I'll send updated version soon (if you have more comments, please send).





Information forwarded to bug-coreutils <at> gnu.org:
bug#26779; Package coreutils. (Tue, 09 May 2017 12:32:02 GMT) Full text and rfc822 format available.

Message #17 received at 26779 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 26779 <at> debbugs.gnu.org,
 Андрей Воронов <voronov <at> factor-ts.ru>
Subject: Re: bug#26779: expr length
Date: Tue, 9 May 2017 05:31:03 -0700
On 09/05/17 00:04, Assaf Gordon wrote:
> Hello,
> 
>> On May 4, 2017, at 11:43, Pádraig Brady <P <at> draigBrady.com> wrote:
>>
>> On 04/05/17 02:59, Андрей Воронов wrote:
>>> I have the bug in expr utility when it perform operation of the 
>>> calculating length of the string in my multi-byte encoding ru_RU.UTF-8.
>>
>> expr is listed in the plan here:
>> http://www.pixelbeat.org/docs/coreutils_i18n/
> 
> Attached a draft patch implementing multibyte support for 'expr'
> (it doesn't need any code from my previous multibyte stuff, so I'm sending it separately).
> 
> Specifically, the length/index/substr operators are adjusted.
> The regex engine for the 'match' operator already supported multibyte characters (only minor adjustment needed to return matched character count instead of matched byte count).
> The string comparison already used 'strcoll' so I assumed they work with multibyte strings.

Definitely needs a NEWS entry
and mention of this bug in the commit message.

Perf isn't a huge concern for expr use cases,
so I'd rather not address in this patch if at all,
but for future reference it might be nice to pass in
to mbschr() etc. whether the current locale is UTF8.
Maybe some global similar to MB_CUR_MAX.  Then mbschr()
could be optimized in the UTF8 case as per pseudo code at:
http://www.pixelbeat.org/docs/utf8_programming.html

s/sequnce/sequence/

I notice expr isn't handled in the rhat/suse i18n patch,
so there is nothing to consider there.

Excellent work on the tests.

I've updated status in http://www.pixelbeat.org/docs/coreutils_i18n/
as this is ready to land I think.

thanks!
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#26779; Package coreutils. (Thu, 11 May 2017 01:25:02 GMT) Full text and rfc822 format available.

Message #20 received at 26779 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Assaf Gordon <assafgordon <at> gmail.com>, Pádraig Brady
 <P <at> draigBrady.com>
Cc: 26779 <at> debbugs.gnu.org,
 Андрей Воронов <voronov <at> factor-ts.ru>
Subject: Re: bug#26779: expr length
Date: Wed, 10 May 2017 18:24:34 -0700
A couple of things. First, since performance doesn't matter here, wouldn't it be 
simpler to convert to wide character form by using mbstowcs, do the work in 
wide-character form, and then convert back via wcstombs? The resulting code 
should be easier to maintain, I'd think.

Also, please stick to GNU style for operator spacing and parenthesization.

Thanks.




Information forwarded to bug-coreutils <at> gnu.org:
bug#26779; Package coreutils. (Fri, 19 May 2017 04:11:01 GMT) Full text and rfc822 format available.

Message #23 received at 26779 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>,
 Pádraig Brady <P <at> draigBrady.com>
Cc: 26779 <at> debbugs.gnu.org,
 Андрей Воронов <voronov <at> factor-ts.ru>
Subject: Re: bug#26779: expr length
Date: Fri, 19 May 2017 00:09:58 -0400
[Message part 1 (text/plain, inline)]
Hello,

Attached an updated version of the 'expr/multibyte' patch.

> On May 9, 2017, at 08:31, Pádraig Brady <P <at> draigBrady.com> wrote:
> Definitely needs a NEWS entry
> and mention of this bug in the commit message.

I added two NEWS entries: one for the bug fix and one for the new multibyte support.

I could not find an exact version where this bug was 'introduced'.
'expr' never had a notion of multibyte strings, it was gnulib's "re_match"
which gained multibyte support.
The earliest gnulib mention of re_match and multibyte I could find is this:
  commit 42f97e3f0c576afb94226c8493f3567b44352ca8
  Author: Richard Stallman <rms <at> gnu.org>
  Date:   Fri Apr 3 07:33:13 1998 +0000

> On May 10, 2017, at 21:24, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> 
> A couple of things. First, since performance doesn't matter here, wouldn't it be simpler to convert to wide character form by using mbstowcs, do the work in wide-character form, and then convert back via wcstombs? The resulting code should be easier to maintain, I'd think.

The downside of 'mbstowcs' is that it fails and does not continue
when it encounters invalid multibyte sequences, whereas keeping
the string as 'char*' and using mbschr (with the additional code)
does handle invalid sequences fine (and the tests include such cases).

What do you think ?
I can work on a different patch with wide-char strings if this is deemed worthwhile.


> Also, please stick to GNU style for operator spacing and parenthesization.

I fixed the one case I spotted - if I missed any, please let me know
("make syntax-check" passes, but there could be others).

regards,
 - assaf


[0001-expr-add-multibyte-support.patch.gz (application/x-gzip, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#26779; Package coreutils. (Fri, 19 May 2017 05:48:02 GMT) Full text and rfc822 format available.

Message #26 received at 26779 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Assaf Gordon <assafgordon <at> gmail.com>, Pádraig Brady
 <P <at> draigBrady.com>
Cc: 26779 <at> debbugs.gnu.org,
 Андрей Воронов <voronov <at> factor-ts.ru>
Subject: Re: bug#26779: expr length
Date: Thu, 18 May 2017 22:47:33 -0700
Assaf Gordon wrote:

> The downside of 'mbstowcs' is that it fails and does not continue
> when it encounters invalid multibyte sequences,

It's OK for expr to report an error and exit in that case, which should be simple.

It's not a big deal.

> I fixed the one case I spotted - if I missed any, please let me know
> ("make syntax-check" passes, but there could be others).

char* -> char *

MB_CUR_MAX==1 -> MB_CUR_MAX == 1

mbs_logical_cspn ('hello','a') -> mbs_logical_cspn ('hello', 'a') [in a comment]

const char *_string -> const char *string

* mbui_cur_ptr (iter) -> *mbui_cur_ptr (iter)

(s+1) -> s + 1

Put */ at end of last line of comment, not at start of next.

MB_CUR_MAX>1 -> MB_CUR_MAX > 1

xmalloc ( blen + 1 ) -> xmalloc (blen + 1)

idx<pos -> idx < pos

mempcpy ( vlim -> mempcpy (vlim

d>=ofs -> d >= ofs

size_t c=0; -> size_t c = 0;

No need to use "const" on locals, unless perhaps you take their address.

Probably more instances; got tired of looking....




Information forwarded to bug-coreutils <at> gnu.org:
bug#26779; Package coreutils. (Wed, 28 Jun 2017 01:42:01 GMT) Full text and rfc822 format available.

Message #29 received at 26779 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 26779 <at> debbugs.gnu.org,
 Андрей Воронов <voronov <at> factor-ts.ru>,
 Pádraig Brady <P <at> draigBrady.com>
Subject: Re: bug#26779: expr length
Date: Wed, 28 Jun 2017 01:41:05 +0000
[Message part 1 (text/plain, inline)]
Hi,

Attached work-in-progress for expr/multibyte,
hopefully fixing all (most?) of the coding style issues
(Internal conversion to wide strings not yet implemented).

Comments welcomed,
- assaf
[0001-expr-add-multibyte-support.patch.gz (application/gzip, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#26779; Package coreutils. (Wed, 28 Jun 2017 16:34:01 GMT) Full text and rfc822 format available.

Message #32 received at 26779 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 26779 <at> debbugs.gnu.org,
 Андрей Воронов <voronov <at> factor-ts.ru>
Subject: Re: bug#26779: expr length
Date: Wed, 28 Jun 2017 09:32:57 -0700
On 27/06/17 18:41, Assaf Gordon wrote:
> Hi,
> 
> Attached work-in-progress for expr/multibyte,
> hopefully fixing all (most?) of the coding style issues
> (Internal conversion to wide strings not yet implemented).
> 
> Comments welcomed,
>  - assaf
> 

Excellent work.
This looks to have sufficient testing and is ready to push.
We can look at using wide strings internally in a further patch.

thanks!
Pádraig.




Reply sent to Assaf Gordon <assafgordon <at> gmail.com>:
You have taken responsibility. (Thu, 29 Jun 2017 01:21:02 GMT) Full text and rfc822 format available.

Notification sent to Андрей Воронов <voronov <at> factor-ts.ru>:
bug acknowledged by developer. (Thu, 29 Jun 2017 01:21:02 GMT) Full text and rfc822 format available.

Message #37 received at 26779-done <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: 26779-done <at> debbugs.gnu.org,
 Андрей Воронов <voronov <at> factor-ts.ru>
Subject: Re: bug#26779: expr length
Date: Wed, 28 Jun 2017 21:20:05 -0400
> On Jun 28, 2017, at 12:32, Pádraig Brady <P <at> draigBrady.com> wrote:
> 
> On 27/06/17 18:41, Assaf Gordon wrote:
>> Attached work-in-progress for expr/multibyte,

> This looks to have sufficient testing and is ready to push.
> We can look at using wide strings internally in a further patch.

Thanks,
pushed here:
  https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=a9f2be5bfec2bfe86c08

I'm thus closing this bug report (but of course
discussion can continue by replying to this thread).

regards,
 - assaf






bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 27 Jul 2017 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 6 years and 272 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.