GNU bug report logs - #35939
version sort is incorrect with hyphen-minus

Package: coreutils

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Tue, 28 May 2019 00:55:01 UTC

Severity: normal

To reply to this bug, email your comments to 35939 AT debbugs.gnu.org.



Report forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Tue, 28 May 2019 00:55:01 GMT)

Acknowledgement sent to Vincent Lefevre <vincent <at> vinc17.net>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 28 May 2019 00:55:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Vincent Lefevre <vincent <at> vinc17.net>
To: bug-coreutils <at> gnu.org
Subject: version sort is incorrect with hyphen-minus
Date: Tue, 28 May 2019 02:53:39 +0200

With GNU coreutils 8.30 under Debian/unstable, I get:

$ LC_ALL=C ls
ab-cd  abb  abe
$ LC_ALL=C ls -v
abb  abe  ab-cd

The hyphen-minus character should still be regarded as being less
than the letters (there are no digits, so "ls" and "ls -v" are expected
to give the same order). The GNU coreutils manual says:

10.1.3 Sorting the output
-------------------------
[...]
‘-v’
‘--sort=version’
     Sort by version name and number, lowest first.  It behaves like a
     default sort, except that each sequence of decimal digits is
     treated numerically as an index/version number.  (*Note Details
     about version sort::.)

(which is exactly what I expect).

The "sort -V" command has the same issue.

Note: If I add two more files and compare with zsh:

zira% export LC_ALL=C
zira% ls
ab-cd  ab10  ab2  abb  abe
zira% ls -v
ab2  ab10  abb  abe  ab-cd
zira% echo *
ab-cd ab10 ab2 abb abe
zira% echo *(n)
ab-cd ab2 ab10 abb abe

one can see that zsh is correct, but Coreutils has an issue with the
hyphen-minus character.
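
For illustration, the ordering expected above (the one zsh gives with the
(n) glob qualifier in the C locale) can be sketched in Python as a plain
character-by-character comparison in which only runs of ASCII digits are
compared as integers. This is a minimal sketch under that assumption, not
the code of zsh or coreutils; the name natural_cmp is hypothetical.

```python
from functools import cmp_to_key

ASCII_DIGITS = "0123456789"

def natural_cmp(a: str, b: str) -> int:
    """Compare code point by code point, but treat digit runs as integers."""
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] in ASCII_DIGITS and b[j] in ASCII_DIGITS:
            # Both strings are at a digit run: compare the runs numerically.
            start_i, start_j = i, j
            while i < len(a) and a[i] in ASCII_DIGITS:
                i += 1
            while j < len(b) and b[j] in ASCII_DIGITS:
                j += 1
            x, y = int(a[start_i:i]), int(b[start_j:j])
            if x != y:
                return -1 if x < y else 1
        elif a[i] != b[j]:
            # Ordinary characters: code point order (same as LC_ALL=C for ASCII).
            return -1 if a[i] < b[j] else 1
        else:
            i += 1
            j += 1
    # Whichever string still has characters left sorts later.
    return (len(a) - i > 0) - (len(b) - j > 0)

names = ["ab-cd", "ab10", "ab2", "abb", "abe"]
ordered = sorted(names, key=cmp_to_key(natural_cmp))
print(ordered)  # ['ab-cd', 'ab2', 'ab10', 'abb', 'abe']
```

In this sketch the hyphen-minus keeps its ASCII position below the digits
and letters, which is why "ab-cd" comes first.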

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Wed, 26 Jun 2019 18:26:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Vincent Lefevre <vincent <at> vinc17.net>, bug-coreutils <at> gnu.org
Cc: Ian Jackson <ijackson <at> chiark.greenend.org.uk>
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Wed, 26 Jun 2019 12:25:26 -0600

(Adding Ian Jackson for dpkg/debian-version details)

Hello,

On Tue, May 28, 2019 at 02:53:39AM +0200, Vincent Lefevre wrote:
> With GNU coreutils 8.30 under Debian/unstable, I get:
> 
> $ LC_ALL=C ls
> ab-cd  abb  abe
> $ LC_ALL=C ls -v
> abb  abe  ab-cd
> 
> The hyphen-minus character should still be regarded as being less
> than the letters (there are no digits, so both are expected to be
> equivalent). The GNU coreutils manual says:
> 
[...]

Thanks for the report and the clear details.

To summarize,
"ls -v" and "sort -V" (coreutils' version sort) behave differently from
other implementations with regard to the minus character:

    $ printf "%s\n" abb ab-cd | sort -V
    abb
    ab-cd

    $ v1="abb"
    $ v2="ab-cd"
    $ dpkg --compare-versions "$v1" lt "$v2" && printf "$v1\n$v2\n" || printf "$v2\n$v1\n"
    ab-cd
    abb

If I understand correctly, the reason is that in Debian's version
comparison algorithm [1], the minus character has a special meaning: it
separates the "upstream version" part from the "debian revision" part.

In Debian's implementation [2], a version string is first split into three
parts (epoch, upstream version, debian revision) using ":" as the epoch
delimiter and "-" as the revision delimiter. Only then are the three parts
compared, separately [3].

[1] https://www.debian.org/doc/debian-policy/ch-controlfields.html#version
[2] https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/parsehelp.c#n191
[3] https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/version.c#n140
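
As a rough illustration of that splitting step, here is a minimal Python
sketch. It assumes the rule from the Debian policy manual [1]: the epoch
ends at the first colon and the revision starts after the last hyphen. The
function name is hypothetical, and the sketch does none of the validation
that dpkg itself performs.

```python
def split_debian_version(version: str):
    """Split a version string into (epoch, upstream, revision) strings."""
    epoch, colon, rest = version.partition(":")
    if not colon:
        # No epoch given: it is assumed to be zero.
        epoch, rest = "0", version
    upstream, hyphen, revision = rest.rpartition("-")
    if not hyphen:
        # No hyphen: the whole remainder is the upstream version.
        upstream, revision = rest, ""
    return epoch, upstream, revision

print(split_debian_version("ab-cd"))  # ('0', 'ab', 'cd')
print(split_debian_version("abb"))    # ('0', 'abb', '')
```

Under this split, "ab-cd" contributes only "ab" as its upstream part, which
is what is then compared against "abb".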

On the other hand, coreutils' implementation (from gnulib [4]) does not
break the version string into three parts - it treats the entire string as a
single "upstream version" part.
The rules for sorting the "upstream version" string say:

  "... The lexical comparison is a comparison of ASCII values modified so
  that all the letters sort earlier than all the non-letters and so that a
  tilde sorts before anything" (from [1])

[4] https://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/filevercmp.c

Therefore, dpkg first separates "ab" from "cd", then compares "ab" to
"abb" - and "ab" comes first.
Coreutils compares "ab-cd" to "abb" (or technically, just "ab-" to
"abb"), and because "letters sort earlier than all non-letters", "abb"
comes first.
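
To make the character ordering concrete, here is a minimal Python sketch of
the "upstream version" comparison as the policy text quoted above describes
it: non-digit stretches are compared with letters first, then other
characters, with the tilde before everything, and digit runs are compared
as integers. It is an illustration under those assumptions, not the gnulib
filevercmp code (which also treats suffixes and hidden files specially);
the names char_order and upstream_cmp are hypothetical.

```python
from functools import cmp_to_key

ASCII_DIGITS = "0123456789"

def char_order(c: str) -> int:
    """Rank of one character in the non-digit part of the comparison."""
    if c == "" or c in ASCII_DIGITS:
        return 0                 # end of string, or start of a digit run
    if c == "~":
        return -1                # tilde sorts before anything, even the end
    if c.isascii() and c.isalpha():
        return ord(c)            # letters keep their ASCII order
    return ord(c) + 256          # every other character sorts after letters

def upstream_cmp(a: str, b: str) -> int:
    i = j = 0
    while i < len(a) or j < len(b):
        # Compare the non-digit stretch one character at a time.
        while (i < len(a) and a[i] not in ASCII_DIGITS) or \
              (j < len(b) and b[j] not in ASCII_DIGITS):
            ac = char_order(a[i] if i < len(a) else "")
            bc = char_order(b[j] if j < len(b) else "")
            if ac != bc:
                return -1 if ac < bc else 1
            i += 1
            j += 1
        # Compare the following digit runs as integers (empty run counts as 0).
        start_i, start_j = i, j
        while i < len(a) and a[i] in ASCII_DIGITS:
            i += 1
        while j < len(b) and b[j] in ASCII_DIGITS:
            j += 1
        x, y = int(a[start_i:i] or "0"), int(b[start_j:j] or "0")
        if x != y:
            return -1 if x < y else 1
    return 0

ordered = sorted(["ab-cd", "abb", "abe"], key=cmp_to_key(upstream_cmp))
print(ordered)  # ['abb', 'abe', 'ab-cd']
```

With this ranking, "b" is compared against "-" at the third position and
wins because it is a letter, which reproduces the order reported for
"ls -v".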

I hope this helps explain the differences (I also hope this explanation is
correct, and I invite others to chime in).


regards,
 - assaf





Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Wed, 26 Jun 2019 19:58:01 GMT)

Message #11 received at 35939 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: vincent <at> vinc17.net
Cc: 35939 <at> debbugs.gnu.org, Assaf Gordon <assafgordon <at> gmail.com>,
 Ian Jackson <ijackson <at> chiark.greenend.org.uk>
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Wed, 26 Jun 2019 12:57:14 -0700

GNU sort uses the same algorithm as glibc strverscmp, and this algorithm has
changed only once since strverscmp was added to glibc in 1997. The change was 
made in 2009, to fix this bug:

https://sourceware.org/bugzilla/show_bug.cgi?id=9913

Has the Debian version-comparison algorithm changed since 1997? If so, could you 
give details about the changes to the Debian algorithm? Perhaps glibc should be 
changed to stay consistent with Debian.




Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Wed, 26 Jun 2019 20:10:02 GMT)

Message #14 received at submit <at> debbugs.gnu.org:

From: Ian Jackson <ijackson <at> chiark.greenend.org.uk>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: bug-coreutils <at> gnu.org, Vincent Lefevre <vincent <at> vinc17.net>
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Wed, 26 Jun 2019 21:09:14 +0100

Assaf Gordon writes ("Re: bug#35939: version sort is incorrect with hyphen-minus"):
> Thanks for the report and the clear details.

Hi.  I haven't read the original report, but everything you say about
the behaviour of GNU coreutils and dpkg sounds correct.

This is perhaps an unfortunate wrinkle but I think it is right of
coreutils to use the "upstream part" of the dpkg algorithm.

> I hope this helps explain the differences (I also hope this explanation is
> correct, and I invite others to chime in).

I wonder if this could go in some manual somewhere.

Regards,
Ian.

-- 
Ian Jackson <ijackson <at> chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.




Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Wed, 26 Jun 2019 20:16:02 GMT)

Message #17 received at submit <at> debbugs.gnu.org:

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: bug-coreutils <at> gnu.org, Ian Jackson <ijackson <at> chiark.greenend.org.uk>
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Wed, 26 Jun 2019 22:14:54 +0200

On 2019-06-26 12:25:26 -0600, Assaf Gordon wrote:
> "ls -v" and "sort -V" (coreutils' version sort) behave differently from
> other implementations with regard to the minus character:
> 
>     $ printf "%s\n" abb ab-cd | sort -V
>     abb
>     ab-cd
> 
>     $ v1="abb"
>     $ v2="ab-cd"
>     $ dpkg --compare-versions "$v1" lt "$v2" && printf "$v1\n$v2\n" || printf "$v2\n$v1\n"
>     ab-cd
>     abb
> 
> If I understand correctly, the reason is that in Debian's version
> comparison algorithm [1], the minus character has a special meaning: it
> separates the "upstream version" part from the "debian revision" part.

Note that I'm not using "ls -v" to sort version numbers, just
filenames (which can contain integers in decimal notation).

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Wed, 26 Jun 2019 21:26:02 GMT)

Message #20 received at 35939 <at> debbugs.gnu.org:

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 35939 <at> debbugs.gnu.org, Assaf Gordon <assafgordon <at> gmail.com>,
 Ian Jackson <ijackson <at> chiark.greenend.org.uk>
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Wed, 26 Jun 2019 23:25:44 +0200

On 2019-06-26 12:57:14 -0700, Paul Eggert wrote:
> GNU sort uses the same algorithm as glibc strverscmp, and this algorithm has
> changed only once since strverscmp was added to glibc in 1997. The change
> was made in 2009, to fix this bug:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=9913

Except that this bug report is wrong. But I've checked that
both "ls -v" and "sort -V" give the expected ordering on the
given example in this bug report:

zira% ls -1v                
B007502280067.gbp.corp.com
B007502357019.GBP.CORP.COM
B0075022800016.gbp.corp.com

zira% printf "%s\n" * | sort -V
B007502280067.gbp.corp.com
B007502357019.GBP.CORP.COM
B0075022800016.gbp.corp.com

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Wed, 26 Jun 2019 23:02:01 GMT)

Message #23 received at 35939 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 35939 <at> debbugs.gnu.org, vincent <at> vinc17.net,
 Ian Jackson <ijackson <at> chiark.greenend.org.uk>
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Wed, 26 Jun 2019 17:01:29 -0600

Hello Paul,

On Wed, Jun 26, 2019 at 12:57:14PM -0700, Paul Eggert wrote:
> GNU sort uses the same algorithm as glibc strverscmp,

I think that both sort and ls use 'filevercmp' - a simplified version
that does not support locales (and doesn't fail).

The change (from 'strverscmp') was made in:

  commit e505736f8211a608b00dfe75fb186a5211e1a183
  Author: Kamil Dudka <kdudka <at> redhat.com>
  Date:   Fri Oct 3 11:03:40 2008 +0200
  ls and sort: use filevercmp instead of strverscmp
  https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=e505736f8211a608b00dfe75fb186a5211e1a183

> Has the Debian version-comparison algorithm changed since 1997? If so, could
> you give details about the changes to the Debian algorithm?

I don't think the algorithm changed in Debian,
and also in gnulib there are only a handful of relevant commits, all 10
years old:

  9121662f1 2008-10-03 filevercmp: new module
  0443c2f39 2009-03-05 filevercmp: Move hidden files up in ordering.
  1721cf06d 2009-03-24 filevercmp: handle simple~ and numbered.~3~ backup suffixes
  4fd008794 2009-04-09 filevercmp: fix regression
  cc96df30d 2009-04-09 filevercmp: correct today's change

I think (also based on Ian's confirmation) that this discrepancy has been
there from the beginning.

I now notice that there's an additional difference: coreutils/gnulib has
special handling for extensions, hidden files and backup files.

As Ian wrote, a documentation improvement is probably the best fix.
I'll try to come up with a suggested change.

-assaf

P.S.

For completeness, here are a few other threads with details/explanations
about 'version-sort':
https://bugs.gnu.org/18168
https://bugs.gnu.org/22275
https://bugs.gnu.org/22455
https://bugs.gnu.org/33786




Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Wed, 26 Jun 2019 23:50:02 GMT)

Message #26 received at 35939 <at> debbugs.gnu.org:

From: Ian Jackson <ijackson <at> chiark.greenend.org.uk>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 35939 <at> debbugs.gnu.org, Assaf Gordon <assafgordon <at> gmail.com>,
 vincent <at> vinc17.net
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Thu, 27 Jun 2019 00:49:19 +0100

Paul Eggert writes ("Re: bug#35939: version sort is incorrect with hyphen-minus"):
> GNU sort uses the same algorithm as glibc strverscmp, and this algorithm has 
> changed only once since strverscmp was added to glibc in 1997. The change was 
> made in 2009, to fix this bug:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=9913
> 
> Has the Debian version-comparison algorithm changed since 1997? If so, could you 
> give details about the changes to the Debian algorithm? Perhaps glibc should be 
> changed to stay consistent with Debian.

Debian introduced a special (and very useful) meaning for ~, many
years ago now.

I checked the Debian policy manual and according to its upgrading
checklist this change was made in 2007.

Ian.

-- 
Ian Jackson <ijackson <at> chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.




Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Wed, 26 Jun 2019 23:55:02 GMT)

Message #29 received at 35939 <at> debbugs.gnu.org:

From: Ian Jackson <ijackson <at> chiark.greenend.org.uk>
To: Paul Eggert <eggert <at> cs.ucla.edu>, vincent <at> vinc17.net,
 Assaf Gordon <assafgordon <at> gmail.com>, 35939 <at> debbugs.gnu.org
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Thu, 27 Jun 2019 00:54:20 +0100

Ian Jackson writes ("Re: bug#35939: version sort is incorrect with hyphen-minus"):
> Paul Eggert writes ("Re: bug#35939: version sort is incorrect with hyphen-minus"):
> > GNU sort uses the same algorithm as glibc strverscmp, and this algorithm has 
> > changed only once since strverscmp was added to glibc in 1997. The change was 
> > made in 2009, to fix this bug:
> > 
> > https://sourceware.org/bugzilla/show_bug.cgi?id=9913
> > 
> > Has the Debian version-comparison algorithm changed since 1997? If so, could you 
> > give details about the changes to the Debian algorithm? Perhaps glibc should be 
> > changed to stay consistent with Debian.
> 
> Debian introduced a special (and very useful) meaning for ~, many
> years ago now.
> 
> I checked the Debian policy manual and according to its upgrading
> checklist this change was made in 2007.

I have just checked the manpage I have here for strverscmp and it is
far from clear to me that the algorithm described there, and the dpkg
algorithm, produce the same answers.  (Even disregarding ~, and the
fact that the specification of the dpkg algorithm is defined only over
a subset of possible strings even though the unique extension to UTF-8
strings is fairly obvious.)

-- 
Ian Jackson <ijackson <at> chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.




Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Thu, 27 Jun 2019 01:41:02 GMT)

Message #32 received at 35939 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Ian Jackson <ijackson <at> chiark.greenend.org.uk>, vincent <at> vinc17.net,
 Assaf Gordon <assafgordon <at> gmail.com>, 35939 <at> debbugs.gnu.org
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Wed, 26 Jun 2019 18:40:50 -0700

Thanks for looking into this. Sorry about my confusion between
strverscmp and filevercmp. As this bug report appears to be about 
filevercmp, glibc is not involved; it's only Gnulib and the utilities 
using Gnulib's filevercmp module.

As I now understand it, Gnulib filevercmp is intended to be consistent 
with Debian's version comparison (this is documented in filevercmp.c), 
so GNU Bug#35939 is therefore based on a misunderstanding, as Gnulib 
filevercmp is implementing the Debian spec correctly for this test case.

Perhaps the coreutils manual could be improved to make this all clearer, 
and perhaps it should refer to the Debian manual if it doesn't already.





Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Thu, 27 Jun 2019 09:37:02 GMT)

Message #35 received at 35939 <at> debbugs.gnu.org:

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 35939 <at> debbugs.gnu.org, Assaf Gordon <assafgordon <at> gmail.com>,
 Ian Jackson <ijackson <at> chiark.greenend.org.uk>
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Thu, 27 Jun 2019 11:36:40 +0200

On 2019-06-26 18:40:50 -0700, Paul Eggert wrote:
> Perhaps the coreutils manual could be improved to make this all clearer, and
> perhaps it should refer to the Debian manual if it doesn't already.

In this case, there should be a new ordering option to provide
true numeric sort with strings mixing non-negative integers and
characters.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Thu, 27 Jun 2019 10:26:01 GMT)

Message #38 received at 35939 <at> debbugs.gnu.org:

From: Ian Jackson <ijackson <at> chiark.greenend.org.uk>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: Assaf Gordon <assafgordon <at> gmail.com>, Paul Eggert <eggert <at> cs.ucla.edu>,
 35939 <at> debbugs.gnu.org
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Thu, 27 Jun 2019 11:25:01 +0100

Vincent Lefevre writes ("Re: bug#35939: version sort is incorrect with hyphen-minus"):
> On 2019-06-26 18:40:50 -0700, Paul Eggert wrote:
> > Perhaps the coreutils manual could be improved to make this all clearer, and
> > perhaps it should refer to the Debian manual if it doesn't already.
> 
> In this case, there should be a new ordering option to provide
> true numeric sort with strings mixing non-negative integers and
> characters.

I think the Debian algorithm is such an algorithm, but it has a
wrinkle which you are not expecting.  Here is the specification:
  https://www.debian.org/doc/debian-policy/ch-controlfields.html#version

Note in particular
  | The lexical comparison is a comparison of ASCII values modified so
  | that all the letters sort earlier than all the non-letters and so
  | that a tilde sorts before anything, even the end of a part

So in the Debian algorithm, `-' sorts after `a'.  I specified this
rule.  I did it mainly because of versions like `1.0beta3', which is
probably a prerelease of `1.0' and therefore earlier than `1.0.3'.
So `b' has to sort before `.' and my rule seemed the simplest one to
achieve that.  (The version comparison algorithm is a tradeoff between
complexity, and breadth of support for people's then-existing
practices.)  Nowadays Debian invariably writes `1.0~beta3' but when I
invented this scheme I did not include the (invaluable) `~' feature.

When this is extended to UTF-8, presumably the ordering should be an
ordering of unicode scalar values, with the rule about letters
interpreted as referring to anything which Unicode considers a letter.

If you want to test the Debian algorithm and have access to a copy of
dpkg, you can append -1 to both strings to be the "Debian revision",
and prepend "1:" to be the "epoch", and then the middle part should be
compared the same way as sort -V etc.

Vincent, what is your use case for a comparison algorithm which is
like the Debian one but which sorts letters after punctuation ?

Ian.

-- 
Ian Jackson <ijackson <at> chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.




Information forwarded to bug-coreutils <at> gnu.org:
bug#35939; Package coreutils. (Fri, 28 Jun 2019 19:19:02 GMT)

Message #41 received at 35939 <at> debbugs.gnu.org:

From: Florian Weimer <fweimer <at> redhat.com>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 35939 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>,
 Assaf Gordon <assafgordon <at> gmail.com>,
 Ian Jackson <ijackson <at> chiark.greenend.org.uk>
Subject: Re: bug#35939: version sort is incorrect with hyphen-minus
Date: Fri, 28 Jun 2019 21:18:26 +0200

* Vincent Lefevre:

> On 2019-06-26 18:40:50 -0700, Paul Eggert wrote:
>> Perhaps the coreutils manual could be improved to make this all clearer, and
>> perhaps it should refer to the Debian manual if it doesn't already.
>
> In this case, there should be a new ordering option to provide
> true numeric sort with strings mixing non-negative integers and
> characters.

There's no one true numeric sort.  Some versioning schemes interpret
numbers after a dot as decimal fractions (so that 1.9 > 1.10), but it's
more common to split version strings into tuples somehow and then sort
the non-numeric parts lexicographically, and the numeric parts as
integers (so that 1.9 < 1.10).
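
A small Python sketch of the two interpretations, for purely numeric
dotted strings only. The function names are hypothetical, and neither
function is meant to model any particular tool.

```python
from decimal import Decimal

def as_decimal_fraction(version: str) -> Decimal:
    """Read "1.10" as the decimal number 1.10 (so it equals 1.1)."""
    return Decimal(version)

def as_integer_tuple(version: str) -> tuple:
    """Read "1.10" as the tuple (1, 10), each component an integer."""
    return tuple(int(part) for part in version.split("."))

# Decimal-fraction reading: 1.9 > 1.10
print(as_decimal_fraction("1.9") > as_decimal_fraction("1.10"))  # True
# Tuple-of-integers reading: 1.9 < 1.10
print(as_integer_tuple("1.9") < as_integer_tuple("1.10"))        # True
```

The same pair of strings sorts in opposite orders under the two readings,
so any "numeric" sort has to pick one.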

Thanks,
Florian




This bug report was last modified 4 years and 303 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.