GNU bug report logs -
#35939
version sort is incorrect with hyphen-minus
To reply to this bug, email your comments to 35939 AT debbugs.gnu.org.
Report forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Tue, 28 May 2019 00:55:01 GMT)
Acknowledgement sent to Vincent Lefevre <vincent <at> vinc17.net>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 28 May 2019 00:55:01 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
With GNU coreutils 8.30 under Debian/unstable, I get:
$ LC_ALL=C ls
ab-cd abb abe
$ LC_ALL=C ls -v
abb abe ab-cd
The hyphen-minus character should still be regarded as being less
than the letters (there are no digits, so both are expected to be
equivalent). The GNU coreutils manual says:
10.1.3 Sorting the output
-------------------------
[...]
‘-v’
‘--sort=version’
Sort by version name and number, lowest first. It behaves like a
default sort, except that each sequence of decimal digits is
treated numerically as an index/version number. (*Note Details
about version sort::.)
(which is exactly what I expect).
The "sort -V" command has the same issue.
Note: If I add two more files and compare with zsh:
zira% export LC_ALL=C
zira% ls
ab-cd ab10 ab2 abb abe
zira% ls -v
ab2 ab10 abb abe ab-cd
zira% echo *
ab-cd ab10 ab2 abb abe
zira% echo *(n)
ab-cd ab2 ab10 abb abe
one can see that zsh is correct, but Coreutils has an issue with the
hyphen-minus character.
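The ordering expected here can be written down as a sort key. The following is a minimal Python sketch of that expectation, not the implementation used by zsh or coreutils. The function name `numeric_glob_key` is hypothetical, and the sketch assumes ASCII names and ignores leading zeros. Each run of decimal digits is ordered by its value and placed where the digit characters sit in byte order; every other character is ordered by its code point.

```python
import re

def numeric_glob_key(name):
    """Sort key: byte order, except digit runs compare by numeric value."""
    key = []
    for token in re.findall(r"[0-9]+|[^0-9]", name):
        if "0" <= token[0] <= "9":
            # A digit run is placed at the position of '0' in byte order
            # and then ordered by its integer value.
            key.append((ord("0"), int(token)))
        else:
            # Any other character is ordered by its code point alone.
            key.append((ord(token), 0))
    return key

names = ["abe", "ab10", "abb", "ab2", "ab-cd"]
print(sorted(names, key=numeric_glob_key))
# ['ab-cd', 'ab2', 'ab10', 'abb', 'abe']
```

With this key the hyphen-minus (0x2d) stays below both the digits and the letters, which matches the `echo *(n)` output above.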
--
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Wed, 26 Jun 2019 18:26:02 GMT)
Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):
(Adding Ian Jackson for dpkg/debian-version details)
Hello,
On Tue, May 28, 2019 at 02:53:39AM +0200, Vincent Lefevre wrote:
> With GNU coreutils 8.30 under Debian/unstable, I get:
>
> $ LC_ALL=C ls
> ab-cd abb abe
> $ LC_ALL=C ls -v
> abb abe ab-cd
>
> The hyphen-minus character should still be regarded as being less
> than the letters (there are no digits, so both are expected to be
> equivalent). The GNU coreutils manual says:
>
[...]
Thanks for the report and the clear details.
To summarize,
"ls -v" and "sort -V" (coreutils' version sort) behave differently from
other implementations with regard to the minus character:
$ printf "%s\n" abb ab-cd | sort -V
abb
ab-cd
$ v1="abb"
$ v2="ab-cd"
$ dpkg --compare-versions "$v1" lt "$v2" && printf "$v1\n$v2\n" || printf "$v2\n$v1\n"
ab-cd
abb
If I understand correctly, the reason is that in Debian's version
comparison algorithm [1], the minus character has a special meaning: it
separates the "upstream version" part from the "debian revision" part.
In Debian's implementation [2], a version string is first split into three
parts (epoch, upstream version, debian revision), using ":" as the epoch
delimiter and "-" as the revision delimiter. Only then are the three parts
compared, separately [3].
[1] https://www.debian.org/doc/debian-policy/ch-controlfields.html#version
[2] https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/parsehelp.c#n191
[3] https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/version.c#n140
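The three-way split described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not dpkg's code: the function name `split_debian_version` is hypothetical, no validation is performed, the epoch is taken to end at the first colon and the revision to start after the last hyphen, and a missing epoch is reported as "0".

```python
def split_debian_version(version):
    """Split a string into (epoch, upstream, revision)."""
    # The epoch, when present, ends at the first colon.
    epoch, colon, rest = version.partition(":")
    if not colon:
        epoch, rest = "0", version
    # The revision, when present, starts after the last hyphen.
    upstream, hyphen, revision = rest.rpartition("-")
    if not hyphen:
        upstream, revision = rest, ""
    return epoch, upstream, revision

print(split_debian_version("ab-cd"))      # ('0', 'ab', 'cd')
print(split_debian_version("abb"))        # ('0', 'abb', '')
print(split_debian_version("1:ab-cd-1"))  # ('1', 'ab-cd', '1')
```

Under this split the upstream part of "ab-cd" is just "ab", while in the last example the hyphen inside "ab-cd" stays in the upstream part because a later hyphen introduces the revision.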
On the other hand, coreutils' implementation (from gnulib [4]) does not
break the version string into three parts - it treats the entire string as a
single "upstream version" part.
The rules for sorting the "upstream version" string say:
"... The lexical comparison is a comparison of ASCII values modified so
that all the letters sort earlier than all the non-letters and so that a
tilde sorts before anything" (from [1])
[4] https://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/filevercmp.c
Therefore, dpkg first separates "ab" from "cd", then compares "ab" to
"abb", and "ab" comes first.
Coreutils compares "ab-cd" to "abb" (or technically, just "ab-" to
"abb"), and because "letters sort earlier than all non-letters", "abb"
comes first.
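The rule for the "upstream version" part can be sketched in Python as follows. This is a rough model of the rule as quoted from the Debian policy, not gnulib's filevercmp or dpkg's code: the names `char_rank`, `compare_text` and `upstream_cmp` are hypothetical, and the sketch ignores any special cases that a real implementation may add on top of the rule.

```python
import re
import string
from functools import cmp_to_key

def char_rank(c):
    """Rank one character; the empty string stands for the end of a part."""
    if c == "~":
        return -1            # a tilde sorts before anything, even the end
    if c == "":
        return 0
    if c in string.ascii_letters:
        return ord(c)        # letters keep their ASCII order
    return ord(c) + 256      # non-letters sort after all letters

def compare_text(a, b):
    """Compare two digit-free parts character by character."""
    for i in range(max(len(a), len(b))):
        diff = char_rank(a[i:i + 1]) - char_rank(b[i:i + 1])
        if diff:
            return diff
    return 0

def upstream_cmp(a, b):
    """Return a negative, zero or positive number, like a C comparator."""
    while a or b:
        # Compare the leading non-digit parts with the modified order.
        text_a = re.match(r"[^0-9]*", a).group()
        text_b = re.match(r"[^0-9]*", b).group()
        diff = compare_text(text_a, text_b)
        if diff:
            return diff
        a, b = a[len(text_a):], b[len(text_b):]
        # Compare the leading digit parts by value; empty counts as 0.
        digits_a = re.match(r"[0-9]*", a).group()
        digits_b = re.match(r"[0-9]*", b).group()
        diff = int(digits_a or "0") - int(digits_b or "0")
        if diff:
            return diff
        a, b = a[len(digits_a):], b[len(digits_b):]
    return 0

names = ["ab-cd", "abb", "abe"]
print(sorted(names, key=cmp_to_key(upstream_cmp)))
# ['abb', 'abe', 'ab-cd']
```

The third character decides the comparison: "b" is a letter and keeps its ASCII rank, while "-" is a non-letter and is pushed past all letters, so "abb" sorts before "ab-cd".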
I hope this helps explain the differences (I also hope this explanation is
correct, and I invite others to chime in).
regards,
- assaf
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Wed, 26 Jun 2019 19:58:01 GMT)
Message #11 received at 35939 <at> debbugs.gnu.org (full text, mbox):
GNU sort uses the same algorithm as glibc strverscmp, and this algorithm has
changed only once since strverscmp was added to glibc in 1997. The change was
made in 2009, to fix this bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=9913
Has the Debian version-comparison algorithm changed since 1997? If so, could you
give details about the changes to the Debian algorithm? Perhaps glibc should be
changed to stay consistent with Debian.
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Wed, 26 Jun 2019 20:10:02 GMT)
Message #14 received at submit <at> debbugs.gnu.org (full text, mbox):
Assaf Gordon writes ("Re: bug#35939: version sort is incorrect with hyphen-minus"):
> Thanks for the report and the clear details.
Hi. I haven't read the original report, but everything you say about
the behaviour of GNU coreutils and dpkg sounds correct.
This is perhaps an unfortunate wrinkle but I think it is right of
coreutils to use the "upstream part" of the dpkg algorithm.
> I hope this helps explain the differences (I also hope this explanation is
> correct, and I invite others to chime in).
I wonder if this could go in some manual somewhere.
Regards,
Ian.
--
Ian Jackson <ijackson <at> chiark.greenend.org.uk> These opinions are my own.
If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Wed, 26 Jun 2019 20:16:02 GMT)
Message #17 received at submit <at> debbugs.gnu.org (full text, mbox):
On 2019-06-26 12:25:26 -0600, Assaf Gordon wrote:
> "ls -v" and "sort -V" (coreutils' version sort) behave differently from
> other implementations with regard to the minus character:
>
> $ printf "%s\n" abb ab-cd | sort -V
> abb
> ab-cd
>
> $ v1="abb"
> $ v2="ab-cd"
> $ dpkg --compare-versions "$v1" lt "$v2" && printf "$v1\n$v2\n" || printf "$v2\n$v1\n"
> ab-cd
> abb
>
> If I understand correctly,
> The reason is that in Debian's version comparison algorithm [1], the minus
> character has a special meaning: it separates the "upstream version"
> part from the "debian revision" part.
Note that I'm not using "ls -v" to sort version numbers, just
filenames (which can contain integers in decimal notation).
--
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Wed, 26 Jun 2019 21:26:02 GMT)
Message #20 received at 35939 <at> debbugs.gnu.org (full text, mbox):
On 2019-06-26 12:57:14 -0700, Paul Eggert wrote:
> GNU sort uses the same algorithm as glibc strverscmp, and this algorithm has
> changed only once since strverscmp was added to glibc in 1997. The change
> was made in 2009, to fix this bug:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=9913
Except that this bug report is wrong. But I've checked that
both "ls -v" and "sort -V" give the expected ordering on the
given example in this bug report:
zira% ls -1v
B007502280067.gbp.corp.com
B007502357019.GBP.CORP.COM
B0075022800016.gbp.corp.com
zira% printf "%s\n" * | sort -V
B007502280067.gbp.corp.com
B007502357019.GBP.CORP.COM
B0075022800016.gbp.corp.com
--
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Wed, 26 Jun 2019 23:02:01 GMT)
Message #23 received at 35939 <at> debbugs.gnu.org (full text, mbox):
Hello Paul,
On Wed, Jun 26, 2019 at 12:57:14PM -0700, Paul Eggert wrote:
> GNU sort uses the same algorithm as glibc strverscmp,
I think that both sort and ls use 'filevercmp' - a simplified version
that does not support locales (and doesn't fail).
The change (from 'strverscmp') was made in:
commit e505736f8211a608b00dfe75fb186a5211e1a183
Author: Kamil Dudka <kdudka <at> redhat.com>
Date: Fri Oct 3 11:03:40 2008 +0200
ls and sort: use filevercmp instead of strverscmp
https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=e505736f8211a608b00dfe75fb186a5211e1a183
> Has the Debian version-comparison algorithm changed since 1997? If so, could
> you give details about the changes to the Debian algorithm?
I don't think the algorithm changed in Debian,
and also in gnulib there are only a handful of relevant commits, all 10
years old:
9121662f1 2008-10-03 filevercmp: new module
0443c2f39 2009-03-05 filevercmp: Move hidden files up in ordering.
1721cf06d 2009-03-24 filevercmp: handle simple~ and numbered.~3~ backup suffixes
4fd008794 2009-04-09 filevercmp: fix regression
cc96df30d 2009-04-09 filevercmp: correct today's change
I think (also based on Ian's confirmation) that this discrepancy has been
there from the beginning.
I now notice that there's an additional difference: coreutils/gnulib has
special handling for extensions, hidden files and backup files.
As Ian wrote, a documentation improvement is probably the best fix.
I'll try to come up with a suggested change.
-assaf
P.S.
For completeness, here are a few other threads with details/explanations
about 'version-sort':
https://bugs.gnu.org/18168
https://bugs.gnu.org/22275
https://bugs.gnu.org/22455
https://bugs.gnu.org/33786
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Wed, 26 Jun 2019 23:50:02 GMT)
Message #26 received at 35939 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert writes ("Re: bug#35939: version sort is incorrect with hyphen-minus"):
> GNU sort uses the same algorithm as glibc strverscmp, and this algorithm has
> changed only once since strverscmp was added to glibc in 1997. The change was
> made in 2009, to fix this bug:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=9913
>
> Has the Debian version-comparison algorithm changed since 1997? If so, could you
> give details about the changes to the Debian algorithm? Perhaps glibc should be
> changed to stay consistent with Debian.
Debian introduced a special (and very useful) meaning for ~, many
years ago now.
I checked the Debian policy manual and according to its upgrading
checklist this change was made in 2007.
Ian.
--
Ian Jackson <ijackson <at> chiark.greenend.org.uk> These opinions are my own.
If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Wed, 26 Jun 2019 23:55:02 GMT)
Message #29 received at 35939 <at> debbugs.gnu.org (full text, mbox):
Ian Jackson writes ("Re: bug#35939: version sort is incorrect with hyphen-minus"):
> Paul Eggert writes ("Re: bug#35939: version sort is incorrect with hyphen-minus"):
> > GNU sort uses the same algorithm as glibc strverscmp, and this algorithm has
> > changed only once since strverscmp was added to glibc in 1997. The change was
> > made in 2009, to fix this bug:
> >
> > https://sourceware.org/bugzilla/show_bug.cgi?id=9913
> >
> > Has the Debian version-comparison algorithm changed since 1997? If so, could you
> > give details about the changes to the Debian algorithm? Perhaps glibc should be
> > changed to stay consistent with Debian.
>
> Debian introduced a special (and very useful) meaning for ~, many
> years ago now.
>
> I checked the Debian policy manual and according to its upgrading
> checklist this change was made in 2007.
I have just checked the manpage I have here for strverscmp and it is
far from clear to me that the algorithm described there, and the dpkg
algorithm, produce the same answers. (Even disregarding ~, and the
fact that the specification of the dpkg algorithm is defined only over
a subset of possible strings even though the unique extension to UTF-8
strings is fairly obvious.)
--
Ian Jackson <ijackson <at> chiark.greenend.org.uk> These opinions are my own.
If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Thu, 27 Jun 2019 01:41:02 GMT)
Message #32 received at 35939 <at> debbugs.gnu.org (full text, mbox):
Thanks for looking into this. Sorry about my confusion between
strverscmp and filevercmp. As this bug report appears to be about
filevercmp, glibc is not involved; it's only Gnulib and the utilities
using Gnulib's filevercmp module.
As I now understand it, Gnulib filevercmp is intended to be consistent
with Debian's version comparison (this is documented in filevercmp.c),
so GNU Bug#35939 is therefore based on a misunderstanding, as Gnulib
filevercmp is implementing the Debian spec correctly for this test case.
Perhaps the coreutils manual could be improved to make this all clearer,
and perhaps it should refer to the Debian manual if it doesn't already.
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Thu, 27 Jun 2019 09:37:02 GMT)
Message #35 received at 35939 <at> debbugs.gnu.org (full text, mbox):
On 2019-06-26 18:40:50 -0700, Paul Eggert wrote:
> Perhaps the coreutils manual could be improved to make this all clearer, and
> perhaps it should refer to the Debian manual if it doesn't already.
In this case, there should be a new ordering option to provide
true numeric sort with strings mixing non-negative integers and
characters.
--
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Thu, 27 Jun 2019 10:26:01 GMT)
Message #38 received at 35939 <at> debbugs.gnu.org (full text, mbox):
Vincent Lefevre writes ("Re: bug#35939: version sort is incorrect with hyphen-minus"):
> On 2019-06-26 18:40:50 -0700, Paul Eggert wrote:
> > Perhaps the coreutils manual could be improved to make this all clearer, and
> > perhaps it should refer to the Debian manual if it doesn't already.
>
> In this case, there should be a new ordering option to provide
> true numeric sort with strings mixing non-negative integers and
> characters.
I think the Debian algorithm is such an algorithm, but it has a
wrinkle which you are not expecting. Here is the specification:
https://www.debian.org/doc/debian-policy/ch-controlfields.html#version
Note in particular
| The lexical comparison is a comparison of ASCII values modified so
| that all the letters sort earlier than all the non-letters and so
| that a tilde sorts before anything, even the end of a part
So in the Debian algorithm, `-' sorts after `a'. I specified this
rule. I did it mainly because of versions like `1.0beta3', which is
probably a prerelease of `1.0' and therefore earlier than `1.0.3'.
So `b' has to sort before `.' and my rule seemed the simplest one to
achieve that. (The version comparison algorithm is a tradeoff between
complexity, and breadth of support for people's then-existing
practices.) Nowadays Debian invariably writes `1.0~beta3' but when I
invented this scheme I did not include the (invaluable) `~' feature.
When this is extended to UTF-8, presumably the ordering should be an
ordering of unicode scalar values, with the rule about letters
interpreted as referring to anything which Unicode considers a letter.
If you want to test the Debian algorithm and have access to a copy of
dpkg, you can append -1 to both strings to be the "Debian revision",
and prepend "1:" to be the "epoch", and then the middle part should be
compared the same way as sort -V etc.
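The wrapping trick just described can be sketched as a small Python helper that only builds the command line. The helper name `dpkg_compare_argv` is hypothetical; `--compare-versions` and the `lt` relation are the ones already used earlier in this thread. Whether dpkg accepts a given string as a version is not checked here.

```python
def dpkg_compare_argv(a, b, relation="lt"):
    """Build a dpkg command that compares a and b as upstream versions."""
    def wrap(s):
        # Epoch "1:" in front and Debian revision "-1" behind, so that any
        # ':' or '-' inside s stays in the middle (upstream) part.
        return "1:" + s + "-1"
    return ["dpkg", "--compare-versions", wrap(a), relation, wrap(b)]

print(dpkg_compare_argv("abb", "ab-cd"))
# ['dpkg', '--compare-versions', '1:abb-1', 'lt', '1:ab-cd-1']
```

Where dpkg is installed, the list can be passed to `subprocess.run`; an exit status of 0 should indicate that the relation holds.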
Vincent, what is your use case for a comparison algorithm which is
like the Debian one but which sorts letters after punctuation?
Ian.
--
Ian Jackson <ijackson <at> chiark.greenend.org.uk> These opinions are my own.
If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.
Information forwarded to bug-coreutils <at> gnu.org: bug#35939; Package coreutils. (Fri, 28 Jun 2019 19:19:02 GMT)
Message #41 received at 35939 <at> debbugs.gnu.org (full text, mbox):
* Vincent Lefevre:
> On 2019-06-26 18:40:50 -0700, Paul Eggert wrote:
>> Perhaps the coreutils manual could be improved to make this all clearer, and
>> perhaps it should refer to the Debian manual if it doesn't already.
>
> In this case, there should be a new ordering option to provide
> true numeric sort with strings mixing non-negative integers and
> characters.
There's no one true numeric sort. Some versioning schemes interpret
numbers after a dot as decimal fractions (so that 1.9 > 1.10), but it's
more common to split version strings into tuples somehow and then sort
the non-numeric parts lexicographically, and the numeric parts as
integers (so that 1.9 < 1.10).
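The two readings give opposite answers for the same pair of strings. A minimal Python sketch, with the hypothetical helper names `as_decimal_fraction` and `as_integer_tuple`, and assuming versions that consist only of digits and dots:

```python
from decimal import Decimal

def as_decimal_fraction(version):
    """Read "1.10" as the decimal number 1.10, which equals 1.1."""
    return Decimal(version)

def as_integer_tuple(version):
    """Read "1.10" as the tuple (1, 10)."""
    return tuple(int(part) for part in version.split("."))

print(as_decimal_fraction("1.9") > as_decimal_fraction("1.10"))  # True
print(as_integer_tuple("1.9") < as_integer_tuple("1.10"))        # True
```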
Thanks,
Florian
This bug report was last modified 4 years and 303 days ago.