GNU bug report logs - #18540
Sorting bug?

Previous Next

Package: coreutils;

Reported by: Göran Uddeborg <goeran <at> uddeborg.se>

Date: Tue, 23 Sep 2014 20:32:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 18540 in the body.
You can then email your comments to 18540 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-coreutils <at> gnu.org:
bug#18540; Package coreutils. (Tue, 23 Sep 2014 20:32:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Göran Uddeborg <goeran <at> uddeborg.se>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 23 Sep 2014 20:32:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Göran Uddeborg <goeran <at> uddeborg.se>
To: bug-coreutils <at> gnu.org
Subject: Sorting bug?
Date: Tue, 23 Sep 2014 22:24:45 +0200
I discovered a behaviour of "sort" that looks like a bug to me.  When
one key in the input is an initial part of another key, the shorter
key is sorted first if the key is all there is on the line.  But if
there are other fields too, not included in the key, the order
changes.  That is true even with the --stable flag, so "sort" seems to
consider the order of the keys different in the two cases.

I sort in a non-C locale.  sv_SE.utf8 actually, but en_US.utf8 behaves
the same so I illustrate using that.

First case, the key is all there is on the line.  The shorter line
gets sorted earlier, regardless of input order:

    [göran <at> mimmi Hämtat]$ { echo 'binutils x86_64'; echo 'binutils-x86_64-linux-gnu x86_64'; } | LANG=en_US.utf8 sort --stable --debug --key=1,1 --field-separator=!
    sort: using ‘en_US.utf8’ sorting rules
    binutils x86_64
    _______________
    binutils-x86_64-linux-gnu x86_64
    ________________________________
    [göran <at> mimmi Hämtat]$ { echo 'binutils-x86_64-linux-gnu x86_64'; echo 'binutils x86_64'; } | LANG=en_US.utf8 sort --stable --debug --key=1,1 --field-separator=!
    sort: using ‘en_US.utf8’ sorting rules
    binutils x86_64
    _______________
    binutils-x86_64-linux-gnu x86_64
    ________________________________



Second case, the input lines contains a second field.  Now the longer
field gets sorted earlier, regardless of input order:

    [göran <at> mimmi Hämtat]$ { echo 'binutils x86_64!new'; echo 'binutils-x86_64-linux-gnu x86_64!new'; } | LANG=en_US.utf8 sort --stable --debug --key=1,1 --field-separator=!
    sort: using ‘en_US.utf8’ sorting rules
    binutils-x86_64-linux-gnu x86_64!new
    ________________________________
    binutils x86_64!new
    _______________
    [göran <at> mimmi Hämtat]$ { echo 'binutils-x86_64-linux-gnu x86_64!new'; echo 'binutils x86_64!new'; } | LANG=en_US.utf8 sort --stable --debug --key=1,1 --field-separator=!
    sort: using ‘en_US.utf8’ sorting rules
    binutils-x86_64-linux-gnu x86_64!new
    ________________________________
    binutils x86_64!new
    _______________


I can't see any reason for this.  Is it me not understanding sorting,
or is it actually a bug?




Information forwarded to bug-coreutils <at> gnu.org:
bug#18540; Package coreutils. (Tue, 23 Sep 2014 20:59:02 GMT) Full text and rfc822 format available.

Message #8 received at 18540 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Göran Uddeborg <goeran <at> uddeborg.se>,
 18540 <at> debbugs.gnu.org
Subject: Re: bug#18540: Sorting bug?
Date: Tue, 23 Sep 2014 14:58:01 -0600
[Message part 1 (text/plain, inline)]
On 09/23/2014 02:24 PM, Göran Uddeborg wrote:
> I discovered a behaviour of "sort" that looks like a bug to me.  When

Thanks for the report.  Most likely, it's not a bug in sort but in your
expectations, but let's analyze it as we go...

> one key in the input is an initial part of another key, the shorter
> key is sorted first if the key is all there is on the line.  But if
> there are other fields too, not included in the key, the order
> changes.  That is true even with the --stable flag, so "sort" seems to
> consider the order of the keys different in the two cases.
> 
> I sort in a non-C locale.  sv_SE.utf8 actually, but en_US.utf8 behaves
> the same so I illustrate using that.

Thanks for realizing that locale matters.  Not everyone does!

> 
> First case, the key is all there is on the line.  The shorter line
> gets sorted earlier, regardless of input order:
> 
>     [göran <at> mimmi Hämtat]$ { echo 'binutils x86_64'; echo 'binutils-x86_64-linux-gnu x86_64'; } | LANG=en_US.utf8 sort --stable --debug --key=1,1 --field-separator=!

Wow - someone actually using the --debug option!  Makes diagnosis a LOT
easier, doesn't it!

Makes sense so far; in your locale, space and - collate identically, and
the rest of the line is a common prefix up to the shorter length.  Then
the NUL byte that ends the shorter line sorts before the suffix of the
second line.

> 
> Second case, the input lines contains a second field.  Now the longer
> field gets sorted earlier, regardless of input order:
> 

Oh my, it looks like you have indeed found an issue.  First, I'm going
to try and whittle it down to something that fits in a narrower window:

$ printf 'a b\na-b-c\n' | LANG=en_US.utf8 sort -s --debug -k1,1 -t!sort:
using ‘en_US.utf8’ sorting rules
a b
___
a-b-c
_____
$ printf 'a b!x\na-b-c!x\n' | LANG=en_US.utf8 sort -s --debug -k1,1 -t!
sort: using ‘en_US.utf8’ sorting rules
a-b-c!x
_____
a b!x
___


And of course, switching to the C locale makes the problem disappear,
even when I munge the prefix to be identical (rather than merely
collating identically).

> I can't see any reason for this.  Is it me not understanding sorting,
> or is it actually a bug?

Let's look further:

$ printf 'a b\na-b-c\n' | LANG=en_US.utf8 ltrace -e strcoll sort -s
--debug -k1,1 -t!
sort: using ‘en_US.utf8’ sorting rules
sort->strcoll("a b", "a-b-c")                    = -1
a b
___
a-b-c
_____
+++ exited (status 0) +++

ltrace says that we are indeed using strcoll(), and on the short form,
we are comparing the entire line.

Then on the longer form,

$ printf 'a b!x\na-b-c!x\n' | LANG=en_US.utf8 ltrace -e strcoll sort -s
--debug -k1,1 -t!
sort: using ‘en_US.utf8’ sorting rules
sort->strcoll("a b!x", "a-b-c!x")                = 21
a-b-c!x
_____
a b!x
___
+++ exited (status 0) +++


Huh? Why are we passing the ENTIRE line to strcoll?  Shouldn't we only
be passing the key?

Count yourself lucky - you may have actually found a bug! Very few
people can claim to find sort bugs (most reports are due to faulty user
expectations).  I'm still not sure where the code is going wrong, but it
indeed looks like something we need to fix.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Tue, 23 Sep 2014 21:37:01 GMT) Full text and rfc822 format available.

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Tue, 23 Sep 2014 21:37:02 GMT) Full text and rfc822 format available.

Notification sent to Göran Uddeborg <goeran <at> uddeborg.se>:
bug acknowledged by developer. (Tue, 23 Sep 2014 21:37:03 GMT) Full text and rfc822 format available.

Message #15 received at 18540-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Göran Uddeborg <goeran <at> uddeborg.se>,
 18540-done <at> debbugs.gnu.org
Subject: Re: bug#18540: Sorting bug?
Date: Tue, 23 Sep 2014 15:36:53 -0600
[Message part 1 (text/plain, inline)]
tag 18540 notabug
thanks

On 09/23/2014 02:58 PM, Eric Blake wrote:

> Let's look further:
> 

> $ printf 'a b!x\na-b-c!x\n' | LANG=en_US.utf8 ltrace -e strcoll sort -s
> --debug -k1,1 -t!
> sort: using ‘en_US.utf8’ sorting rules
> sort->strcoll("a b!x", "a-b-c!x")                = 21
> a-b-c!x
> _____
> a b!x
> ___
> +++ exited (status 0) +++

Hmm, I just noticed something.

> 
> 
> Huh? Why are we passing the ENTIRE line to strcoll?  Shouldn't we only
> be passing the key?

That was my distro's build of sort (in my case, Fedora 20, with sort
from GNU coreutils 8.21).  But looking at coreutils.git (v8.23-39-g1ff4d08),

$ printf 'a b!x\na-b-c!x\n' | LANG=en_US.utf8 ltrace -e strcoll
./src/sort -s --debug -k1,1 -t!
./src/sort: using ‘en_US.utf8’ sorting rules
sort->strcoll("a b", "a-b-c")                    = -1
a b!x
___
a-b-c!x
_____
+++ exited (status 0) +++

Yay - strcoll now uses the correct bounds.  Next step - determining if
this is an upstream problem that was fixed in the interim, or if this is
a bug in the downstream additions on top of stock upstream.  None of the
9 commits in 'git shortlog v8.21.. src/sort.c' seem to describe the
situation.

And looking at my distro's patches, there is definitely some gorp added
to sort.c in coreutils-i18n.patch, which I highly suspect to be the root
cause.

So please re-raise this as a downstream bug in your distro's i18n patch,
as upstream coreutils is immune.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#18540; Package coreutils. (Wed, 24 Sep 2014 17:42:01 GMT) Full text and rfc822 format available.

Message #18 received at 18540 <at> debbugs.gnu.org (full text, mbox):

From: Göran Uddeborg <goeran <at> uddeborg.se>
To: 18540 <at> debbugs.gnu.org
Subject: bug#18540: closed (Re: bug#18540: Sorting bug?)
Date: Wed, 24 Sep 2014 19:40:55 +0200
Drats!  I felt so proud after the last paragraph of your first
comment! :-)

But I've filed https://bugzilla.redhat.com/show_bug.cgi?id=1146185
(I'm also using Fedora).




Information forwarded to bug-coreutils <at> gnu.org:
bug#18540; Package coreutils. (Wed, 24 Sep 2014 19:00:05 GMT) Full text and rfc822 format available.

Message #21 received at 18540 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Göran Uddeborg <goeran <at> uddeborg.se>, 
 18540 <at> debbugs.gnu.org
Subject: Re: bug#18540: closed (Re: bug#18540: Sorting bug?)
Date: Wed, 24 Sep 2014 11:59:10 -0700
[Message part 1 (text/plain, inline)]
On 09/24/2014 10:40 AM, Göran Uddeborg wrote:
> Drats!  I felt so proud after the last paragraph of your first
> comment! :-)

I would like to chime in with thanks for reporting this bug, which is a 
real howler.  To help the Fedora folks squash it I installed the 
attached test case upstream.  This test succeeds with upstream 'sort' 
but fails with the Fedora 20 version.
[0001-test-check-for-Fedora-20-sort-key-bug.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#18540; Package coreutils. (Tue, 30 Sep 2014 00:36:02 GMT) Full text and rfc822 format available.

Message #24 received at 18540 <at> debbugs.gnu.org (full text, mbox):

From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Göran Uddeborg
 <goeran <at> uddeborg.se>, 18540 <at> debbugs.gnu.org, Ondřej Vašík <ovasik <at> redhat.com>, 
 Andreas Schwab <schwab <at> linux-m68k.org>
Subject: Re: bug#18540: closed (Re: bug#18540: Sorting bug?)
Date: Tue, 30 Sep 2014 02:31:27 +0200
On 09/24/2014 08:59 PM, Paul Eggert wrote:
> To help the Fedora folks squash it I installed the attached test case upstream.  This test succeeds with upstream
> 'sort' but fails with the Fedora 20 version.

FWIW there's now a downstream patch on top of the coreutils-i18n.patch
available on the openSUSE build server:

https://build.opensuse.org/package/view_file/Base:System/coreutils/sort-keycompare-mb.patch?expand=1

Thanks to Andreas Schwab.

Have a nice day,
Berny




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 28 Oct 2014 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 9 years and 181 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.