GNU bug report logs - #6366: join can't join on numeric fields
To reply to this bug, email your comments to 6366 AT debbugs.gnu.org.
Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6366; Package coreutils. (Mon, 07 Jun 2010 05:24:02 GMT)
Acknowledgement sent to Alex Shinn <alexshinn <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 07 Jun 2010 05:24:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Hi,
Ideally join should be able to handle files sorted in any order
that sort provides, but as a bare minimum it should at least
be able to join files sorted on numeric fields.
The attached simple patch provides -n, --numeric-sort
options to this effect.
--
Alex
[join.c.diff (text/x-patch, attachment)]
Severity set to 'wishlist' from 'normal'. Request was from bob <at> proulx.com (Bob Proulx) to control <at> debbugs.gnu.org. (Mon, 07 Jun 2010 22:46:02 GMT)
Added tag(s) patch. Request was from bob <at> proulx.com (Bob Proulx) to control <at> debbugs.gnu.org. (Mon, 07 Jun 2010 22:46:02 GMT)
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6366; Package coreutils. (Mon, 07 Jun 2010 23:00:03 GMT)
Message #12 received at 6366 <at> debbugs.gnu.org:
On 07/06/10 06:19, Alex Shinn wrote:
> Hi,
>
> Ideally join should be able to handle files sorted in any order
> that sort provides, but as a bare minimum it should at least
> be able to join files sorted on numeric fields.
Well if there were no aliases in the numbers, you could always
sort the output numerically after the join if it was important.
However if you wanted to join "01" and "1" then your patch is required.
Are numeric aliases common enough to warrant this? I think so.
> The attached simple patch provides -n, --numeric-sort
> options to this effect.
I'd use -g, --general-numeric to correspond with `sort`.
cheers,
Pádraig.
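The alias problem mentioned above can be reproduced with a small sketch. The file names and sample rows are made up for illustration, and LC_ALL=C is set only so that the result does not depend on the local collation rules:

```shell
# Keys "01" and "1" denote the same number but are different strings.
export LC_ALL=C
tmp=$(mktemp -d)
printf '%s\n' '01 apple' > "$tmp/left"
printf '%s\n' '1 red' > "$tmp/right"

# join compares the key fields as strings, so the two rows are not paired.
matched=$(join "$tmp/left" "$tmp/right")
printf 'matched lines: [%s]\n' "$matched"
rm -rf "$tmp"
```

Sorting the output numerically afterwards cannot help in this case, because the row was never produced.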
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6366; Package coreutils. (Wed, 09 Jun 2010 01:48:02 GMT)
Message #15 received at 6366 <at> debbugs.gnu.org:
2010/6/8 Pádraig Brady <P <at> draigbrady.com>:
> On 07/06/10 06:19, Alex Shinn wrote:
>>
>> Ideally join should be able to handle files sorted in any order
>> that sort provides, but as a bare minimum it should at least
>> be able to join files sorted on numeric fields.
>
> Well if there were no aliases in the numbers, you could always
> sort the output numerically after the join if it was important.
By first sorting lexicographically, you mean?
In the use case I had, the data was already sorted
numerically. So whenever I want to join two files,
currently I have to do:
sort file1 > file1.tmp
sort file2 > file2.tmp
join file1.tmp file2.tmp | sort -n > out
rm -f file1.tmp file2.tmp
instead of just
join -n file1 file2 > out
In the small tools philosophy you want to avoid adding
redundancy, but in this case join isn't doing the same
thing as sort, it's just working with it better. Not to mention
the fact that sort is an expensive operation to have to
perform multiple times, not just an extra O(n) filter
to throw in the middle of a pipeline.
> However if you wanted to join "01" and "1" then your patch is required.
> Are numeric aliases common enough to warrant this? I think so.
Leading zeros may not be so common, but don't forget
"1.0" and "1" or "1e2" and "100" and "100.0", etc.
> I'd use -g, --general-numeric to correspond with `sort`.
Yes, that's probably better.
--
Alex
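For readers who want to try the workaround described in this message, here is a runnable sketch. The sample rows and the temporary directory are assumptions added for illustration. It also uses `sort -k 1b,1` rather than a plain `sort`, since that is the form the coreutils manual suggests for preparing join input:

```shell
export LC_ALL=C
tmp=$(mktemp -d)
# Hypothetical sample data, already sorted numerically on the first field.
printf '%s\n' '2 b' '9 i' '10 j' > "$tmp/file1"
printf '%s\n' '2 B' '10 J' '11 K' > "$tmp/file2"

# Re-sort lexicographically on the key so that join accepts the inputs.
sort -k 1b,1 "$tmp/file1" > "$tmp/file1.tmp"
sort -k 1b,1 "$tmp/file2" > "$tmp/file2.tmp"
# Join, then restore numeric order on the output.
out=$(join "$tmp/file1.tmp" "$tmp/file2.tmp" | sort -n)
printf '%s\n' "$out"
rm -rf "$tmp"
```

That is three sort passes for data that was already in a usable order, which is the cost the message objects to.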
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6366; Package coreutils. (Wed, 09 Jun 2010 06:57:02 GMT)
Message #18 received at 6366 <at> debbugs.gnu.org:
Alex Shinn wrote:
> 2010/6/8 Pádraig Brady <P <at> draigbrady.com>:
>> On 07/06/10 06:19, Alex Shinn wrote:
>>>
>>> Ideally join should be able to handle files sorted in any order
>>> that sort provides, but as a bare minimum it should at least
>>> be able to join files sorted on numeric fields.
>>
>> Well if there were no aliases in the numbers, you could always
>> sort the output numerically after the join if it was important.
>
> By first sorting lexicographically, you mean?
> In the use case I had, the data was already sorted
> numerically. So whenever I want to join two files,
> currently I have to do:
>
> sort file1 > file1.tmp
> sort file2 > file2.tmp
> join file1.tmp file2.tmp | sort -n > out
> rm -f file1.tmp file2.tmp
>
> instead of just
>
> join -n file1 file2 > out
>
> In the small tools philosophy you want to avoid adding
> redundancy, but in this case join isn't doing the same
> thing as sort, it's just working with it better. Not to mention
> the fact that sort is an expensive operation to have to
> perform multiple times, not just an extra O(n) filter
> to throw in the middle of a pipeline.
>
>> However if you wanted to join "01" and "1" then your patch is required.
>> Are numeric aliases common enough to warrant this? I think so.
>
> Leading zeros may not be so common, but don't forget
> "1.0" and "1" or "1e2" and "100" and "100.0", etc.
>
>> I'd use -g, --general-numeric to correspond with `sort`.
>
> Yes, that's probably better.
There may be a fly in the ointment.
When comparing floating point numbers how would join measure equality?
Should it consider 1.000000000000001e2 to be equal to 100.0 ?
What if the maximum precision available does not
allow us to distinguish those two values?
What about -0 and 0? (with IEEE 754, they'll compare equal)
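These questions can be explored with a quick sketch in awk, which stores numbers as C doubles (assumed here to be IEEE 754 binary64). This only illustrates double-precision comparison; it is not how sort or join compare keys, and `sort -g` may use a wider type:

```shell
export LC_ALL=C
float_demo=$(awk 'BEGIN {
    # About 1e-13 away from 100, which is several units in the last place.
    print "1.000000000000001e2 == 100.0:", (1.000000000000001e2 == 100.0)
    # About 1e-15 away from 100, which is below double resolution near 100.
    print "1.00000000000000001e2 == 100.0:", (1.00000000000000001e2 == 100.0)
    # Negative zero against positive zero.
    print "-0.0 == 0.0:", (-0.0 == 0.0)
}')
printf '%s\n' "$float_demo"
```

Each line ends in 1 when awk considers the two values equal and 0 otherwise.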
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6366; Package coreutils. (Wed, 09 Jun 2010 07:34:02 GMT)
Message #21 received at 6366 <at> debbugs.gnu.org:
On Wed, Jun 9, 2010 at 3:56 PM, Jim Meyering <jim <at> meyering.net> wrote:
>
> When comparing floating point numbers how would join measure equality?
> Should it consider 1.000000000000001e2 to be equal to 100.0 ?
> What if the maximum precision available does not
> allow us to distinguish those two values?
Indeed, that's why you generally don't use floating
point numbers as database keys.
We could either restrict the numbers to integers,
or continue to allow floats with a note to the effect
that float precision is machine-specific. Personally
I have no need for floats so am happy with the former.
--
Alex
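If keys are restricted to non-negative integers, one possible stopgap that avoids re-sorting is to zero-pad the keys, so that numeric order and lexicographic order coincide. This is a sketch and not something proposed in the thread: `pad_key` is a hypothetical helper, the 10-digit width is an arbitrary assumption, and the sample rows are made up:

```shell
export LC_ALL=C
tmp=$(mktemp -d)
# Numerically sorted inputs; note the "02" alias in the second file.
printf '%s\n' '2 b' '9 i' '10 j' > "$tmp/file1"
printf '%s\n' '02 B' '10 J' '11 K' > "$tmp/file2"

# pad_key (hypothetical): rewrite field 1 as a 10-digit zero-padded integer.
pad_key() { awk '{ $1 = sprintf("%010d", $1); print }' "$1"; }

pad_key "$tmp/file1" > "$tmp/file1.pad"
pad_key "$tmp/file2" > "$tmp/file2.pad"
# Join on the padded keys, then strip the padding by forcing a numeric value.
padded_out=$(join "$tmp/file1.pad" "$tmp/file2.pad" | awk '{ $1 = $1 + 0; print }')
printf '%s\n' "$padded_out"
rm -rf "$tmp"
```

Each step is a single linear pass, and integer aliases such as "02" and "2" end up as the same key. Note that awk rebuilds each line with single spaces between fields.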
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#6366; Package coreutils. (Wed, 09 Jun 2010 17:08:02 GMT)
Message #24 received at submit <at> debbugs.gnu.org:
On 06/08/2010 11:56 PM, Jim Meyering wrote:
> There may be a fly in the ointment.
>
> When comparing floating point numbers how would join measure equality?
The point is that "join" should be compatible with "sort".
Any option that "sort" has to compare fields,
is an option that "join" should also have.
The same code should be used for both "join" and "sort",
to do comparison. So, if "sort" has an option to do
IEEE-754 comparison in a certain way, "join" should
have the same option.
Arguably "uniq" should have the same set of options,
when checking whether two lines are equal, but I'd
say that's lower priority.
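To see why the choice of comparison option matters, the sketch below sorts the same three keys with the two numeric modes of sort. It assumes a sort that supports `-g`, as GNU sort does:

```shell
export LC_ALL=C
# -n reads only the leading decimal number, so "1e2" counts as 1.
numeric=$(printf '%s\n' 1e2 20 3 | sort -n)
# -g parses the whole token as floating point, so "1e2" counts as 100.
general=$(printf '%s\n' 1e2 20 3 | sort -g)
printf 'sort -n:\n%s\n' "$numeric"
printf 'sort -g:\n%s\n' "$general"
```

The two orderings differ, so a join that consumes such files would need to apply the same comparison that produced them.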
Information forwarded to bug-coreutils <at> gnu.org: bug#6366; Package coreutils. (Wed, 21 Mar 2012 16:41:03 GMT)
Message #27 received at 6366 <at> debbugs.gnu.org:
Hi,
I've attached a patch that implements both "numeric sort" (-n) and "general
numeric sort" (-g). I copied the relevant comparison code from sort.c to
ensure consistent behavior. Should these functions be extracted and put in
a shared header file, or is it okay to duplicate the code?
In terms of testing, I confirmed that the code correctly deals with
different single-byte thousand separators on my local machine, but I wasn't
sure how to exercise that in the test suite since I can't rely on everyone
having a particular locale available. How should that be handled?
I also updated the relevant docs.
Thanks,
Drew
P.S. I sent a previous version of the patch to the wrong address and
accidentally initiated a new bug thread (#10924). This patch supersedes the
previous.
[join.numeric.sort.diff (text/x-patch, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#6366; Package coreutils. (Wed, 22 Aug 2012 22:02:01 GMT)
Message #30 received at 6366 <at> debbugs.gnu.org:
unarchive 6366
stop
On 08/22/2012 09:17 PM, Sam Steingold wrote:
> Hi,
> I have a file of numbers (integer IDs) which are sorted numerically,
> but comm complains that they are not.
> I suggest that comm accept "-n" option (like sort) to use numeric sort.
> Thanks.
Yes.
comm, join, uniq really should support the same field selection
and comparison flags as sort.
There are already bugs for that:
http://bugs.gnu.org/5832
http://bugs.gnu.org/6366
For reference, the following examples show
expected and unexpected behavior, respectively:
$ comm --nocheck <(printf "%s\n" a b c j k) <(printf "%s\n" a b d e j)
		a
		b
c
	d
	e
		j
k
$ comm --nocheck <(printf "%s\n" 1 2 3 10 11) <(printf "%s\n" 1 2 5 6 10)
		1
		2
3
10
11
	5
	6
	10
cheers,
Pádraig.
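Until comm has such an option, the lines common to two numerically sorted files can still be found by re-sorting, as in this sketch. The temporary file names are assumptions, and the data is the same as in the second example above:

```shell
export LC_ALL=C
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 10 11 > "$tmp/a"
printf '%s\n' 1 2 5 6 10 > "$tmp/b"

# comm expects lexicographically sorted input, so re-sort both files first.
sort "$tmp/a" > "$tmp/a.sorted"
sort "$tmp/b" > "$tmp/b.sorted"
# -12 suppresses columns 1 and 2, leaving only the lines present in both files.
common=$(comm -12 "$tmp/a.sorted" "$tmp/b.sorted" | sort -n)
printf '%s\n' "$common"
rm -rf "$tmp"
```

Here 10 is reported as common, whereas the numerically sorted input above caused it to be listed separately for each file.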
Forcibly Merged 6366 12264. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Thu, 23 Aug 2012 08:54:01 GMT)
Forcibly Merged 6366 10924 12264. Request was from era eriksson <era <at> iki.fi> to control <at> debbugs.gnu.org. (Thu, 30 Aug 2012 08:08:02 GMT)
Information forwarded to bug-coreutils <at> gnu.org: bug#6366; Package coreutils. (Tue, 09 Oct 2018 20:20:01 GMT)
Message #37 received at 6366 <at> debbugs.gnu.org:
severity 6366 wishlist
stop
(triaging old bugs)
Hello,
On 22/08/12 04:00 PM, Pádraig Brady wrote:
>
> On 08/22/2012 09:17 PM, Sam Steingold wrote:
>> I have a file of numbers (integer IDs) which are sorted numerically,
>> but comm complains that they are not.
>> I suggest that comm accept "-n" option (like sort) to use numeric sort.
>> Thanks.
>
> Yes.
> comm, join, uniq really should support the same field selection
> and comparison flags as sort.
>
> There are already bugs for that:
> http://bugs.gnu.org/5832
> http://bugs.gnu.org/6366
There is an on-going (though somewhat dormant) effort to
consolidate the key comparison of uniq/join/sort into one common
module:
https://lists.gnu.org/r/coreutils/2013-02/msg00087.html
https://lists.gnu.org/r/coreutils/2016-04/msg00063.html
As such, I'm marking this as a wishlist item, and hopefully we'll get
to it sooner or later...
regards,
- assaf
Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 09 Oct 2018 20:20:02 GMT)
This bug report was last modified 5 years and 204 days ago.