GNU bug report logs - #6366
join can't join on numeric fields


Package: coreutils;

Reported by: Alex Shinn <alexshinn <at> gmail.com>

Date: Mon, 7 Jun 2010 05:24:02 UTC

Severity: wishlist

Tags: patch

Merged with 10924, 12264

To reply to this bug, email your comments to 6366 AT debbugs.gnu.org.



Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6366; Package coreutils. (Mon, 07 Jun 2010 05:24:02 GMT)

Acknowledgement sent to Alex Shinn <alexshinn <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 07 Jun 2010 05:24:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Alex Shinn <alexshinn <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: join can't join on numeric fields
Date: Mon, 7 Jun 2010 14:19:24 +0900
Hi,

Ideally join should be able to handle files sorted in any order
that sort provides, but as a bare minimum it should at least
be able to join files sorted on numeric fields.

The attached simple patch provides -n, --numeric-sort
options to this effect.

-- 
Alex
[join.c.diff (text/x-patch, attachment)]

Severity set to 'wishlist' from 'normal' Request was from bob <at> proulx.com (Bob Proulx) to control <at> debbugs.gnu.org. (Mon, 07 Jun 2010 22:46:02 GMT)

Added tag(s) patch. Request was from bob <at> proulx.com (Bob Proulx) to control <at> debbugs.gnu.org. (Mon, 07 Jun 2010 22:46:02 GMT)

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6366; Package coreutils. (Mon, 07 Jun 2010 23:00:03 GMT)

Message #12 received at 6366 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Alex Shinn <alexshinn <at> gmail.com>
Cc: 6366 <at> debbugs.gnu.org
Subject: Re: bug#6366: join can't join on numeric fields
Date: Mon, 07 Jun 2010 23:59:13 +0100
On 07/06/10 06:19, Alex Shinn wrote:
> Hi,
> 
> Ideally join should be able to handle files sorted in any order
> that sort provides, but as a bare minimum it should at least
> be able to join files sorted on numeric fields.

Well if there were no aliases in the numbers, you could always
sort the output numerically after the join if it was important.
However if you wanted to join "01" and "1" then your patch is required.
Are numeric aliases common enough to warrant this? I think so.

> The attached simple patch provides -n, --numeric-sort
> options to this effect.

I'd use -g, --general-numeric to correspond with `sort`.

cheers,
Pádraig.




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6366; Package coreutils. (Wed, 09 Jun 2010 01:48:02 GMT)

Message #15 received at 6366 <at> debbugs.gnu.org:

From: Alex Shinn <alexshinn <at> gmail.com>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: 6366 <at> debbugs.gnu.org
Subject: Re: bug#6366: join can't join on numeric fields
Date: Wed, 9 Jun 2010 10:47:05 +0900
2010/6/8 Pádraig Brady <P <at> draigbrady.com>:
> On 07/06/10 06:19, Alex Shinn wrote:
>>
>> Ideally join should be able to handle files sorted in any order
>> that sort provides, but as a bare minimum it should at least
>> be able to join files sorted on numeric fields.
>
> Well if there were no aliases in the numbers, you could always
> sort the output numerically after the join if it was important.

By first sorting lexicographically, you mean?
In the use case I had, the data was already sorted
numerically.  So whenever I want to join two files,
currently I have to do:

  sort file1 > file1.tmp
  sort file2 > file2.tmp
  join file1.tmp file2.tmp | sort -n > out
  rm -f file1.tmp file2.tmp

instead of just

  join -n file1 file2 > out

In the small tools philosophy you want to avoid adding
redundancy, but in this case join isn't doing the same
thing as sort, it's just working with it better.  Not to mention
the fact that sort is an expensive operation to have to
perform multiple times, not just an extra O(n) filter
to throw in the middle of a pipeline.

> However if you wanted to join "01" and "1" then your patch is required.
> Are numeric aliases common enough to warrant this? I think so.

Leading zeros may not be so common, but don't forget
"1.0" and "1" or "1e2" and "100" and "100.0", etc.

> I'd use -g, --general-numeric to correspond with `sort`.

Yes, that's probably better.

-- 
Alex
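
A minimal sketch of the workaround described in this message, using hypothetical sample data (the file contents and the temporary directory are assumptions, not taken from the report). It also illustrates the alias point: plain join compares keys as strings, so "01" and "1" are not paired.

```shell
#!/bin/sh
# Sketch: join numerically sorted files via the lexicographic-sort workaround.
export LC_ALL=C          # fix the collation order so results are reproducible
tmp=$(mktemp -d)

# Hypothetical sample data, already sorted numerically on field 1.
printf '%s\n' '2 b' '9 i' '10 j' > "$tmp/file1"
printf '%s\n' '2 B' '10 J' '11 K' > "$tmp/file2"

# Re-sort lexicographically for join, then restore numeric order afterwards.
sort "$tmp/file1" > "$tmp/file1.tmp"
sort "$tmp/file2" > "$tmp/file2.tmp"
joined=$(join "$tmp/file1.tmp" "$tmp/file2.tmp" | sort -n)
printf '%s\n' "$joined"

# Numeric aliases are not matched: "01" and "1" differ as strings.
printf '%s\n' '01 x' > "$tmp/a"
printf '%s\n' '1 y' > "$tmp/b"
aliases=$(join "$tmp/a" "$tmp/b")
printf 'alias matches: [%s]\n' "$aliases"

rm -rf "$tmp"
```

The two extra sort passes and the final `sort -n` are exactly the overhead that the proposed option would avoid.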




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6366; Package coreutils. (Wed, 09 Jun 2010 06:57:02 GMT)

Message #18 received at 6366 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Alex Shinn <alexshinn <at> gmail.com>
Cc: Pádraig Brady <P <at> draigbrady.com>, 6366 <at> debbugs.gnu.org
Subject: Re: bug#6366: join can't join on numeric fields
Date: Wed, 09 Jun 2010 08:56:07 +0200
Alex Shinn wrote:

> 2010/6/8 Pádraig Brady <P <at> draigbrady.com>:
>> On 07/06/10 06:19, Alex Shinn wrote:
>>>
>>> Ideally join should be able to handle files sorted in any order
>>> that sort provides, but as a bare minimum it should at least
>>> be able to join files sorted on numeric fields.
>>
>> Well if there were no aliases in the numbers, you could always
>> sort the output numerically after the join if it was important.
>
> By first sorting lexicographically, you mean?
> In the use case I had, the data was already sorted
> numerically.  So whenever I want to join two files,
> currently I have to do:
>
>   sort file1 > file1.tmp
>   sort file2 > file2.tmp
>   join file1.tmp file2.tmp | sort -n > out
>   rm -f file1.tmp file2.tmp
>
> instead of just
>
>   join -n file1 file2 > out
>
> In the small tools philosophy you want to avoid adding
> redundancy, but in this case join isn't doing the same
> thing as sort, it's just working with it better.  Not to mention
> the fact that sort is an expensive operation to have to
> perform multiple times, not just an extra O(n) filter
> to throw in the middle of a pipeline.
>
>> However if you wanted to join "01" and "1" then your patch is required.
>> Are numeric aliases common enough to warrant this? I think so.
>
> Leading zeros may not be so common, but don't forget
> "1.0" and "1" or "1e2" and "100" and "100.0", etc.
>
>> I'd use -g, --general-numeric to correspond with `sort`.
>
> Yes, that's probably better.

There may be a fly in the ointment.

When comparing floating point numbers how would join measure equality?
Should it consider 1.000000000000001e2 to be equal to 100.0 ?
What if the maximum precision available does not
allow us to distinguish those two values?

What about -0 and 0? (with IEEE 754, they'll compare equal)
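
The equality question can be probed with sort itself, as a rough sketch. `sort -g -u` keeps one line per run of keys that compare equal, so counting its output lines shows how many distinct keys the general-numeric comparison sees. The helper name `distinct` is hypothetical. The result for the nearly equal pair is printed but deliberately not stated here, because it depends on the floating-point precision of the platform.

```shell
#!/bin/sh
export LC_ALL=C   # make "." the decimal point regardless of the user's locale

# Hypothetical helper: count the distinct keys that `sort -g -u` sees
# among the spellings passed as arguments.
distinct() {
    printf '%s\n' "$@" | sort -g -u | wc -l | tr -d ' '
}

aliases=$(distinct 1e2 100 100.0)           # three spellings of one value
zeros=$(distinct -0 0)                      # negative and positive zero
close=$(distinct 1.000000000000001e2 100.0) # platform-dependent result

echo "1e2, 100, 100.0: $aliases distinct key(s)"
echo "-0, 0: $zeros distinct key(s)"
echo "1.000000000000001e2, 100.0: $close distinct key(s)"
```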




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6366; Package coreutils. (Wed, 09 Jun 2010 07:34:02 GMT)

Message #21 received at 6366 <at> debbugs.gnu.org:

From: Alex Shinn <alexshinn <at> gmail.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: Pádraig Brady <P <at> draigbrady.com>, 6366 <at> debbugs.gnu.org
Subject: Re: bug#6366: join can't join on numeric fields
Date: Wed, 9 Jun 2010 16:33:06 +0900
On Wed, Jun 9, 2010 at 3:56 PM, Jim Meyering <jim <at> meyering.net> wrote:
>
> When comparing floating point numbers how would join measure equality?
> Should it consider 1.000000000000001e2 to be equal to 100.0 ?
> What if the maximum precision available does not
> allow us to distinguish those two values?

Indeed, that's why you generally don't use floating
point numbers as database keys.

We could either restrict the numbers to integers,
or continue to allow floats with a note to the effect
that float precision is machine-specific.  Personally
I have no need for floats so am happy with the former.

-- 
Alex




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6366; Package coreutils. (Wed, 09 Jun 2010 17:08:02 GMT)

Message #24 received at submit <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> CS.UCLA.EDU>
To: bug-coreutils <at> gnu.org
Subject: Re: bug#6366: join can't join on numeric fields
Date: Wed, 09 Jun 2010 10:06:56 -0700
On 06/08/2010 11:56 PM, Jim Meyering wrote:

> There may be a fly in the ointment.
> 
> When comparing floating point numbers how would join measure equality?

The point is that "join" should be compatible with "sort".
Any option that "sort" has to compare fields,
is an option that "join" should also have.
The same code should be used for both "join" and "sort",
to do comparison.  So, if "sort" has an option to do
IEEE-754 comparison in a certain way, "join" should
have the same option.

Arguably "uniq" should have the same set of options,
when checking whether two lines are equal, but I'd
say that's lower priority.




Information forwarded to bug-coreutils <at> gnu.org:
bug#6366; Package coreutils. (Wed, 21 Mar 2012 16:41:03 GMT)

Message #27 received at 6366 <at> debbugs.gnu.org:

From: Drew Frank <ajfrank <at> ics.uci.edu>
To: 6366 <at> debbugs.gnu.org
Subject: join can't join on numeric fields
Date: Tue, 20 Mar 2012 22:34:38 -0700
Hi,

I've attached a patch that implements both "numeric sort" (-n) and "general
numeric sort" (-g). I copied the relevant comparison code from sort.c to
ensure consistent behavior. Should these functions be extracted and put in
a shared header file, or is it okay to duplicate the code?

In terms of testing, I confirmed that the code correctly deals with
different single-byte thousand separators on my local machine, but I wasn't
sure how to exercise that in the test suite since I can't rely on everyone
having a particular locale available. How should that be handled?

I also updated the relevant docs.

Thanks,
Drew

P.S. I sent a previous version of the patch to the wrong address and
accidentally initiated a new bug thread (#10924). This patch supersedes the
previous.
[join.numeric.sort.diff (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#6366; Package coreutils. (Wed, 22 Aug 2012 22:02:01 GMT)

Message #30 received at 6366 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: sds <at> gnu.org
Cc: coreutils <at> gnu.org, 6366 <at> debbugs.gnu.org
Subject: Re: comm: use numeric sort (optionally)
Date: Wed, 22 Aug 2012 23:00:53 +0100
unarchive 6366
stop

On 08/22/2012 09:17 PM, Sam Steingold wrote:
> Hi,
> I have a file of numbers (integers IDs) which are sorted numerically,
> but comm complains that they are not.
> I suggest that comm accept "-n" option (like sort) to use numeric sort.
> Thanks.

Yes.
comm, join, uniq really should support the same field selection
and comparison flags as sort.

There are already bugs for that:
http://bugs.gnu.org/5832
http://bugs.gnu.org/6366

For reference, the following examples show
expected and unexpected behavior, respectively:

$ comm --nocheck <(printf "%s\n" a b c j k) <(printf "%s\n" a b d e j)
		a
		b
c
	d
	e
		j
k

$ comm --nocheck <(printf "%s\n" 1 2 3 10 11) <(printf "%s\n" 1 2 5 6 10)
		1
		2
3
10
11
	5
	6
	10

cheers,
Pádraig.
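
Until comm accepts a numeric option, the same sort-first workaround from earlier in this report applies. The sketch below reuses the numeric inputs from the second example; the temporary file names are assumptions, and temporary files stand in for process substitution so that it runs under plain sh.

```shell
#!/bin/sh
export LC_ALL=C          # fix the collation order so results are reproducible
tmp=$(mktemp -d)

# The numerically sorted inputs from the second comm example.
printf '%s\n' 1 2 3 10 11 > "$tmp/left"
printf '%s\n' 1 2 5 6 10 > "$tmp/right"

# comm expects lexicographic order, so re-sort both inputs first.
sort "$tmp/left" > "$tmp/left.lex"
sort "$tmp/right" > "$tmp/right.lex"

# -12 suppresses columns 1 and 2, leaving only the lines common to both
# files; a final numeric sort restores the original ordering.
common=$(comm -12 "$tmp/left.lex" "$tmp/right.lex" | sort -n)
printf '%s\n' "$common"

rm -rf "$tmp"
```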




Forcibly Merged 6366 12264. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Thu, 23 Aug 2012 08:54:01 GMT)

Forcibly Merged 6366 10924 12264. Request was from era eriksson <era <at> iki.fi> to control <at> debbugs.gnu.org. (Thu, 30 Aug 2012 08:08:02 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#6366; Package coreutils. (Tue, 09 Oct 2018 20:20:01 GMT)

Message #37 received at 6366 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: 6366 <at> debbugs.gnu.org
Subject: Re: bug#6366: comm: use numeric sort (optionally)
Date: Tue, 9 Oct 2018 14:19:04 -0600
severity 6366 wishlist
stop

(triaging old bugs)

Hello,

On 22/08/12 04:00 PM, Pádraig Brady wrote:
> 
> On 08/22/2012 09:17 PM, Sam Steingold wrote:
>> I have a file of numbers (integers IDs) which are sorted numerically,
>> but comm complains that they are not.
>> I suggest that comm accept "-n" option (like sort) to use numeric sort.
>> Thanks.
> 
> Yes.
> comm, join, uniq really should support the same field selection
> and comparison flags as sort.
> 
> There are already bugs for that:
> http://bugs.gnu.org/5832
> http://bugs.gnu.org/6366

There is an on-going (though somewhat dormant) effort to
consolidate the key comparison of uniq/join/sort into one common
module:

  https://lists.gnu.org/r/coreutils/2013-02/msg00087.html
  https://lists.gnu.org/r/coreutils/2016-04/msg00063.html


As such, I'm marking this as a wishlist item, and hopefully we'll get
to it sooner or later...

regards,
 - assaf






Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 09 Oct 2018 20:20:02 GMT)

This bug report was last modified 5 years and 204 days ago.


