GNU bug report logs - #16468
join

Package: coreutils;

Reported by: barry kesner <modockesner <at> gmail.com>

Date: Thu, 16 Jan 2014 17:07:01 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16468 in the body.
You can then email your comments to 16468 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#16468; Package coreutils. (Thu, 16 Jan 2014 17:07:01 GMT)

Acknowledgement sent to barry kesner <modockesner <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Thu, 16 Jan 2014 17:07:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: barry kesner <modockesner <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: join
Date: Thu, 16 Jan 2014 10:29:27 -0500
[Message part 1 (text/plain, inline)]
join is failing on large numbers somehow

I have 2 files to join
file 1
99910287    1
99978720    1
99980081    1
99980180    2
99980281    1
99980406    1
99980932    1
99982402    1
100002132   1
100002162   2
100002166   3
file 2 contains
99980081    1
100002129   1
100002136   2
100002162   3

join fails to join properly, giving only 99980081.
If I prefix the 9's with a 0, join does not fail.

Barry

Information forwarded to bug-coreutils <at> gnu.org:
bug#16468; Package coreutils. (Thu, 16 Jan 2014 17:16:02 GMT)

Message #8 received at 16468 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: barry kesner <modockesner <at> gmail.com>, 16468 <at> debbugs.gnu.org
Subject: Re: bug#16468: join
Date: Thu, 16 Jan 2014 10:15:04 -0700
[Message part 1 (text/plain, inline)]
On 01/16/2014 08:29 AM, barry kesner wrote:
> join is failing on large numbers somehow
> 

> 
> join fails to join properly, giving only 99980081.
> If I prefix the 9's with a 0, join does not fail.

Sounds to me like you didn't heed this advice in the --help text:

Important: FILE1 and FILE2 must be sorted on the join fields.
E.g., use "sort -k 1b,1" if 'join' has no options,
or use "join -t ''" if 'sort' has no options.
Note, comparisons honor the rules specified by 'LC_COLLATE'.
If the input is not sorted and some lines cannot be joined, a
warning message will be given.

Does running 'LC_ALL=C join' change the behavior for you, in which case
it was an issue of your choice of LC_COLLATE?
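
The advice above can be checked against the data from the report. The following is a minimal sketch, assuming GNU coreutils and a scratch directory from mktemp; the file names and variable names are illustrative only. It asks 'sort -c' whether file1 is in the order join expects, joins the files as given, then re-sorts with the key from the --help text and joins again.

```shell
# Sample data from the report: sorted numerically, not in collation order.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
99910287    1
99978720    1
99980081    1
99980180    2
99980281    1
99980406    1
99980932    1
99982402    1
100002132   1
100002162   2
100002166   3
EOF
cat > "$tmp/file2" <<'EOF'
99980081    1
100002129   1
100002136   2
100002162   3
EOF

# sort -c exits non-zero when the input is not sorted on the given key.
if LC_ALL=C sort -c -k 1b,1 "$tmp/file1" 2>/dev/null; then
  sorted_ok=yes
else
  sorted_ok=no
fi

# Joining the files as given finds only the first common key
# (join may also complain on stderr and exit non-zero).
unsorted_result=$(LC_ALL=C join "$tmp/file1" "$tmp/file2" 2>/dev/null || true)

# Re-sort both files on the join field in the same locale, then join again.
LC_ALL=C sort -k 1b,1 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -k 1b,1 "$tmp/file2" > "$tmp/file2.sorted"
sorted_result=$(LC_ALL=C join "$tmp/file1.sorted" "$tmp/file2.sorted")

printf 'file1 sorted for join? %s\n' "$sorted_ok"
printf 'join on original files:\n%s\n' "$unsorted_result"
printf 'join on re-sorted files:\n%s\n' "$sorted_result"
rm -rf "$tmp"
```

Note that the re-sorted output comes back in collation order (100002162 before 99980081), not numeric order.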

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#16468; Package coreutils. (Thu, 16 Jan 2014 18:11:02 GMT)

Message #11 received at 16468 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: barry kesner <modockesner <at> gmail.com>, 16468 <at> debbugs.gnu.org
Subject: Re: bug#16468: join
Date: Thu, 16 Jan 2014 11:10:11 -0700
[Message part 1 (text/plain, inline)]
[re-adding the list, with permission]

On 01/16/2014 10:46 AM, barry kesner wrote:
> Eric,
>   Thanks for response.
>  I now realize it wants input sorted alphabetically, not numerically.
> My files are sorted as 999 1000 1001.

I think there have been requests in the past to enhance 'join' so that
it can have more fine-tuned control over how its fields are selected.
Maybe something like sharing code so that 'join -1 k1,1n' would behave
like it were using 'sort -k1,1n' sorting on file 1.  But right now, that
functionality doesn't exist.

> 
>   How do you tell join this without resorting.  The files are huge!

Unfortunately, there isn't any really good way, short of re-processing
the files to make the data appear sorted in the order join expects.
That said, it certainly appears that for your given data, you can write
a sed filter that can reprocess on a line-by-line basis, and feed that
into join, without the penalty of having to re-sort the entire file and
without having to have the processed file stored in your file system all
at once.  It also seems possible to write a post filter to get back to
the style of the line in the original file.  Here, extensions such as bash's
  join <(infilter file1) <(infilter file2) | outfilter
make it easier to type (where the trick is to now write the correct sed
scripts to serve as infilter and outfilter) than the alternative of
using named FIFOs when limiting yourself to just POSIX semantics.
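
For illustration, here is a sketch of the POSIX alternative with named FIFOs, applied to the sample data from the report. The infilter and outfilter bodies are assumptions made for this data only: keys are taken to be 8 or 9 digits with no leading zero, so prefixing a single 0 to the 8-digit keys makes lexical and numeric order agree. The scratch directory and FIFO names are hypothetical.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
99910287    1
99978720    1
99980081    1
99980180    2
99980281    1
99980406    1
99980932    1
99982402    1
100002132   1
100002162   2
100002166   3
EOF
cat > "$tmp/file2" <<'EOF'
99980081    1
100002129   1
100002136   2
100002162   3
EOF

# infilter: prefix 8-digit keys with one zero so every key is 9 digits wide.
infilter() { sed 's/^[0-9]\{8\}[^0-9]/0&/' "$1"; }
# outfilter: drop the single padding zero again.
outfilter() { sed 's/^0//'; }

# Each writer blocks until join opens its FIFO for reading, so the
# filtered data is streamed and never stored as a whole file.
mkfifo "$tmp/fifo1" "$tmp/fifo2"
infilter "$tmp/file1" > "$tmp/fifo1" &
infilter "$tmp/file2" > "$tmp/fifo2" &
result=$(LC_ALL=C join "$tmp/fifo1" "$tmp/fifo2" | outfilter)
wait

printf '%s\n' "$result"
rm -rf "$tmp"
```

With bash, the two background writers and the FIFOs collapse into the process-substitution one-liner quoted above; the filters stay the same.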

> 
> I can't find LC_COLLATE?

It's an environment variable, like LC_ALL, that affects your locale.
Running 'locale' will show you your current locale settings, including
LC_COLLATE.  Setting LC_ALL in the environment is shorthand that forces
all other categories to behave the same, so it's easier to test whether
'LC_ALL=C command' has an effect than it is to figure out which locale
category(ies) matter.
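
A short sketch of how this can be inspected, assuming the 'locale' utility is installed; the variable names are illustrative. It prints the collation setting in effect, then shows the byte-wise ordering that the C locale gives to the kind of keys in this report, next to numeric ordering.

```shell
# Collation setting currently in effect for this shell.
locale | grep '^LC_COLLATE='

# LC_ALL overrides every individual LC_* category, including LC_COLLATE.
LC_ALL=C locale | grep '^LC_COLLATE='

# In the C locale keys compare byte by byte, so "100" sorts before "99".
c_order=$(printf '99\n100\n' | LC_ALL=C sort)
printf 'C locale order:\n%s\n' "$c_order"

# Numeric sort gives the order the reporter's files were in.
n_order=$(printf '100\n99\n' | LC_ALL=C sort -n)
printf 'numeric order:\n%s\n' "$n_order"
```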

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#16468; Package coreutils. (Thu, 16 Jan 2014 19:33:02 GMT)

Message #14 received at 16468 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: barry kesner <modockesner <at> gmail.com>, 16468 <at> debbugs.gnu.org
Subject: Re: bug#16468: join
Date: Thu, 16 Jan 2014 19:31:59 +0000
On 01/16/2014 06:10 PM, Eric Blake wrote:
> [re-adding the list, with permission]
> 
> On 01/16/2014 10:46 AM, barry kesner wrote:
>>   How do you tell join this without resorting.  The files are huge!
> 
> Unfortunately, there isn't any really good way, short of re-processing
> the files to make the data appear sorted in the order join expects.

Note we are working on merging sort, uniq, and join key selection
and comparison code, to support this directly.

http://lists.gnu.org/archive/html/coreutils/2013-09/msg00047.html

thanks,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#16468; Package coreutils. (Fri, 17 Jan 2014 00:01:01 GMT)

Message #17 received at 16468 <at> debbugs.gnu.org (full text, mbox):

From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: Eric Blake <eblake <at> redhat.com>, barry kesner <modockesner <at> gmail.com>, 
 16468 <at> debbugs.gnu.org
Subject: Re: bug#16468: join
Date: Fri, 17 Jan 2014 01:00:10 +0100
On 01/16/2014 07:10 PM, Eric Blake wrote:
> On 01/16/2014 10:46 AM, barry kesner wrote:
>>   How do you tell join this without resorting.  The files are huge!
> 
> Unfortunately, there isn't any really good way, short of re-processing
> the files to make the data appear sorted in the order join expects.
> That said, it certainly appears that for your given data, you can write
> a sed filter that can reprocess on a line-by-line basis, and feed that
> into join, without the penalty of having to re-sort the entire file and
> without having to have the processed file stored in your file system all
> at once.  It also seems possible to write a post filter to get back to
> the style of the line in the original file.  Here, extensions such as bash's
>   join <(infilter file1) <(infilter file2) | outfilter
> make it easier to type (where the trick is to now write the correct sed
> scripts to serve as infilter and outfilter) than the alternative of
> having to use named fifos for limiting yourself to just POSIX semantics.

Hum, isn't such number conversion filtering exactly what numfmt
was designed for?  But wait ...

  $ numfmt --field 1 --format='%020f' < f2
              99980081    1
             100002129   1
             100002136   2
             100002162   3

... it doesn't support leading zeros, unfortunately. ;-/
Wouldn't this be a nice enhancement?

Have a nice day,
Berny




Information forwarded to bug-coreutils <at> gnu.org:
bug#16468; Package coreutils. (Fri, 17 Jan 2014 02:22:02 GMT)

Message #20 received at 16468 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Bernhard Voelker <mail <at> bernhard-voelker.de>
Cc: barry kesner <modockesner <at> gmail.com>, Eric Blake <eblake <at> redhat.com>,
 16468 <at> debbugs.gnu.org
Subject: Re: bug#16468: join
Date: Fri, 17 Jan 2014 02:21:33 +0000
On 01/17/2014 12:00 AM, Bernhard Voelker wrote:
> On 01/16/2014 07:10 PM, Eric Blake wrote:
>> On 01/16/2014 10:46 AM, barry kesner wrote:
>>>   How do you tell join this without resorting.  The files are huge!
>>
>> Unfortunately, there isn't any really good way, short of re-processing
>> the files to make the data appear sorted in the order join expects.
>> That said, it certainly appears that for your given data, you can write
>> a sed filter that can reprocess on a line-by-line basis, and feed that
>> into join, without the penalty of having to re-sort the entire file and
>> without having to have the processed file stored in your file system all
>> at once.  It also seems possible to write a post filter to get back to
>> the style of the line in the original file.  Here, extensions such as bash's
>>   join <(infilter file1) <(infilter file2) | outfilter
>> make it easier to type (where the trick is to now write the correct sed
>> scripts to serve as infilter and outfilter) than the alternative of
>> having to use named fifos for limiting yourself to just POSIX semantics.
> 
> Hum, isn't such number conversion filtering exactly what numfmt
> was designed for?  But wait ...
> 
>   $ numfmt --field 1 --format='%020f' < f2
>               99980081    1
>              100002129   1
>              100002136   2
>              100002162   3
> 
> ... it doesn't support leading zeros, unfortunately. ;-/
> Wouldn't this be a nice enhancement?

Yes, it really should support standard formatting directives:
leading zeros, precision in the format, etc.

thanks,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#16468; Package coreutils. (Wed, 30 Apr 2014 23:54:02 GMT)

Message #23 received at 16468 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Bernhard Voelker <mail <at> bernhard-voelker.de>
Cc: barry kesner <modockesner <at> gmail.com>, Eric Blake <eblake <at> redhat.com>,
 16468 <at> debbugs.gnu.org
Subject: Re: bug#16468: join
Date: Thu, 01 May 2014 00:53:16 +0100
[Message part 1 (text/plain, inline)]
On 01/17/2014 12:00 AM, Bernhard Voelker wrote:
> On 01/16/2014 07:10 PM, Eric Blake wrote:
>> On 01/16/2014 10:46 AM, barry kesner wrote:
>>>   How do you tell join this without resorting.  The files are huge!
>>
>> Unfortunately, there isn't any really good way, short of re-processing
>> the files to make the data appear sorted in the order join expects.
>> That said, it certainly appears that for your given data, you can write
>> a sed filter that can reprocess on a line-by-line basis, and feed that
>> into join, without the penalty of having to re-sort the entire file and
>> without having to have the processed file stored in your file system all
>> at once.  It also seems possible to write a post filter to get back to
>> the style of the line in the original file.  Here, extensions such as bash's
>>   join <(infilter file1) <(infilter file2) | outfilter
>> make it easier to type (where the trick is to now write the correct sed
>> scripts to serve as infilter and outfilter) than the alternative of
>> having to use named fifos for limiting yourself to just POSIX semantics.
> 
> Hum, isn't such number conversion filtering exactly what numfmt
> was designed for?  But wait ...
> 
>   $ numfmt --field 1 --format='%020f' < f2
>               99980081    1
>              100002129   1
>              100002136   2
>              100002162   3
> 
> ... it doesn't support leading zeros, unfortunately. ;-/
> Wouldn't this be a nice enhancement?

I've needed this a few times so I added it in the attached.

thanks,
Pádraig.

[numfmt-leading-zeros.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#16468; Package coreutils. (Thu, 01 May 2014 22:27:02 GMT)

Message #26 received at 16468 <at> debbugs.gnu.org (full text, mbox):

From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: barry kesner <modockesner <at> gmail.com>, Eric Blake <eblake <at> redhat.com>,
 16468 <at> debbugs.gnu.org
Subject: Re: bug#16468: join
Date: Fri, 02 May 2014 00:26:02 +0200
On 05/01/2014 01:53 AM, Pádraig Brady wrote:
> I added it in the attached.

Thanks, great stuff.

> diff --git a/NEWS b/NEWS
> index 7855a48..904aace 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -66,6 +66,9 @@ GNU coreutils NEWS                                    -*- outline -*-
>    causing name look-up errors.  Also look-ups are first done outside the chroot,
>    in case the look-up within the chroot fails due to library conflicts etc.
>  
> +  numfmt supports zero padding of numbers using the standard --printf
> +  syntax of a leading zero, for example --format="%010f".
> +

s/--printf/printf/

> diff --git a/src/numfmt.c b/src/numfmt.c
> index 63411f3..c744875 100644
> --- a/src/numfmt.c
> +++ b/src/numfmt.c
...
> @@ -992,6 +1023,9 @@ parse_format_string (char const *fmt)
>  
>    if (endptr != (fmt + i) && pad != 0)
>      {
> +      if (debug && padding_width && !(zero_padding && pad > 0))
> +        error (0, 0, _("--format padding overridding --padding"));
> +

In --debug mode, it seems odd that the format with the new
zero-padding does not lead to a warning ...

  $ src/numfmt --debug --format="%09f" --padding=2 1234
  000001234

while a format without does:

  $ src/numfmt --debug --format="%9f" --padding=2 1234
  src/numfmt: --format padding overridding --padding
       1234

+1 otherwise.

Thanks & have a nice day,
Berny




Information forwarded to bug-coreutils <at> gnu.org:
bug#16468; Package coreutils. (Fri, 02 May 2014 00:59:02 GMT)

Message #29 received at 16468 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Bernhard Voelker <mail <at> bernhard-voelker.de>
Cc: barry kesner <modockesner <at> gmail.com>, Eric Blake <eblake <at> redhat.com>,
 16468 <at> debbugs.gnu.org
Subject: Re: bug#16468: join
Date: Fri, 02 May 2014 01:58:50 +0100
On 05/01/2014 11:26 PM, Bernhard Voelker wrote:
> On 05/01/2014 01:53 AM, Pádraig Brady wrote:
>> I added it in the attached.
> 
> Thanks, great stuff.
> 
>> diff --git a/NEWS b/NEWS
>> index 7855a48..904aace 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -66,6 +66,9 @@ GNU coreutils NEWS                                    -*- outline -*-
>>    causing name look-up errors.  Also look-ups are first done outside the chroot,
>>    in case the look-up within the chroot fails due to library conflicts etc.
>>  
>> +  numfmt supports zero padding of numbers using the standard --printf
>> +  syntax of a leading zero, for example --format="%010f".
>> +
> 
> s/--printf/printf/

done

>> diff --git a/src/numfmt.c b/src/numfmt.c
>> index 63411f3..c744875 100644
>> --- a/src/numfmt.c
>> +++ b/src/numfmt.c
> ...
>> @@ -992,6 +1023,9 @@ parse_format_string (char const *fmt)
>>  
>>    if (endptr != (fmt + i) && pad != 0)
>>      {
>> +      if (debug && padding_width && !(zero_padding && pad > 0))
>> +        error (0, 0, _("--format padding overridding --padding"));
>> +
> 
> In --debug mode, it seems odd that the format with the new
> zero-padding does not lead to a warning ...
> 
>   $ src/numfmt --debug --format="%09f" --padding=2 1234
>   000001234

In this case the zero-padding width and --padding are separate settings.
Since the numbers are zero padded, you can still position them freely within a field, for example:
  numfmt --header --field=2 --format="%010f" --padding=-15 < /proc/interrupts

> while a format without does:
> 
>   $ src/numfmt --debug --format="%9f" --padding=2 1234
>   src/numfmt: --format padding overridding --padding
>        1234

Here the --padding is overridden hence the warning.

thanks for the review!

I've now pushed it.

Pádraig.
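
With zero padding available, numfmt can play the role of the input filter for the join in the original report. This is a sketch only, assuming a coreutils release that includes the change above; the pad helper, the width of 9 digits and the scratch file names are assumptions chosen for the sample data. The smaller file is padded into a temporary copy and the larger one is streamed to join on standard input.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
99910287    1
99978720    1
99980081    1
99980180    2
99980281    1
99980406    1
99980932    1
99982402    1
100002132   1
100002162   2
100002166   3
EOF
cat > "$tmp/file2" <<'EOF'
99980081    1
100002129   1
100002136   2
100002162   3
EOF

# Zero padding of a single number, matching the example earlier in the thread.
single=$(numfmt --format='%09f' 1234)
printf '%s\n' "$single"

# pad: zero-pad the first field to 9 digits so lexical order matches numeric order.
pad() { numfmt --field 1 --format='%09f'; }

pad < "$tmp/file2" > "$tmp/file2.padded"

# Stream the padded file1 into join, then strip the leading zeros again.
result=$(pad < "$tmp/file1" | LC_ALL=C join - "$tmp/file2.padded" | sed 's/^0*//')

printf '%s\n' "$result"
rm -rf "$tmp"
```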




Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 11 Oct 2018 22:16:03 GMT)

bug closed, send any further explanations to 16468 <at> debbugs.gnu.org and barry kesner <modockesner <at> gmail.com> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 11 Oct 2018 22:16:03 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 09 Nov 2018 12:24:08 GMT)


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.