GNU bug report logs - #13354
sort: File Merge Order is Suboptimal with Many Passes


Package: coreutils

Reported by: Jason Bucata <jbucata <at> windstream.net>

Date: Fri, 4 Jan 2013 07:46:01 UTC

Severity: normal

Tags: moreinfo

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.





Report forwarded to bug-coreutils <at> gnu.org:
bug#13354; Package coreutils. (Fri, 04 Jan 2013 07:46:02 GMT)

Acknowledgement sent to Jason Bucata <jbucata <at> windstream.net>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 04 Jan 2013 07:46:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Jason Bucata <jbucata <at> windstream.net>
To: bug-coreutils <at> gnu.org
Subject: sort: File Merge Order is Suboptimal with Many Passes
Date: Thu, 3 Jan 2013 22:07:00 -0600
Over a year ago, I was working with sort on some large input files.  I
noticed a performance problem that related to the number of temp files being
used for sort runs.  As I recall, I'd also seen the problem prior to that,
but this was the first time I took an opportunity to diagnose the problem
in detail.

(As for why I waited so long to finally report it, more below.)

From what I recall, it tended to want to do 8-way merges on my system.  If
the input file was just big enough, it would wind up with 9 files left over
toward the end.  It would merge the largest 8 files in another pass, and
then do one final merge to combine the now-huge temp file with the one,
likely very tiny temp file still remaining.

I found that the extra pass added a lot of extra time to the sort.  If I
bumped up the memory buffer to increase the initial run sizes so that I got
rid of one temp file, I could shave at least a few minutes off the total run
time: The little extra memory made a disproportionate difference, and I
could attribute it directly to the extra merge pass at the end that was
avoided.

At the time I reasoned that it would make the most sense to merge temp files
at the very beginning to reduce the number just enough to have a bunch of
--batch-size merge passes after that until the end.  In this case, if the
extra temp file at the end was 10 MB, then each merge pass would have to
process 10 MB more.  Over 4 merge passes, say, that's 40 MB extra to read.
But the benefit is saving the final merge pass with 10 MB plus something a
WHOLE lot bigger than 40 MB, if it's big enough to require 4 merge passes to
combine it all in the first place!
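
To put rough numbers on it: if the other temp files total, say, 10 GB,
merging the 10 MB straggler early costs at most about 40 MB of additional
I/O spread across the passes, whereas deferring it forces one more full
pass that rereads the entire 10 GB plus the 10 MB.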

I held off on reporting the bug because I wanted to see what Knuth had to
say on the subject.  Well, tonight I finally went to the library and looked
through volume 3 to see his advice. :)

In section 5.4.9 he basically echoes my thinking.

In the simple case where all runs were the same length, he recommends a
queue to track the files to process, with N files pulled from the front of
the queue and the merged output added to the back.  But, he says to add a
number of "dummy" empty files to the front queue.  The net effect is that
fewer than N files are merged the first time.
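
To illustrate with small numbers (my example, not Knuth's): with N = 3 and
4 equal runs, one dummy is added, so the first pass pulls the dummy plus 2
real runs and merges only those 2; that leaves 3 runs, and a single full
3-way merge finishes.  Without the dummy, the first pass would merge 3
runs, and a final 2-way pass would have to reread the big combined run.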

Looking at the merge function in sort.c, it appears that it can merge both
temp files and input files all at the same time.  With that, we probably
want an enhancement to that algorithm, which Knuth called "Huffman's method"
(referencing something in chapter 2).  For files or sort runs of varying
sizes, Knuth says to use a priority queue, to merge the smallest N files at
each pass.  He also says to add a number of 0-length dummy files at the
front of the priority queue so that the first merge pass grabs fewer than N
files.
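
To make the scheduling concrete, here is a minimal standalone sketch of
that Huffman-style strategy (an illustration only, not sort.c code; the
helper names, sample run sizes, and batch size are all invented): run sizes
go into a min-heap, zero-length dummies are added per Knuth's formula, and
each pass merges the N smallest runs.

/* Hypothetical sketch of "Huffman's method" merge scheduling: keep run
   sizes in a min-heap, pad with (1 - F) mod (N - 1) zero-length dummy
   runs, then repeatedly merge the N smallest runs until one remains.  */

#include <stdio.h>
#include <stdlib.h>

/* Restore the min-heap property downward from index I.  */
static void
sift_down (long long *h, size_t n, size_t i)
{
  for (;;)
    {
      size_t m = i, l = 2 * i + 1, r = 2 * i + 2;
      if (l < n && h[l] < h[m]) m = l;
      if (r < n && h[r] < h[m]) m = r;
      if (m == i) break;
      long long t = h[i]; h[i] = h[m]; h[m] = t;
      i = m;
    }
}

/* Insert V, sifting it up to its place.  */
static void
push (long long *h, size_t *n, long long v)
{
  size_t i = (*n)++;
  h[i] = v;
  while (i > 0 && h[(i - 1) / 2] > h[i])
    {
      long long t = h[i]; h[i] = h[(i - 1) / 2]; h[(i - 1) / 2] = t;
      i = (i - 1) / 2;
    }
}

/* Remove and return the smallest element.  */
static long long
pop_min (long long *h, size_t *n)
{
  long long min = h[0];
  h[0] = h[--*n];
  sift_down (h, *n, 0);
  return min;
}

int
main (void)
{
  long long nmerge = 8;                        /* --batch-size N */
  long long sizes[] = { 500, 400, 300, 900, 10, 700, 800, 600, 650 };
  long long f = sizeof sizes / sizeof *sizes;  /* F = 9 starting runs */

  /* d = (1 - F) mod (N - 1), kept nonnegative despite C's % on negatives.  */
  long long k = nmerge - 1;
  long long d = ((1 - f) % k + k) % k;         /* 6 dummies for F=9, N=8 */

  size_t n = 0;
  long long *h = malloc ((f + d) * sizeof *h);
  for (long long i = 0; i < f; i++) push (h, &n, sizes[i]);
  for (long long i = 0; i < d; i++) push (h, &n, 0);

  /* Each pass merges the N smallest runs; thanks to the dummies, the
     first pass touches only the smallest real runs, and every later
     pass is a full N-way merge.  */
  while (n > 1)
    {
      long long merged = 0, real = 0;
      for (long long i = 0; i < nmerge && n > 0; i++)
        {
          long long s = pop_min (h, &n);
          merged += s;
          real += (s != 0);
        }
      printf ("pass merges %lld real run(s) into %lld units\n", real, merged);
      push (h, &n, merged);
    }
  free (h);
  return 0;
}

On the nine sample sizes above, this prints a first pass that merges only
the two smallest runs (10 and 300 units), then one full 8-way pass, which
is exactly the schedule Knuth describes.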

If I understand his formula correctly, he suggests adding (1 - F) mod (N - 1)
dummy runs, where N is --batch-size and F is the starting number of files.
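
For example, with N = 8 and F = 9 (the nine-file scenario above), that's
(1 - 9) mod 7 = 6 dummy runs: the first pass pulls the 6 dummies plus the 2
smallest real files and merges just those 2, leaving exactly 8 files for a
single full 8-way final merge.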

To get it ideal, we'd need a priority queue implementation here, maybe a
heap or something.  (Irony of having to maintain a sorted list of files in a
sort program is duly noted...)  It would be an improvement, though, to
handle merging a smaller batch of files at the beginning, the equivalent of
adding 0-byte dummy files at the front.  That might be quicker to implement,
unless there's a GPL'd heap library lying around somewhere...

Jason B.

-- 
Half the harm that is done in this world is due to people who want to feel
important.
	-- T. S. Eliot




Information forwarded to bug-coreutils <at> gnu.org:
bug#13354; Package coreutils. (Fri, 04 Jan 2013 11:03:01 GMT)

Message #8 received at 13354 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Jason Bucata <jbucata <at> windstream.net>
Cc: 13354 <at> debbugs.gnu.org
Subject: Re: bug#13354: sort: File Merge Order is Suboptimal with Many Passes
Date: Fri, 04 Jan 2013 11:01:50 +0000
On 01/04/2013 04:07 AM, Jason Bucata wrote:
> To get it ideal, we'd need a priority queue implementation here, maybe a
> heap or something.  (Irony of having to maintain a sorted list of files in a
> sort program is duly noted...)  It would be an improvement, though, to
> handle merging a smaller batch of files at the beginning, the equivalent of
> adding 0-byte dummy files at the front.  That might be quicker to implement,
> unless there's a GPL'd heap library lying around somewhere...

There is a little heap lib already used by sort:
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/heap.c;hb=HEAD
Would that suffice?

thanks,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#13354; Package coreutils. (Fri, 04 Jan 2013 15:52:49 GMT)

Message #11 received at 13354 <at> debbugs.gnu.org:

From: Jason Bucata <jbucata <at> windstream.net>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: 13354 <at> debbugs.gnu.org
Subject: Re: bug#13354: sort: File Merge Order is Suboptimal with Many Passes
Date: Fri, 4 Jan 2013 09:48:27 -0600
On Fri, Jan 04, 2013 at 11:01:50AM +0000, Pádraig Brady wrote:
> On 01/04/2013 04:07 AM, Jason Bucata wrote:
> > To get it ideal, we'd need a priority queue implementation here, maybe a
> > heap or something.  (Irony of having to maintain a sorted list of files in a
> > sort program is duly noted...)  It would be an improvement, though, to
> > handle merging a smaller batch of files at the beginning, the equivalent of
> > adding 0-byte dummy files at the front.  That might be quicker to implement,
> > unless there's a GPL'd heap library lying around somewhere...
> 
> There is a little heap lib already used by sort:
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/heap.c;hb=HEAD
> Would that suffice?
> 
> thanks,
> Pádraig.

Oh, good to know.  If you're asking me, I'm sure it's fine, though I guess
it's up to whoever will write the fix.

I don't know what it would take for me to be able to get a copyright
disclaimer from my employer, so I'm probably not in a position to develop a
patch for this myself.  I'm hoping some established developer will see this
and be able to step in.

Jason B.

-- 
Half the harm that is done in this world is due to people who want to feel
important.
	-- T. S. Eliot




Information forwarded to bug-coreutils <at> gnu.org:
bug#13354; Package coreutils. (Thu, 18 Oct 2018 23:14:02 GMT)

Message #14 received at 13354 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Jason Bucata <jbucata <at> windstream.net>
Cc: 13354 <at> debbugs.gnu.org
Subject: Re: bug#13354: sort: File Merge Order is Suboptimal with Many Passes
Date: Thu, 18 Oct 2018 17:13:08 -0600
tags 13354 moreinfo
close 13354
stop

(triaging old bugs)

Hello,

On 04/01/13 08:48 AM, Jason Bucata wrote:
> On Fri, Jan 04, 2013 at 11:01:50AM +0000, Pádraig Brady wrote:
>> On 01/04/2013 04:07 AM, Jason Bucata wrote:
>>> To get it ideal, we'd need a priority queue implementation here, maybe a
>>> heap or something.
>>
>> There is a little heap lib already used by sort:
>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/heap.c;hb=HEAD
>> Would that suffice?
> 
> Oh, good to know.  If you're asking me, I'm sure it's fine, though I guess
> it's up to whoever will write the fix.

With no further follow-ups in 5 years, I'm closing this item.
Discussion can continue by replying to this thread.

regards,
 - assaf




Added tag(s) moreinfo. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 18 Oct 2018 23:14:02 GMT)

Bug closed; send any further explanations to 13354 <at> debbugs.gnu.org and Jason Bucata <jbucata <at> windstream.net>. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 18 Oct 2018 23:14:02 GMT)

Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 16 Nov 2018 12:24:06 GMT)



