GNU bug report logs - #11950
cp: Recursively copy ordered for maximal reading speed

Package: coreutils;

Reported by: Michael <codejodler <at> gmx.ch>

Date: Mon, 16 Jul 2012 15:26:02 UTC

Severity: normal

Tags: moreinfo, notabug

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 11950 in the body.
You can then email your comments to 11950 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#11950; Package coreutils. (Mon, 16 Jul 2012 15:26:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Michael <codejodler <at> gmx.ch>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 16 Jul 2012 15:26:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Michael <codejodler <at> gmx.ch>
To: bug-coreutils <at> gnu.org
Subject: cp: Recursively copy ordered for maximal reading speed
Date: Mon, 16 Jul 2012 01:53:06 +0200
Hello,

After coding several backup tools, something has been on my mind for years. When 'cp' recursively copies files from magnetic hard disks (commonly named after their adapter or bus: SATA, IDE, and the like; I'm not talking about solid state), it seems to pick up the files in 'raw' order, just as the disk buffer spits them out (as if 'in one head move'), or so it appears. The order does not resemble any alphabetical order; it does not even stay within the same parent folder, but flits back and forth as the files come in.

I suppose that's the fastest order for reading the source. However, one could consider another 'maximal speed': that of later read access to the copied files.

(Among the reasons that files are not sorted physically on disk are the gap-optimizing code in FS drivers and user actions such as deleting single files or moving them to another place. It could be called 'physical folder fragmentation', something that happens sooner or later anyway if you work on files. I'd like to propose a way to avoid this specific fragmentation when copying.)

For example, take a large image gallery, sorted into several folders, with all files named so that they sort alphanumerically. This is a standard example. Now what will file managers or image viewers do with these files? They will read in one folder's contents and display the files sorted alphanumerically. Usually they even create thumbnails, so they really access each file separately, in that order.
This causes quite a few disk head moves, because the files are not stored in that order physically on disk. That means it is slow, even if the disk is fast and has a fast buffer, compared with the rare case where the files are stored physically in exactly their access order. I hope the idea is clear.

My proposal is a recursive 'ordered' mode, in which cp copies the files of each folder in alphanumeric order (which should be the view mode in 99% of all cases out there). It would slow down the copy process a bit, for the benefit of later reading speed.
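The proposed ordering can be approximated today with existing tools. The following is only a minimal sketch, assuming GNU find, sort, xargs and cp; the function name copy_ordered is made up for illustration and is not a cp option. It copies regular files only, skips empty directories, and sorts full paths bytewise, so the files of a subfolder are copied between those of its parent. Whether creation order influences the on-disk layout at all depends on the destination filesystem.

```shell
# Hypothetical helper: copy the regular files under $1 into $2,
# creating them in bytewise-sorted path order instead of directory order.
copy_ordered() {
    src=$1
    mkdir -p "$2" || return 1
    # Resolve the destination to an absolute path before changing directory.
    dst=$(cd "$2" && pwd) || return 1
    (
        cd "$src" || exit 1
        # NUL-separated names survive spaces and newlines in file names.
        find . -type f -print0 | LC_ALL=C sort -z |
            xargs -0 -r cp --parents -p -t "$dst" --
    )
}

# Usage:
# copy_ordered /path/to/gallery /path/to/backup/gallery
```

xargs may split a long list across several cp invocations, but the overall creation order stays the sorted one.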

Now you may ask what this is good for. Aren't backups just backups, which no one ever opens regularly with file managers or viewers?
But 'cp' is not used only for backups. It is also used to copy files from the camera chip to the hard disk in the first place, or to copy them over to network drives. I believe it serves as the backend in most desktop applications anyway, and probably on most servers too.

It is still true that most people want maximal copy speed, not maximal reading speed. But maybe that's partly because they don't know the choice even exists. If there were such a recursive option, then backup or download tools could at least offer it in their settings too. I would certainly use it in my backup code, because I'm dealing with massive backups, where (perhaps counterintuitively) copy speed does not matter so much, precisely because the backup takes hours anyway. I do not need speed when backing up. I need speed when reading.

I'm a DJ with a huge music collection, and also a prolific photographer who makes lots of movie clips too. I have been doing backups for more than 10 years, and I am absolutely sure about this choice. I just think that there is a grain of meaning in my proposal.

I'm not on any bug list; I hope this can be accepted just as a mail. Let me know if and how I can do it better.


Kind regards, Michael

Information forwarded to bug-coreutils <at> gnu.org:
bug#11950; Package coreutils. (Mon, 16 Jul 2012 20:59:02 GMT) Full text and rfc822 format available.

Message #8 received at 11950 <at> debbugs.gnu.org (full text, mbox):

From: "Alan Curry" <pacman-cu <at> kosh.dhis.org>
To: codejodler <at> gmx.ch (Michael)
Cc: 11950 <at> debbugs.gnu.org
Subject: Re: bug#11950: cp: Recursively copy ordered for maximal reading speed
Date: Mon, 16 Jul 2012 15:52:42 -0500 (GMT+5)
Michael writes:
> 
> Hello,
> 
> After coding several backup tools, something has been on my mind for years. When 'cp' recursively copies files from magnetic hard disks (commonly named after their adapter or bus: SATA, IDE, and the like; I'm not talking about solid state), it seems to pick up the files in 'raw' order, just as the disk buffer spits them out (as if 'in one head move'), or so it appears. The order does not resemble any alphabetical order; it does not even stay within the same parent folder, but flits back and forth as the files come in.

[grumble at User-Agent: claws-mail.org: One line per paragraph isn't good
mail formatting!]

It's called directory order. It used to be simply order of creation of
files, with deletions creating gaps that could be filled by later
creations with same-length or shorter names.
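To see directory order for yourself, compare an unsorted listing with a sorted one. This is a small sketch assuming GNU ls, whose -U option lists entries without sorting; the function names are only for illustration.

```shell
# Names in directory order: GNU ls -U does not sort its output.
dir_order() {
    ls -1U -- "${1:-.}"
}

# The same names sorted bytewise (C locale), close to what a glob returns.
sorted_order() {
    LC_ALL=C ls -1 -- "${1:-.}"
}

# Usage: show where the two orders differ for the current directory.
# dir_order . > /tmp/dir.txt; sorted_order . > /tmp/sorted.txt
# diff /tmp/dir.txt /tmp/sorted.txt
```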

But on most new filesystems, directories are stored in a non-linear
structure so that lookups in a large directory don't have to scan
through every name. For ext2/ext3/ext4, run tune2fs -l on the block
device and look for dir_index among the filesystem features.
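That check can be scripted. A minimal sketch, assuming the usual "Filesystem features:" line in tune2fs -l output; has_dir_index is a made-up helper name and /dev/sdXN is a placeholder for the real block device.

```shell
# Succeeds if the `tune2fs -l` output on stdin lists the dir_index feature.
has_dir_index() {
    grep '^Filesystem features:' | grep -qw 'dir_index'
}

# Usage (as root; replace /dev/sdXN with the actual device):
# tune2fs -l /dev/sdXN | has_dir_index && echo "dir_index enabled"
```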

If you're copying files onto a filesystem with dir_index enabled, the
order in which cp creates them should have little effect on the
directory's layout afterward. If you're not using dir_index on the
destination filesystem, there's your problem! Enable dir_index and all
directory lookups will be fast.

None of this has anything to do with where the actual data blocks of the
file will be allocated. There's no way to control that. If you think
that the second file created is going to be adjacent to the first file
created... that's never been guaranteed. Filesystem block allocators are
way more mysterious than that.

If you really think there's something to be gained here, prove it: start
with a directory with a lot of files but no subdirectories. Do an
alphabetical-order copy like this:

$ mkdir other_directory ; cp ./* other_directory

(The glob returns the names in sorted order so this gives you the
creation order you want, unlike cp -r)

Then get it all out of cache so the read test will hit the disk as much
as possible (writing to drop_caches requires root):

# sync ; echo 3 > /proc/sys/vm/drop_caches

And read back the files:

$ cd other_directory ; time cat ./* > /dev/null

Now repeat, but using cp -r to create the other directory so the files
get copied in the source directory order. And repeat again, but using

$ find . -type f -exec cat '{}' + > /dev/null

instead of the cat ./* (the glob will cat the files in sorted order, the
find will use directory order).

If there are any significant differences in the times, and dir_index is
enabled, you're onto something. With dir_index disabled, you should get
worse times all around, but not a lot worse if the files are big enough
that the time spent reading their contents overshadows the time spent on
directory lookups.
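The experiment above can be collected into a few shell functions. This is only a sketch of the steps as described, not a rigorous benchmark; the function names are illustrative, the source directory is assumed to contain only regular files, and dropping the cache works only on Linux as root.

```shell
# Copy the files of $1 into a new directory $2 in sorted (glob) order.
copy_sorted() {
    mkdir -p "$2" && cp -- "$1"/* "$2"/
}

# Copy $1 to $2 with cp -r, i.e. in source directory order.
# $2 must not exist yet, otherwise cp creates $2/$(basename $1) instead.
copy_dir_order() {
    cp -r -- "$1" "$2"
}

# Read every file in $1 in sorted order.
read_sorted() {
    ( cd "$1" && cat ./* > /dev/null )
}

# Read every file in $1 in directory order.
read_dir_order() {
    ( cd "$1" && find . -type f -exec cat '{}' + > /dev/null )
}

# Flush dirty pages and drop the page cache; warn if that is not permitted.
drop_caches() {
    sync
    if [ -w /proc/sys/vm/drop_caches ]; then
        echo 3 > /proc/sys/vm/drop_caches
    else
        echo "warning: cannot drop caches (not root?); reads may come from cache" >&2
    fi
}

# Usage sketch (bash, as root, so that `time` can wrap a shell function):
#   copy_sorted src sorted_copy; copy_dir_order src raw_copy
#   drop_caches; time read_sorted sorted_copy
#   drop_caches; time read_dir_order sorted_copy
#   drop_caches; time read_sorted raw_copy
#   drop_caches; time read_dir_order raw_copy
```

Running each timing several times and on a directory much larger than the disk's own buffer should make any difference easier to separate from noise.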

-- 
Alan Curry




Information forwarded to bug-coreutils <at> gnu.org:
bug#11950; Package coreutils. (Tue, 17 Jul 2012 04:33:01 GMT) Full text and rfc822 format available.

Message #11 received at 11950 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: "Alan Curry" <pacman-cu <at> kosh.dhis.org>
Cc: Michael <codejodler <at> gmx.ch>, 11950 <at> debbugs.gnu.org
Subject: Re: bug#11950: cp: Recursively copy ordered for maximal reading speed
Date: Tue, 17 Jul 2012 06:26:35 +0200
tags 11950 moreinfo
thanks

Alan Curry wrote:
...
> It's called directory order. It used to be simply order of creation of
> files, with deletions creating gaps that could be filled by later
> creations with same-length or shorter names.

Thanks for the report, Michael,
and thanks for replying, Alan.

Michael, you may have noticed that your email automatically
created an "issue" in our bug tracker: any email discussion on
this thread ends up being archived here: http://bugs.gnu.org/11950
Please let us know how your experiments go.




Added tag(s) moreinfo. Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Tue, 17 Jul 2012 04:33:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-coreutils <at> gnu.org:
bug#11950; Package coreutils. (Sat, 15 Sep 2012 10:16:02 GMT) Full text and rfc822 format available.

Message #16 received at 11950 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: 11950 <at> debbugs.gnu.org
Subject: Re: bug#11950: cp: Recursively copy ordered for maximal reading speed
Date: Sat, 15 Sep 2012 12:14:43 +0200
tags 11950 notabug
close 11950
thanks

Thanks for your interest.
Since this is not a bug in coreutils, I'm marking this issue as such
(notabug) and closing it.  Any additional discussion is still fine
and will be archived along with the rest at http://bugs.gnu.org/11950.




Added tag(s) notabug. Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Sat, 15 Sep 2012 10:16:02 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 11950 <at> debbugs.gnu.org and Michael <codejodler <at> gmx.ch> Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Sat, 15 Sep 2012 10:16:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 13 Oct 2012 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 11 years and 199 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.