GNU bug report logs - #13089
Wish: split every n'th into n pipes


Package: coreutils

Reported by: Ole Tange <tange <at> gnu.org>

Date: Wed, 5 Dec 2012 17:10:02 UTC

Severity: normal

Tags: notabug

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 13089 in the body.
You can then email your comments to 13089 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#13089; Package coreutils. (Wed, 05 Dec 2012 17:10:02 GMT)

Acknowledgement sent to Ole Tange <tange <at> gnu.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 05 Dec 2012 17:10:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ole Tange <tange <at> gnu.org>
To: bug-coreutils <at> gnu.org
Subject: Wish: split every n'th into n pipes
Date: Wed, 5 Dec 2012 15:11:32 +0100

I often have data that can be processed in parallel.

It would be great if split --filter could look at every n'th line
instead of chunking into n chunks:

  cat bigfile | split --every-nth -n 8 --filter "grep foo"

The above should start 8 greps and give each a line in round robin manner.

Ideally it should be possible to do so non-blocking so if some lines
take longer for one instance of grep, then the rest of the greps are
not blocked.


/Ole




Information forwarded to bug-coreutils <at> gnu.org:
bug#13089; Package coreutils. (Wed, 05 Dec 2012 18:53:02 GMT)

Message #8 received at 13089 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Ole Tange <tange <at> gnu.org>
Cc: 13089 <at> debbugs.gnu.org
Subject: Re: bug#13089: Wish: split every n'th into n pipes
Date: Wed, 05 Dec 2012 18:52:29 +0000

tag 13089 + notabug
close 13089

On 12/05/2012 02:11 PM, Ole Tange wrote:
> I often have data that can be processed in parallel.
>
> It would be great if split --filter could look at every n'th line
> instead of chunking into n chunks:
>
>    cat bigfile | split --every-nth -n 8 --filter "grep foo"
>
> The above should start 8 greps and give each a line in round robin manner.
>
> Ideally it should be possible to do so non-blocking so if some lines
> take longer for one instance of grep, then the rest of the greps are
> not blocked.

So that's mostly supported already (notice the r/ below):

$ seq 8000 | split -n r/8 --filter='wc -l' | uniq -c
      8 1000

The concurrency is achieved through standard I/O buffers
between split and the filters (note also the -u split option).
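
The round-robin behaviour described above can be checked on a small input. The following is a minimal sketch, assuming GNU split; the `workdir` scratch directory and the `part.` prefix are made up for illustration. It relies on split setting `$FILE` to the output name for each filter instance:

```shell
# Round-robin distribution with GNU split: consecutive input lines
# are handed to the filters in turn.
set -eu
workdir=$(mktemp -d)   # hypothetical scratch directory

# Single quotes keep $FILE unexpanded here; split sets it for each filter
# to the name the output file would have had (part.aa, part.ab, ...).
seq 10 | split -n r/2 --filter='cat > "$FILE"' - "$workdir/part."

first=$(paste -sd' ' "$workdir/part.aa")
second=$(paste -sd' ' "$workdir/part.ab")
echo "part.aa: $first"
echo "part.ab: $second"
```

With two filters, the odd-numbered lines should end up in `part.aa` and the even-numbered lines in `part.ab`.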

I'm not sure non-blocking I/O would be of much benefit,
since the filters will be the same, and if we did that,
then we'd have to worry about internal buffering in split.
We had a similar question about tee, yesterday, and I
think the answer is the same here, that the complexity
doesn't seem warranted for such edge cases.

thanks,
Pádraig.




Added tag(s) notabug. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Wed, 05 Dec 2012 19:32:02 GMT)

bug closed, send any further explanations to 13089 <at> debbugs.gnu.org and Ole Tange <tange <at> gnu.org> Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Wed, 05 Dec 2012 19:32:02 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#13089; Package coreutils. (Thu, 06 Dec 2012 13:03:02 GMT)

Message #15 received at 13089 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Ole Tange <tange <at> gnu.org>
Cc: 13089 <at> debbugs.gnu.org
Subject: Re: bug#13089: Wish: split every n'th into n pipes
Date: Thu, 06 Dec 2012 13:02:34 +0000

On 12/06/2012 12:20 PM, Ole Tange wrote:
> On Thu, Dec 6, 2012 at 12:41 PM, Pádraig Brady <P <at> draigbrady.com> wrote:
>> On 12/06/2012 11:25 AM, Pádraig Brady wrote:
>>> On 12/06/2012 12:06 AM, Ole Tange wrote:
>>>>
>>>> Do you have a similar reference:
>>>>
>>>> * if each record is k lines (e.g. 4 lines as is the case in FASTQ files)
>>>> * If each record has a record separator (e.g. > in FASTA files)
>>>
>>> I'd probably preprocess first to a single line:
>>>
>>> The following may not be robust or efficient.
>>> I suspect there may be tools already to efficiently
>>> parse fast[aq] to a single line:
>>>
>>>     fastalines(){ sed -n '/^>/!{H;$!b};s/$/\x00/;x;1b;s/\n//g;p'; }
>>>     fastqlines(){ sed -n '/^@/!{H;$!b};s/$/\x00/;x;1b;s/\n//g;p'; }
>>>
>>> Then use like:
>>>
>>>   fasta_source | fastalines |
>>>   split -n r/8 --filter='tr "\0" "\n" | process_fasta'
>
> Here you assume that the quality score never reaches '@'. You cannot
> do that, because it sometimes reaches @. The only thing you can be
> sure of is every record is 4 lines.

Sure. I mentioned they might not be robust. These may be better:

fastalines(){ sed '1!s/^>/\x00&/' | tr '\n\0' '\0\n'; }
fastqlines(){ paste -d $'\1' - - - - | tr '\1' '\0'; }
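
The 4-line record approach can be tried end to end on a tiny input. This is a minimal sketch, assuming GNU split; the two sample reads, the `workdir` scratch directory and the `rec.` prefix are made up for illustration, and `printf '\001'` is used as a portable spelling of `$'\1'`. The second record deliberately has a quality line starting with `@`:

```shell
# Distribute 4-line FASTQ records round robin without splitting a record.
set -eu
workdir=$(mktemp -d)   # hypothetical scratch directory

# Two made-up records; the quality line of the second one starts with '@'.
printf '%s\n' '@r1' 'ACGT' '+' 'IIII' '@r2' 'TTGA' '+' '@III' > "$workdir/in.fq"

# Join every 4 input lines into one line, with NUL in place of the
# record's internal newlines.
ctrl_a=$(printf '\001')
fastqlines() { paste -d "$ctrl_a" - - - - | tr '\001' '\0'; }

# Each filter turns the NULs back into newlines before storing its share.
fastqlines < "$workdir/in.fq" |
  split -n r/2 --filter='tr "\0" "\n" > "$FILE"' - "$workdir/rec."

rec_a=$(paste -sd' ' "$workdir/rec.aa")
rec_b=$(paste -sd' ' "$workdir/rec.ab")
echo "rec.aa: $rec_a"
echo "rec.ab: $rec_b"
```

Each output file should hold one intact record, since split only ever sees one line per record.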

> I was hoping for a general solution that would work no matter the
> content. Your solution breaks if the content contain \0 (NULs are not
> in FAST[AQ] files, but may be in other formats).

Fair point, but you can use the general technique
of transforming (encoding) NULs to something else
before processing, in the unlikely case they're
present in the input.
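
One way to do such an encoding is a reversible escape. This is only a sketch under stated assumptions: the scheme (backslash becomes `\\`, NUL becomes `\0`) and the function names `encode_nul` and `decode_nul` are made up here, not taken from the thread, and the `\x00` and `\n` escapes require GNU sed:

```shell
# Reversible escaping of NUL bytes so NUL is free to use as a record marker.
set -eu
workdir=$(mktemp -d)   # hypothetical scratch directory

# Escape backslashes first, then replace each NUL with the two characters \0.
encode_nul() { sed 's/\\/\\\\/g; s/\x00/\\0/g'; }

# Undo it: park escaped backslashes as newlines (which cannot occur inside
# sed's pattern space), restore the NULs, then restore the backslashes.
decode_nul() { sed 's/\\\\/\n/g; s/\\0/\x00/g; s/\n/\\/g'; }

# Sample line holding a real NUL, a literal "\0" and two backslashes.
printf 'a\0b\\0c\\\\d\n' > "$workdir/in"
encode_nul < "$workdir/in" > "$workdir/enc"
decode_nul < "$workdir/enc" > "$workdir/out"

cmp "$workdir/in" "$workdir/out" && echo "round trip OK"
```

The encoded stream contains no NUL bytes, so the record-joining helpers can be applied to it and `decode_nul` run inside the filter afterwards.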

> Do you see support coming for n-line records in split?

Given the above options, probably not.
Maybe we could add support for --zero-terminated
to treat \0 as the delimiter rather than \n,
which might simplify the postprocessing required?

> Do you see support coming for records split on regexp in split?

Given the complexity, probably not.
regexps would be better maintained within sed etc.
which could do the annotation for later splitting.

Note also the `csplit` util, but I don't see us updating
that to support a fixed number of outputs like `split` either.
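
For comparison, here is a minimal sketch of what `csplit` does with a record separator, assuming GNU csplit (the `{*}` repeat count is a GNU extension); the sample FASTA data, the `workdir` scratch directory and the `rec.` prefix are made up for illustration. It writes one file per record rather than a fixed number of outputs:

```shell
# Split a FASTA-like file at every line starting with '>'.
set -eu
workdir=$(mktemp -d)   # hypothetical scratch directory

printf '%s\n' '>seq1' 'ACGT' 'ACGT' '>seq2' 'TTGA' > "$workdir/in.fa"

# -s: do not print piece sizes
# -z: drop the empty piece that precedes the first '>' line
# '{*}': repeat the pattern until the input is exhausted
csplit -s -z -f "$workdir/rec." "$workdir/in.fa" '/^>/' '{*}'

count=$(ls "$workdir"/rec.* | wc -l)
echo "pieces: $count"
```

Each piece starts at a `>` line, so the number of output files grows with the number of records.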

cheers,
Pádraig.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 04 Jan 2013 12:24:03 GMT)



