GNU bug report logs -
#13089
Wish: split every n'th into n pipes
Reported by: Ole Tange <tange <at> gnu.org>
Date: Wed, 5 Dec 2012 17:10:02 UTC
Severity: normal
Tags: notabug
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Report forwarded to bug-coreutils <at> gnu.org: bug#13089; Package coreutils. (Wed, 05 Dec 2012 17:10:02 GMT)
Acknowledgement sent to Ole Tange <tange <at> gnu.org>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 05 Dec 2012 17:10:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
I often have data that can be processed in parallel.
It would be great if split --filter could look at every n'th line
instead of chunking into n chunks:
cat bigfile | split --every-nth -n 8 --filter "grep foo"
The above should start 8 greps and give each a line in round-robin manner.
Ideally it should be possible to do this in a non-blocking way, so that
if some lines take longer for one instance of grep, the rest of the
greps are not blocked.
/Ole
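To make the requested round-robin semantics concrete (the `--every-nth` flag above is the wish, not an existing option): worker k of n would receive lines k, k+n, k+2n, and so on. Worker 1's share of an 8-way round-robin can be sketched with awk:

```shell
# Sketch only: select the lines that worker 1 of 8 would receive
# under the proposed round-robin scheme (lines 1, 9, 17, ...).
seq 24 | awk 'NR % 8 == 1'
```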
Information forwarded to bug-coreutils <at> gnu.org: bug#13089; Package coreutils. (Wed, 05 Dec 2012 18:53:02 GMT)
Message #8 received at 13089 <at> debbugs.gnu.org:
tag 13089 + notabug
close 13089
On 12/05/2012 02:11 PM, Ole Tange wrote:
> I often have data that can be processed in parallel.
>
> It would be great if split --filter could look at every n'th line
> instead of chunking into n chunks:
>
> cat bigfile | split --every-nth -n 8 --filter "grep foo"
>
> The above should start 8 greps and give each a line in round-robin manner.
>
> Ideally it should be possible to do this in a non-blocking way, so that
> if some lines take longer for one instance of grep, the rest of the
> greps are not blocked.
So that's mostly supported already (notice the r/ below):
$ seq 8000 | split -n r/8 --filter='wc -l' | uniq -c
8 1000
The concurrency is achieved through standard I/O buffers
between split and the filters (note also the -u split option).
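As a concrete check of that round-robin distribution (a sketch combining the r/ chunk syntax and the -u option mentioned above; the order in which the filters print may vary):

```shell
# Each of the 8 'wc -l' filters receives exactly 1000 of the 8000
# input lines; -u asks split to copy input to the filters without
# buffering. sort -u collapses the eight identical counts.
seq 8000 | split -n r/8 -u --filter='wc -l' | sort -u
```

This prints `1000` once, confirming every filter saw the same share.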
I'm not sure non-blocking I/O would be of much benefit,
since the filters will be the same, and if we did that,
then we'd have to worry about internal buffering in split.
We had a similar question about tee yesterday, and I
think the answer is the same here, that the complexity
doesn't seem warranted for such edge cases.
thanks,
Pádraig.
Added tag(s) notabug. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Wed, 05 Dec 2012 19:32:02 GMT)
bug closed, send any further explanations to 13089 <at> debbugs.gnu.org and Ole Tange <tange <at> gnu.org>. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Wed, 05 Dec 2012 19:32:02 GMT)
Information forwarded to bug-coreutils <at> gnu.org: bug#13089; Package coreutils. (Thu, 06 Dec 2012 13:03:02 GMT)
Message #15 received at 13089 <at> debbugs.gnu.org:
On 12/06/2012 12:20 PM, Ole Tange wrote:
> On Thu, Dec 6, 2012 at 12:41 PM, Pádraig Brady <P <at> draigbrady.com> wrote:
>> On 12/06/2012 11:25 AM, Pádraig Brady wrote:
>>> On 12/06/2012 12:06 AM, Ole Tange wrote:
>>>>
>>>> Do you have a similar reference:
>>>>
>>>> * if each record is k lines (e.g. 4 lines as is the case in FASTQ files)
>>>> * if each record has a record separator (e.g. > in FASTA files)
>>>
>>> I'd probably preprocess first to a single line:
>>>
>>> The following may not be robust or efficient.
>>> I suspect there may be tools already to efficiently
>>> parse fast[aq] to a single line:
>>>
>>> fastalines(){ sed -n '/^>/!{H;$!b};s/$/\x00/;x;1b;s/\n//g;p'; }
>>> fastqlines(){ sed -n '/^@/!{H;$!b};s/$/\x00/;x;1b;s/\n//g;p'; }
>>>
>>> Then use like:
>>>
>>> fasta_source | fastalines |
>>> split -n r/8 --filter="tr '\0' '\n' | process_fasta"
>
> Here you assume that the quality score never reaches '@'. You cannot
> assume that, because the quality line sometimes starts with '@'. The
> only thing you can be sure of is that every record is 4 lines.
Sure. I mentioned they might not be robust. These may be better:
fastalines(){ sed '1!s/^>/\x00&/' | tr '\n\0' '\0\n'; }
fastqlines(){ paste -d $'\1' - - - - | tr '\1' '\0'; }
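A quick round-trip of the `fastqlines` variant above on an invented 4-line record (the sample data is illustrative; `$'\1'` is bash syntax, spelled with printf here for portability):

```shell
# paste joins every 4 input lines with byte \1; tr then maps the
# joiner to NUL, yielding one NUL-delimited record per 4 lines.
fastqlines(){ paste -d "$(printf '\1')" - - - - | tr '\1' '\0'; }

# Round-trip: mapping NUL back to newline recovers the record.
printf '@r1\nACGT\n+\n!!!!\n' | fastqlines | tr '\0' '\n'
```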
> I was hoping for a general solution that would work no matter the
> content. Your solution breaks if the content contains \0 (NULs are not
> in FAST[AQ] files, but may be in other formats).
Fair point, but you can use the general technique
of transforming (encoding) NULs to something else
before processing, in the unlikely case they're
present in the input.
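One coarse instance of that encode-before-processing technique (base64 is used here purely as an illustration; a real pipeline would encode per record so that line boundaries survive for split):

```shell
# base64 produces a NUL-free byte stream; decoding restores the
# original bytes exactly, NULs included.
printf 'a\0b' | base64                          # prints YQBi
printf 'a\0b' | base64 | base64 -d | od -An -c  # shows a \0 b
```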
> Do you see support coming for n-line records in split?
Given the above options, probably not.
Maybe we could add support for --zero-terminated
to treat \0 as the delimiter rather than \n,
which might simplify postprocessing required?
> Do you see support coming for records split on regexp in split?
Given the complexity, probably not.
regexps would be better maintained within sed etc.
which could do the annotation for later splitting.
Note also the `csplit` util, but I don't see us updating
that to support a fixed number of outputs like `split` either.
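For completeness, the regexp-based splitting that csplit already offers (a sketch; the xx00/xx01 file names are csplit's defaults, and the FASTA-like sample data is invented):

```shell
cd "$(mktemp -d)"                      # scratch directory
printf '>a\nACGT\n>b\nTTTT\n' > recs.txt
# Split before every line matching /^>/; -z drops the empty leading
# piece, -s is quiet, and '{*}' repeats the pattern until input is
# exhausted. Each record lands in its own file: xx00, xx01, ...
csplit -sz recs.txt '/^>/' '{*}'
cat xx00                               # first record: >a / ACGT
```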
cheers,
Pádraig.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 04 Jan 2013 12:24:03 GMT)
This bug report was last modified 11 years and 125 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.