GNU bug report logs - #14224
Feature request for the `cut`: record delimiter

Package: coreutils;

Reported by: George Brink <siberianowl <at> gmail.com>

Date: Wed, 17 Apr 2013 22:40:02 UTC

Severity: wishlist

To reply to this bug, email your comments to 14224 AT debbugs.gnu.org.


Report forwarded to bug-coreutils <at> gnu.org:
bug#14224; Package coreutils. (Wed, 17 Apr 2013 22:40:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to George Brink <siberianowl <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 17 Apr 2013 22:40:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: George Brink <siberianowl <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: Feature request for the `cut`: record delimiter
Date: Wed, 17 Apr 2013 17:26:16 -0400
Hello,

I have a task of extracting several "fields" from a text file. The
standard `cut` tool could be the perfect tool for the job, but...
In my file the '\n' character is a legal symbol inside fields, and therefore
the text file uses another symbol as the record separator. And `cut` has a
hard-coded '\n' as the record separator (I just checked the source from the
coreutils-8.21 package).
The fix for this should be a simple one. I can probably make it myself, but
where should I send the patch?
The README in coreutils suggests reading README-hacking and HACKING for
guidelines on making a patch, but there are no such files in the
coreutils-8.21.tar.xz.

Information forwarded to bug-coreutils <at> gnu.org:
bug#14224; Package coreutils. (Wed, 17 Apr 2013 23:14:02 GMT) Full text and rfc822 format available.

Message #8 received at 14224 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: George Brink <siberianowl <at> gmail.com>
Cc: 14224 <at> debbugs.gnu.org
Subject: Re: bug#14224: Feature request for the `cut`: record delimiter
Date: Wed, 17 Apr 2013 17:09:13 -0600
severity 14224 wishlist
thanks

George Brink wrote:
> I have a task of extracting several "fields" from the text file. The
> standard `cut` tool could be a perfect tool for a job, but...

Thank you for the bug report.  However, note that 'cut' is often not
the right tool for the job.  Almost always when people want more than
cut offers, it turns out that they should be using awk or another tool.

> In my file the '\n' character is a legal symbol inside fields and therefore
> the text file uses other symbol for record-separator.

Then it isn't a text file.  By definition.

  http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html

    3.392 Text File

    A file that contains characters organized into one or more lines.
    The lines do not contain NUL characters and none can exceed
    {LINE_MAX} bytes in length, including the <newline>.  Although IEEE
    Std 1003.1-2001 does not distinguish between text files and binary
    files (see the ISO C standard), many utilities only produce
    predictable or meaningful output when operating on text files.  The
    standard utilities that have such restrictions always specify "text
    files" in their STDIN or INPUT FILES sections.

  http://pubs.opengroup.org/onlinepubs/009695399/utilities/cut.html

  INPUT FILES

    The input files shall be text files, except that line lengths
    shall be unlimited.

Of course GNU isn't Unix (nor POSIX) and we can extend them usefully
if it makes sense to do so.  However creeping featurism is the Evil
and therefore will need discussion and justification.

Could you please give a description of your input syntax in more
detail?  Usually people will suggest a better tool for the job and
that often solves the problem immediately.

> The fix for this should be a simple one. I can probably make it
> myself but where to send the patch?

Since it isn't a bug then it isn't a "fix".  It would be an enhancement.
I have set the bug severity appropriately.

> The README in coreutils suggests to read README-hacking and HACKING for
> guide-lines on making a patch, but there are no such files in the
> coreutils-8.21.tar.xz.

Anyone working on the source code is expected to be working from the
version control files, because the pace of change is rapid and doing
so just makes it easier all around.

Here is the current HACKING file from the vcs online web frontend.

  http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=HACKING;hb=HEAD

Please read through that document.  It should give you all of the
information you need to submit patches to the project.  Be sure to
read the "Copyright assignment" section so that it doesn't come as a
surprise later after a lot of work has been put into it.  Any
non-trivial contribution needs an assignment and it is good to get
that started early.

If you have any questions please ask them.  Since this bug is already
created it is okay to follow-up with questions here.  Please keep the
bug log address in the recipient list.

But if you are asking questions or generating random discussion then
please use the coreutils <at> gnu.org mailing list instead of the bug
tracker.  We often spend a lot of time closing bug reports that are
doing nothing but asking questions.

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#14224; Package coreutils. (Wed, 17 Apr 2013 23:30:02 GMT) Full text and rfc822 format available.

Message #11 received at 14224 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Bob Proulx <bob <at> proulx.com>
Cc: 14224 <at> debbugs.gnu.org, George Brink <siberianowl <at> gmail.com>
Subject: Re: bug#14224: Feature request for the `cut`: record delimiter
Date: Wed, 17 Apr 2013 17:24:35 -0600
On 04/17/2013 05:09 PM, Bob Proulx wrote:

In addition to Bob's (highly useful!) comments,

>> The README in coreutils suggests to read README-hacking and HACKING for
>> guide-lines on making a patch, but there are no such files in the
>> coreutils-8.21.tar.xz.
> 
> Anyone working on the source code is expected to be working from the
> version control files.  Because the pace of change is rapid and doing
> so just makes it easier all around.
> 
> Here is the current HACKING file from the vcs online web frontend.
> 
>   http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=HACKING;hb=HEAD

Should we patch README to include this URL to current HACKING contents,
since we don't ship HACKING in our tarballs?  Or, should we reconsider
our position and start shipping HACKING in the tarballs?  Of the
statements currently in README:

> If you obtained this file as part of a "git clone", then see the
> README-hacking file.  If this file came to you as part of a tar archive,
> then see the file INSTALL for compilation and installation instructions.

This one makes sense (HACKING won't be present unless you are working
from git), except that you are not told _how_ to do a "git clone".

> If you would like to suggest a patch, see the files README-hacking
> and HACKING for tips.

But this one doesn't mention anything about the files being git-only.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-coreutils <at> gnu.org:
bug#14224; Package coreutils. (Thu, 18 Apr 2013 01:19:02 GMT) Full text and rfc822 format available.

Message #14 received at 14224 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: George Brink <siberianowl <at> gmail.com>
Cc: 14224 <at> debbugs.gnu.org
Subject: Re: bug#14224: Feature request for the `cut`: record delimiter
Date: Wed, 17 Apr 2013 18:13:49 -0700
On 04/17/2013 02:26 PM, George Brink wrote:
> Hello,
> 
> I have a task of extracting several "fields" from the text file. The
> standard `cut` tool could be a perfect tool for a job, but...
> In my file the '\n' character is a legal symbol inside fields and therefore
> the text file uses other symbol for record-separator. And the `cut` has a
> hard-coded '\n' for record separator (I just checked the source from the
> coreutils-8.21 package).

The patch would be simple but not without compatibility cost.
I.e., scripts using this would immediately become incompatible
with any systems without this feature.

So you'd like something like tac -s, --separator.
However, cut -s is taken, so we'd have to avoid the short -s at least.
Also, tac -s takes a string rather than a character, so
that gives some extra credence (and complexity) to that option there.

Also related would be to support the -z, --zero-terminated option.
join, sort and uniq all have this option to use NUL as the record separator;
however, they're all closely related sort-dependent utilities
and we're trying to unify options between them.

If it is just a character you want to separate on,
then you can always use tr to convert before processing,
albeit with associated data copying overhead.

SEP=^
tr "$SEP"'\n' '\n'"$SEP" | cut ... | tr "$SEP"'\n' '\n'"$SEP"

So given that cut is not special here among the text filters,
and there is a workaround available, I'm 60:40 against
adding this feature.

thanks,
Pádraig.
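
For readers who want to try the tr workaround from the message above end
to end, here is a minimal sketch. The helper name swap_cut, the ':' field
delimiter and the sample data are assumptions made for illustration; they
do not come from the thread. It also assumes the separator is a single
character that tr does not treat specially (so not '-' or '\').

```shell
# Hypothetical helper: run cut on records terminated by a custom character.
# Usage: swap_cut SEP [cut options...]
swap_cut() {
  sep=$1; shift
  # Swap SEP and newline so cut sees one record per line,
  # then swap them back so embedded newlines are restored.
  tr "$sep"'\n' '\n'"$sep" | cut "$@" | tr "$sep"'\n' '\n'"$sep"
}

# Sample data (assumed): two records terminated by '^'; the second field
# of the first record contains an embedded newline.
printf 'a:b\nb:c^d:e:f^' | swap_cut '^' -d: -f1,2
```

Because the two tr calls apply the same swap, newlines embedded in fields
survive the round trip unchanged. If the input does not end with the
separator, cut will still terminate the last line, so the output gains a
trailing separator.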




Information forwarded to bug-coreutils <at> gnu.org:
bug#14224; Package coreutils. (Thu, 18 Apr 2013 15:46:02 GMT) Full text and rfc822 format available.

Message #17 received at 14224 <at> debbugs.gnu.org (full text, mbox):

From: George Brink <siberianowl <at> gmail.com>
To: Pádraig Brady <P <at> draigbrady.com>, 
	Bob Proulx <bob <at> proulx.com>
Cc: 14224 <at> debbugs.gnu.org
Subject: Re: bug#14224: Feature request for the `cut`: record delimiter
Date: Thu, 18 Apr 2013 11:41:17 -0400
Pádraig,

Thank you for the alternative suggestions.
Actually I just found yet another way to solve my problem:
perl -0002 -F"\001" -an -e "print((join \"\001\", @F[0..2,14..46]), \"\002\");" data.dat >new_data.dat
It works fine, but I am a little concerned about the speed. I have over three
hundred such files, from 3 MB to 30 MB each. And this process should be
run every day... I thought that by using cut (which just looks for
delimiters) I could gain a few minutes on the whole process.

Originally I thought of adding "-r, --record-delimiter=DELIM" and
"--output-record-delimiter=DELIM" keys to cut.
Then the example above could be done with
cut -d☺ -r☻ --output-delimiter=☺ --output-record-delimiter=☻ -f1-3,15-47 data.dat >new_data.dat
I think it is feasible and would be more convenient (and hopefully faster)
than using a whole perl or two calls to tr.




Bob,
I understand that you would prefer feature discussions to happen outside the
bug-related mailing list, but here is an extract from the README:
> Mail suggestions and bug reports for these programs to
> the address on the last line of --help output.
And guess what, `cut --help` has the bug-coreutils email in the last
line! The coreutils email is not mentioned inside README at all. And
bug-coreutils is mentioned several times in different contexts.
I apologize for using this mailing list inappropriately, but I did not know
about any other mailing lists.




Information forwarded to bug-coreutils <at> gnu.org:
bug#14224; Package coreutils. (Thu, 18 Apr 2013 16:24:01 GMT) Full text and rfc822 format available.

Message #20 received at 14224 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: George Brink <siberianowl <at> gmail.com>
Cc: 14224 <at> debbugs.gnu.org, Bob Proulx <bob <at> proulx.com>
Subject: Re: bug#14224: Feature request for the `cut`: record delimiter
Date: Thu, 18 Apr 2013 09:18:30 -0700
On 04/18/2013 08:41 AM, George Brink wrote:
> Pádraig,
>
> Thank you for alternative suggestions.
> Actually I just found yet another way to solve my problem:
> perl -0002 -F"\001" -an -e "print((join \"\001\", @F[0..2,14..46]),
> \"\002\");" data.dat >new_data.dat
> It works fine, but I am a little concerned of the speed. I have over three
> hundreds of such files, from 3Mb to 30Mb each. And this process should be
> run every day... I thought that by using cut (which just looks for
> delimiters) I can gain a few minutes on the whole process.
>
> Originally I thought of adding "-r, --record-delimiter=DELIM" and
> "--output-record-delimiter=DELIM" keys to the cut.
> Then the example above could be done with
> cut -d☺ -r☻ --output-delimiter=☺ --output-record-delimiter=☻ -f1-3,15-47
> data.dat >new_data.dat
> I think it is feasible and would be more convenient (and hopefully faster)
> than using a whole perl or two calls to tr.

Yes they're the tradeoffs.
awk is often suggested too as an alternative to cut.

> Bob,
> I understand your desire to receive a discussion of features not inside the
> bug related mail list, but here is a extract from the README:
>> Mail suggestions and bug reports for these programs to
>> the address on the last line of --help output.
> And guess what, the `cut --help` has the bug-coreutils email in the last
> line! The coreutils email is not mentioned inside README at all. And
> bug-coreutils is mentioned several times in different context.
> I apologize for using this mail-list inappropriately, but I did not know
> about any other mail-lists

No worries.  I saw no issue with your mails.
In future cut --help will just point at the
following URL which hopefully is easier to follow:
http://www.gnu.org/software/coreutils/

thanks,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#14224; Package coreutils. (Thu, 18 Apr 2013 17:17:01 GMT) Full text and rfc822 format available.

Message #23 received at 14224 <at> debbugs.gnu.org (full text, mbox):

From: George Brink <siberianowl <at> gmail.com>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: 14224 <at> debbugs.gnu.org
Subject: Re: bug#14224: Feature request for the `cut`: record delimiter
Date: Thu, 18 Apr 2013 13:12:01 -0400
On Thu, Apr 18, 2013 at 12:18 PM, Pádraig Brady <P <at> draigbrady.com> wrote:

>
> awk is often suggested too as an alternative to cut.
>
No, I looked at awk, but it does not have a convenient way to specify lists
of printed fields.
awk -e "BEGIN{FS="☺"; RS="☻"; OFS=FS; ORS=RS;}; {print $1,$2,$3,$15,$16,$17
??? ) }
You got the picture...
It is possible to reimplement cut in awk (and the documentation for awk does
show how), but that would mean writing a separate program, not a
one-liner with a tool from the box.
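
To make the trade-off concrete, here is a rough sketch of what a cut-like
field list looks like in awk. The helper name awk_cut and its list syntax
are hypothetical; the \001 field separator and \002 record separator are
taken from the perl command earlier in the thread. It handles only simple
N and N-M ranges, and it is indeed longer than a one-liner.

```shell
# Hypothetical helper: print selected fields of \002-terminated records
# whose fields are separated by \001.
# Usage: awk_cut LIST [file...]   e.g. awk_cut 1-3,15-47 data.dat
awk_cut() {
  list=$1; shift
  awk -v FS='\001' -v RS='\002' -v list="$list" '
    BEGIN { n = split(list, ranges, ",") }   # "1-3,15-47" -> "1-3", "15-47"
    {
      out = ""; first = 1
      for (r = 1; r <= n; r++) {
        m = split(ranges[r], b, "-")
        lo = b[1]; hi = (m > 1 ? b[2] : b[1])
        # Like cut, skip field numbers beyond the end of the record.
        for (i = lo + 0; i <= hi + 0 && i <= NF; i++) {
          out = out (first ? "" : FS) $i
          first = 0
        }
      }
      printf "%s%s", out, RS                 # keep the record terminator
    }' "$@"
}
```

With this in place, awk_cut 1-3,15-47 data.dat >new_data.dat would cover
the same field selection as the perl command, assuming the separators in
data.dat are as described.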

Information forwarded to bug-coreutils <at> gnu.org:
bug#14224; Package coreutils. (Thu, 18 Apr 2013 19:01:02 GMT) Full text and rfc822 format available.

Message #26 received at 14224 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: George Brink <siberianowl <at> gmail.com>
Cc: 14224 <at> debbugs.gnu.org
Subject: Re: bug#14224: Feature request for the `cut`: record delimiter
Date: Thu, 18 Apr 2013 12:56:21 -0600
George Brink wrote:
> Actually I just found yet another way to solve my problem:
> perl -0002 -F"\001" -an -e "print((join \"\001\", @F[0..2,14..46]), \"\002\");" data.dat >new_data.dat
> It works fine,

I was thinking of Perl's -0 option when I asked if you would say a few
words about the file and task.  But since you hadn't described it yet I
was hesitant to suggest it.

> but I am a little concerned of the speed. I have over three
> hundreds of such files, from 3Mb to 30Mb each. And this process should be
> run every day... I thought that by using cut (which just looks for
> delimiters) I can gain a few minutes on the whole process.

I always recommend benchmarking before optimizing.  Knuth is quoted as
saying, "We should forget about small efficiencies, say about 97% of the
time: premature optimization is the root of all evil".

Don't forget programmer productivity either.  You might shave 10% off
of something now but make it incomprehensible to future admin
maintainers who need to understand it later.  Simply upgrading the
hardware might give a 50% increase in performance.  In which case I
would leave the algorithm simple and more easily understood and not
worry about the performance.  Simple and easy to understand is better
than raw speed.

> Bob,
> I understand your desire to receive a discussion of features not inside the
> bug related mail list, but here is a extract from the README:
> > Mail suggestions and bug reports for these programs to
> > the address on the last line of --help output.
> And guess what, the `cut --help` has the bug-coreutils email in the last
> line! The coreutils email is not mentioned inside README at all. And
> bug-coreutils is mentioned several times in different context.
> I apologize for using this mail-list inappropriately, but I did not know
> about any other mail-lists

As Pádraig said, no worries.  I didn't mean it to sound mean or
snarky.  But I can see that my last sentence did come out that way.
Sorry.

But if I didn't say anything then you wouldn't have said anything and
then we wouldn't have been reminded that the contact address hadn't
been updated in your version.  So it ended well.  The way to get the
word out is by continuing to talk about it.  If people even just read
it in passing then they might be informed for the future.

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#14224; Package coreutils. (Thu, 18 Apr 2013 19:04:02 GMT) Full text and rfc822 format available.

Message #29 received at 14224 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: 14224 <at> debbugs.gnu.org
Cc: George Brink <siberianowl <at> gmail.com>
Subject: Re: bug#14224: Feature request for the `cut`: record delimiter
Date: Thu, 18 Apr 2013 12:58:31 -0600
Eric Blake wrote:
> Should we patch README to include this URL to current HACKING contents,
> since we don't ship HACKING in our tarballs?  Or, should we reconsider
> our position and start shipping HACKING in the tarballs?  Of the
> statements currently in README:
> 
> > If you obtained this file as part of a "git clone", then see the
> > README-hacking file.  If this file came to you as part of a tar archive,
> > then see the file INSTALL for compilation and installation instructions.
> 
> This one makes sense (HACKING won't be present unless you are working
> from git), except that you are not told _how_ to do a "git clone".
> 
> > If you would like to suggest a patch, see the files README-hacking
> > and HACKING for tips.
> 
> But this one doesn't mention anything about the files being git-only.

I think it would definitely make sense to include some information
about the preferred method of getting the source in the main README
file.  That file is usually the one included in downstream
distributions.  It would enable people to bootstrap themselves to the
source.  And GNU is all about access to the source.  So I think that
would make a lot of sense.

Bob




This bug report was last modified 11 years and 7 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.