GNU bug report logs - #26879
end-of-line issue with cygwin 4.4-1 sed 4.4


Package: sed;

Reported by: Dick Dunbar <dick.dunbar <at> gmail.com>

Date: Thu, 11 May 2017 15:29:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 26879 in the body.
You can then email your comments to 26879 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Thu, 11 May 2017 15:29:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Dick Dunbar <dick.dunbar <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Thu, 11 May 2017 15:29:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: bug-sed <at> gnu.org
Subject: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 02:23:23 -0700
[Message part 1 (text/plain, inline)]
I've tested and searched and I can't figure this one out.

It's a simple filename quoting filter to handle Windows files
that contain blanks.  Easy stuff.

--- t.out ---
C:\Scan\i .2

--- sedtest.sh ---
#!/bin/bash
echo "1. simple string works"
fn="C:\Scan\i .2"
echo " $fn"
echo $fn   | sed -e "s/^/'/" -e "s/\$/'/"
echo " "
echo "2. against a cat file fails"
cat t.out  | sed -e "s/^/'/" -e "s/\$/'/"
echo " "
echo "3. against the file itself fails"
 sed -e "s/^/'/" -e "s/\$/'/" t.out
echo " "
echo "4. Hex dump of the file shows crlf termination"
od -xc t.out


--- sedtest output ---
$ sedtest.sh
1. simple string works
 C:\Scan\i .2
'C:\Scan\i .2'

2. against a cat file fails
'C:\Scan\i .2

3. against the file itself fails
'C:\Scan\i .2

4. Hex dump of the file shows crlf termination
0000000    3a43    535c    6163    5c6e    2069    322e    0a0d
          C   :   \   S   c   a   n   \   i       .   2  \r  \n

==== Am I doing something wrong, or is this a bug? =====

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Thu, 11 May 2017 17:59:01 GMT) Full text and rfc822 format available.

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Thu, 11 May 2017 17:59:02 GMT) Full text and rfc822 format available.

Notification sent to Dick Dunbar <dick.dunbar <at> gmail.com>:
bug acknowledged by developer. (Thu, 11 May 2017 17:59:02 GMT) Full text and rfc822 format available.

Message #12 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Dick Dunbar <dick.dunbar <at> gmail.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 12:58:12 -0500
[Message part 1 (text/plain, inline)]
tag 26879 notabug
thanks

On 05/11/2017 04:23 AM, Dick Dunbar wrote:
> I've tested and searched and I can't figure this one out.

Correcting the subject line, there is no such thing as cygwin 4.4 (yet)
- cygwin is only at 2.8.0.  Individual programs are versioned
independently from cygwin1.dll.  So your report is about the Cygwin
pre-built binary of sed, where that sed version is 4.4-1.

> 
> It's a simple filename quoting filter to handle Windows files
> that contain blanks.  Easy stuff.
> 
> --- t.out ---
> C:\Scan\i .2
> 
> --- sedtest.sh ---
> #!/bin/bash
> echo "1. simple string works"
> fn="C:\Scan\i .2"
> echo " $fn"

$fn has no carriage return.

> echo $fn   | sed -e "s/^/'/" -e "s/\$/'/"

So this places the ' immediately after the 2.

> echo " "
> echo "2. against a cat file fails"
> cat t.out  | sed -e "s/^/'/" -e "s/\$/'/"

cat preserves line-endings, as does sed.  $ matches ONLY \n (not \r\n)
when in binary mode.  So you are sticking the ' in between \r and \n.
Visually, the way the terminal displays that is that it prints 2, then
rewinds to the beginning of the line, then displays ' (on top of what
was already ' that you inserted at the beginning), and then finally
moves to the next line.
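
A minimal way to check this without relying on how the terminal renders
a carriage return, assuming GNU sed and od are available. The helper
name quote_line is made up for this illustration, and printf stands in
for t.out:

```shell
# quote_line is a hypothetical helper wrapping the two substitutions
# from sedtest.sh.
quote_line() {
    sed -e "s/^/'/" -e "s/\$/'/"
}

# printf writes the same bytes as t.out: the path followed by \r\n.
# od -c shows the byte order instead of letting the terminal
# interpret the carriage return.
printf 'C:\\Scan\\i .2\r\n' | quote_line | od -c
```

With a sed that does not strip \r, the dump should show the closing '
sitting between \r and \n.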

> echo " "
> echo "3. against the file itself fails"
>  sed -e "s/^/'/" -e "s/\$/'/" t.out

Same story.


> 
> ==== Am I doing something wrong, or is this a bug? =====

You are forgetting that sed does NOT ignore \r on binary files.

Cygwin sed used to blindly treat binary files in text mode, but that was
INTENTIONALLY changed in February, in a coordinated move with grep and
awk at the same time.  If you fail to read cygwin release notes, it's
your own fault for being caught off-guard when you do a blind update:
https://cygwin.com/ml/cygwin-announce/2017-02/msg00036.html

Cygwin's goal is to emulate Linux, and Linux has the same behavior (of
NOT ignoring \r by default).  If you want to ignore \r, then explicitly
do so, either by massaging your data, using something like:
 d2u file | sed ...
 tr -d '\r' < file | sed ...
Or, you can use a text-mode mount instead of a binary-mode mount for
hosting file (the cygwin list is a better resource for how to set up a
text-mode mount point).
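
As a variant of the filtering approach above, sed itself can drop a
trailing carriage return before quoting. This is only a sketch: it
assumes GNU sed (which accepts \r in a regex), and quote_crlf is a
hypothetical helper name, not something from the original script:

```shell
# quote_crlf is a hypothetical helper name.
# The first expression deletes one carriage return at the end of each
# line; \r in a regex is a GNU sed extension, so this may not work
# with other sed implementations.
quote_crlf() {
    sed -e 's/\r$//' -e "s/^/'/" -e "s/\$/'/"
}

printf 'C:\\Scan\\i .2\r\n' | quote_crlf
```

Lines that already end in a bare \n pass through the first expression
unchanged, so the same filter can be used on both kinds of input.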

As such, I'm closing this as not an upstream bug.  If you don't like the
intentional change in cygwin behavior, that's something you may want to
bring up on the cygwin list, but there's nothing we can do about it here.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Thu, 11 May 2017 19:20:02 GMT) Full text and rfc822 format available.

Message #15 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 12:19:12 -0700
[Message part 1 (text/plain, inline)]
Thanks Eric,
but I'm a bit confused by the response.

sed is a stream editor;  introducing "binary files" as the reason for a
change that leads to this failure doesn't sound right.

I can't forget what I never knew .. and changing "$" to mean something
other than the end of line because someone has a quarrel with Windows
ridiculous conventions doesn't make sense.

Reading the release notes I still would not have detected a failure like
this.

vi can properly handle crlf in cygwin;  why can't sed?
If it's a mode setting that's needed, then let me make the decision once
for my environment instead of facing these roadblocks in the future.

Changing something that has worked for years and then introducing
yet another filter (tr,d2u) to cover up for that change is silly.

I object to the "notabug" closure ... ok, if it's not a bug then it
certainly is a "mental lapse in judgment".

That's the whole purpose of tools like cygwin;  to bridge the gap between
programming skills between unix and windows.  If the purpose of this change
is to make it painfully obvious that those differences exist ... for no
good purpose ...
I have to question the decision.

To bypass this problem, I simply modified the source of the pgm generating
the filenames to do the quoting.

I won't be touching sed again anytime soon.   Can't easily erase 30 years
of pgming habit.

If I need sed, I'll grab the source to sed and fix it myself.
Perhaps the busybox version works better, or I should permanently
switch to the MobaXterm platform instead.

Regarding version;  here are the first two lines of sed --version
$ sed --version
sed (GNU sed) 4.4
Packaged by Cygwin (4.4-1)

Ok, there is no cygwin 4.4-1 ... but there are hundreds of version
numbers listed in the components of a cygwin distribution.
The closest thing I have to a cygwin version is the setup program.



Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Thu, 11 May 2017 19:55:02 GMT) Full text and rfc822 format available.

Message #18 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Dick Dunbar <dick.dunbar <at> gmail.com>
Cc: 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 14:54:06 -0500
[Message part 1 (text/plain, inline)]
On 05/11/2017 02:19 PM, Dick Dunbar wrote:
> Thanks Eric,
> but I'm a bit confused by the response.
> 
> sed is a stream editor;  introducing "binary files" as the reason for a
> change
> that leads to this failure doesn't sound right.

Again, the change you are complaining about is NOT an upstream change,
but a downstream Cygwin change.  Your comments will reach a more
appropriate audience on the Cygwin list.

The cygwin change was made because there are people using cygwin to
process binary files that were surprised that cygwin sed silently ate \r
in those files, where it did NOT do so on Linux.  It was traced back to
sed FORCEFULLY opening files in text mode, even when the user wanted
binary mode.  Silently eating input, with no recourse, is a form of data
corruption.

Yes, when there's a difference between text and binary mode, it's nice
to be able to choose which mode you want to use.  But that's the point -
it should be a choice, not something where the tool says it knows better
than you and forces on you.  If the choice requires you typing a new
command line option, or filtering (whether via d2u or any other viable
filter), at least you are now in full control of your data without sed
presuming whether \r was irrelevant.

Making a change that you can work around (by filtering your data) was
deemed better than leaving the behavior unchanged (where you corrupted
data, and had no viable workaround).

> 
> If I need sed, I'll grab the source to sed and fix it myself.
> Perhaps the busybox version works better, or I should permanently
> switch to the MobaXterm platform instead.

That's one of the beauties of free software - you are ALLOWED to do
that.  In fact, you can take the cygwin sources, and make a one-line
tweak to tell the linker to include textmode.o as part of linking the
final binary, and YOUR build will instantly be back to ALWAYS ignoring
\r on input (and forcefully adding \r on output, even when input did not
have \r).  Or you can indeed use a different pre-built binary if you
don't like the way the cygwin binary behaves.  When it comes to dealing
with the (non-POSIX) difference between text and binary files, there are
several workarounds, some more invasive than others, and not all
downstream ports agree on which method is best.  The Cygwin port has
decided that it prefers Linux emulation (and binary-file preservation)
as a higher priority than windows interoperability; but other projects,
like MSYS, feel differently and go out of their way to maximize
convenience of windows interoperability rather than blind emulation of
Linux.

> 
> Regarding version;  here are the first two lines of sed --version
> $ sed --version
> sed (GNU sed) 4.4
> Packaged by Cygwin (4.4-1)

That doesn't mean that Cygwin's version is 4.4-1, but that your binary
was part of the sed-4.4-1 package from the Cygwin distro.  To learn the
version of cygwin1.dll, you can use 'uname -a', or 'cygcheck -c cygwin'.

> Ok, there is no cygwin 4.4-1 ... but there are hundreds of version
> numbers listed in the components of a cygwin distribution.
> The closest thing I have to a cygwin version is the setup program.

Not to make it more confusing, but cygwin's setup.exe is also versioned
independently from cygwin1.dll.  Some other distros have an overarching
"version" (such as Fedora 25, where I'm typing this email); the Cygwin
project does not choose to use such an overall number.  But even on
distros with an overall number, you still have lots of individual
packages, each with their own independent version number.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Thu, 11 May 2017 20:14:02 GMT) Full text and rfc822 format available.

Message #21 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Eric Blake <eblake <at> redhat.com>, Dick Dunbar <dick.dunbar <at> gmail.com>
Cc: 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 16:13:24 -0400
Hello all,


Eric,
Just to verify (since I'm not very familiar with cygwin, nor with the
recent sed/cygwin changes):

If one wants the old sed behavior on cygwin (automatic
handling of CR/LF),
all that's needed is rebuilding sed from upstream source?

That is:

    wget https://ftpmirror.gnu.org/sed/sed-4.4.tar.xz
    tar -xf sed-4.4.tar.xz
    cd sed-4.4
    ./configure
    make
    sudo make install

And then the sed binary will handle CR/LF transparently?
(and will also have the old "-b/--binary" flag to disable
automatic CR/LF handling) ?

I'm asking this to help find an easy work-around in case
other cygwin users want to revert to the "old ways".
(also useful for us upstream to know about this cygwin change,
in case we get more bug reports).


Thanks,
 - assaf




Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Thu, 11 May 2017 20:30:03 GMT) Full text and rfc822 format available.

Message #24 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Assaf Gordon <assafgordon <at> gmail.com>, Dick Dunbar <dick.dunbar <at> gmail.com>
Cc: 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 15:29:13 -0500
[Message part 1 (text/plain, inline)]
On 05/11/2017 03:13 PM, Assaf Gordon wrote:
> Hello all,
> 
> 
> Eric,
> Just to verify (since I'm not very familiar with cygwin, nor with the
> recent changes in sed/cygwin changes):
> 
> If one wants the old sed behavior on cygwin (automatic
> handling of CR/LF),
> all that's needed is rebuilding sed from upstream source?
> 
> That is:
> 
>     wget https://ftpmirror.gnu.org/sed/sed-4.4.tar.xz
>     tar -xf sed-4.4.tar.xz
>     cd sed-4.4
>     ./configure
>     make
>     sudo make install
> 
> And then the sed binary will handle CR/LF transparently?
> (and will also have the old "-b/--binary" flag to disable
> automatic CR/LF handling) ?

No. If you want to FORCE sed to treat input as text (to transparently
ignore CR), you have to make a tweak to the source code (either to link
with cygwin's textmode.o that turns text mode on EVERYWHERE, or to add
freopen("rt") or setmode(O_TEXT) calls in appropriate places).

The default upstream behavior has ALWAYS been to handle files in native
mode (ie. open("r") - where the choice of text or binary is determined
by the file system).  Downstream Cygwin sed USED to have a patch that
overrode upstream behavior to do freopen(NULL, "rt", stdin) - which is
not portable outside of Cygwin, but which on Cygwin is defined to
forcefully reopen a file in text mode, even if it was originally in
binary mode.  If your file is already in text mode, the downstream patch
made no difference; but if your file was in binary mode, the downstream
patch forcefully corrupted your data by eating \r.

With the Cygwin build 4.4-1 of sed, that downstream patch was eliminated
(along with corresponding downstream hacks in awk and grep, as well as
an upstream simplification in grep made possible now that downstream was
no longer forcing text mode, http://bugs.gnu.org/25707), so that all
three tools reliably treated binary files as binary, and your choice of
text vs. binary mount was honored by using open("r") (rather than
open("rb") which forces binary or open("rt") which is non-POSIX but on
cygwin forces text).

The drawback is that not all input is on a file system - if your input
comes through a pipeline, you can't set the mount mode of a pipeline,
and cygwin assumes all pipes are in binary mode.  But in those cases,
you can always modify your pipeline to inject another filter to eat the
\r before handing the data to sed.
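
A sketch of such a pipeline, assuming the producer writes CRLF-terminated
text. emit_names is a hypothetical stand-in for whatever program
generates the file names:

```shell
# emit_names is a hypothetical stand-in for the program that produces
# CRLF-terminated output on a pipe.
emit_names() {
    printf 'C:\\Scan\\i .2\r\nC:\\Scan\\j .3\r\n'
}

# tr deletes every carriage return in the stream, not only those that
# precede \n, so this suits text data rather than arbitrary binary data.
emit_names | tr -d '\r' | sed -e "s/^/'/" -e "s/\$/'/"
```

Because the filter is a separate stage, the decision to discard \r stays
visible in the pipeline rather than hidden inside sed.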

> 
> I'm asking this to help find an easy work-around in case
> other cygwin users want revert to the "old ways".
> (also useful for us upstream to know about this cygwin change,
> in case we get more bug reports).

The source code with the downstream patches for the older cygwin builds
is still available; in fact, the EASIEST thing might be to tell
disgruntled cygwin users to google for "Cygwin time machine" and install
grep, sed, and awk from Jan 2017 (pre-dating the Feb 2017 switch in
behavior), as the patch you would apply to upstream sources is already
built into those older downstream binaries.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Thu, 11 May 2017 21:22:01 GMT) Full text and rfc822 format available.

Message #27 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 14:21:15 -0700
[Message part 1 (text/plain, inline)]
Here's another disconnect from that sed change.

- vi t.out
  A --- go to end of line

So if vi can find the end of line, why can't sed?
This is really incompatible and drives experienced developers crazy.

You commented that $fn was a string that contained no end-of-line chars.

And yet, sed did the right thing here and found the end of ascii text.

It is that kind of experimentation that shows sed is not doing the right
thing.

So a little change like that in order to make linux people feel better
is the wrong place to put a fix to make them compatible.

sed $ should eat \r, \n or end of all ascii text ... on both platforms.

My development experience ranges over AIX, Solaris,  HPUX, Linux in various
distros, Win

I once sat at a cygwin terminal session and wondered why an AIX
command wasn't found.  Scratch-head, Check $PATH, etc.

So the astonishing thing to me at the time was that cygwin had done such
a good job of masking differences among unix-variants and Windows that
I completely forgot I was on cygwin.

This is  a good thing.



Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Thu, 11 May 2017 22:01:02 GMT) Full text and rfc822 format available.

Message #30 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Eric Blake <eblake <at> redhat.com>, Dick Dunbar <dick.dunbar <at> gmail.com>
Cc: 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 17:59:57 -0400
On 05/11/2017 04:29 PM, Eric Blake wrote:
> On 05/11/2017 03:13 PM, Assaf Gordon wrote:
>> If one wants the old sed behavior on cygwin (automatic
>> handling of CR/LF),
>> all that's needed is rebuilding sed from upstream source?
> 
> No. [...]
> The default upstream behavior has ALWAYS been to handle files in native
> mode (ie. open("r") - where the choice of text or binary is determined
> by the file system).  Downstream Cygwin sed USED to have a patch that
> overrode upstream behavior to do freopen(NULL, "rt", stdin)

I see. Thanks for explaining.

So the only systems where 'sed' does automatically strip CR/LF
are MingW/MSVC/MSDOS builds (and only there the "-b/--binary" option
makes a difference) ?

If so,
should we remove the "#ifdef __CYGWIN__" from sed's source code
since it now behaves exactly like gnu/linux ?
e.g.
https://git.savannah.gnu.org/cgit/sed.git/tree/sed/sed.c#n151
https://git.savannah.gnu.org/cgit/sed.git/tree/sed/execute.c#n560



> The drawback is that not all input is on a file system - if your input
> comes through a pipeline, you can't set the mount mode of a pipeline,
> and cygwin assumes all pipes are in binary mode.  But in those cases,
> you can always modify your pipeline to inject another filter to eat the
> \r before handing the data to sed.

To summarize, IIUC:
If someone uses new (post feb-2017) cygwin exclusively -
everything should "just work" and files have only '\n' line endings.

Line-ending problems will occur if someone mixes old/new cygwin
tools or files (e.g. files created on old cygwin and used with newer
cygwin programs),
or
if mixing cygwin/non-cygwin tools.

Correct?

Thanks,
 - assaf


Out of curiosity (if anyone knows):
What does "Windows Subsystem For Linux" do with line-endings?






Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Thu, 11 May 2017 22:08:01 GMT) Full text and rfc822 format available.

Message #33 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: Eric Blake <eblake <at> redhat.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 15:07:04 -0700
[Message part 1 (text/plain, inline)]
On Thu, May 11, 2017 at 2:59 PM, Assaf Gordon <assafgordon <at> gmail.com> wrote:

>
> What does "Windows Subsystem For Linux" do with line-endings?



It is a linux kernel, ported to native Windows api's.

I had it installed briefly, but threw it out because they required you
to turn off UAC globally for the machine.   It's beta.
They'll eventually figure out how to make bash operate like a real
windows trusted executable.  I have cygwin;  I can wait.

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Thu, 11 May 2017 22:40:01 GMT) Full text and rfc822 format available.

Message #36 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: Eric Blake <eblake <at> redhat.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 15:39:11 -0700
[Message part 1 (text/plain, inline)]
To round out this discussion:
I wanted a simple filter to ensure filename paths didn't contain spaces.

For example:
  find /foo -maxdepth 1 -atime +366 -print0 |
     xargs -0 sh -c 'mv "$@" /archive' move

So why are there different flags to indicate null-terminated lines?
  find -print0
  xargs -0
  sed  -z

Seems silly.  To make a non-breaking-code-change,
why not add "-z" to the find and xargs command so they are compatible?

And ... because we're dealing with the same issue of executables
creating stream data, why doesn't sed/awk/grep have an option
to deal with null delimited lines such that "$" would find them.

Or will it just work ... as my example of echoing $fn does where
sed finds the end-of-line ( by length, or because no more ascii chars ).

Having sed recognize \r, \n, \0 as end of line might cause some
breakage if you have to deal with data that has embedded nulls.
So it might require a sed flag ( -r0 ) to signal intent.

Had to check:
find . -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/"

Doesn't work.  One very long string of null terminated filenames is
returned.

So we now know that sed does not check for \0 as a line terminator.
And the sed -z flag produces the same long string.

find . -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/"
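
One possible explanation for the "same long string", offered as a sketch
rather than a diagnosis: with -z, GNU sed splits its input on NUL, so the
quotes may well be added around each name, but the terminal does not
display the NUL separators. Converting them to newlines for display
makes this easier to check. quote_nul is a hypothetical helper name, and
the two fake names stand in for the find output:

```shell
# quote_nul is a hypothetical helper: with -z, sed splits records on
# NUL instead of newline, so ^ and $ anchor to each NUL-terminated name.
quote_nul() {
    sed -z -e "s/^/'/" -e "s/\$/'/"
}

# Two fake file names stand in for the find -print0 output.
# tr converts the NUL separators to newlines only for display.
printf 'a b\000c d\000' | quote_nul | tr '\0' '\n'
```

If each name comes out quoted on its own line, the -z filter is working
and only the display of the NUL-separated stream was misleading.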







Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Fri, 12 May 2017 02:06:01 GMT) Full text and rfc822 format available.

Message #39 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Dick Dunbar <dick.dunbar <at> gmail.com>
Cc: Eric Blake <eblake <at> redhat.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 22:05:33 -0400
Hello,

> On May 11, 2017, at 18:39, Dick Dunbar <dick.dunbar <at> gmail.com> wrote:
> 
> To round out this discussion:
> I wanted a simple filter to ensure filename paths didn't contain spaces.

There's a nuance here to verify:
Did you want a filter to ensure none of your files have spaces (e.g. detect
if some files do have spaces and then fail),
or
Did you want a robust way to use the 'mv' command (as below), even in
the case of files with spaces ?

If you just wanted to detect files with spaces,
something like this would work:
    find -type f -print0 | grep -qz '[[:space:]]' && echo have-files-with-spaces

If you wanted to print files that have spaces, something like this would
work:
    find -type f -print0 | grep -z '[[:space:]]' | tr '\0' '\n'

> For example:
>   find /foo -maxdepth 1 -atime +366 -print0 |
>      xargs -0 sh -c 'mv "$@" /archive' move

I'm not sure what the purpose of 'move' is in the above command.
But if you wanted to move all the found files to the directory /archive,
even if they have spaces in them, a more efficient way would be:

    find /foo -maxdepth 1 -atime +366 -print0 | \
       xargs -0 --no-run-if-empty mv -t /archive

This GNU extension (-t DEST) works great with find/xargs,
as xargs by default adds the parameters at the end of the command line,
and "-t" means the destination directory is specified first.
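The same pattern can be tried safely in a scratch directory. This is only
a sketch, assuming GNU find, xargs and mv (for -print0, -0,
--no-run-if-empty and -t); the directory and file names are made up for
illustration:

```shell
# Scratch directories; every name below is hypothetical.
work=$(mktemp -d)
mkdir "$work/src" "$work/archive"
touch "$work/src/plain" "$work/src/with space"

# xargs appends the file names after 'mv -t DEST', so the destination
# has to come first; --no-run-if-empty skips mv when find prints nothing.
find "$work/src" -maxdepth 1 -type f -print0 |
    xargs -0 --no-run-if-empty mv -t "$work/archive"

# Count what ended up where.
moved=$(ls "$work/archive" | wc -l)
left=$(ls "$work/src" | wc -l)
```

Both files, including the one with a space in its name, should end up
in the archive directory.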

> So why are there different flags to indicate null-terminated lines?
>   find -print0
>   xargs -0
>   sed  -z
> 
> Seems silly.  To make a non-breaking-code-change,
> why not add "-z" to the find and xargs command so they are compatible?

Putting aside the naming convention for a moment (remember that each program
is developed by different people) - I'll focus on find/xargs, which are part
of the same package (findutils) and developed by the same people.

These two are designed to work closely together - that's why they have
"-print0" and "-0".

The whole point of the following construct:

   find [criteria] -print0 | xargs -0 ANY-PROGRAM

Is that 'ANY-PROGRAM' doesn't need to understand NUL line endings at all.
The main reason find and xargs need the NULs is to ensure
file names are not broken by whitespace or even newlines. But once xargs reads
the entire filename, it passes each filename as a single parameter to ANY-PROGRAM,
and so there's no need to worry any more about filenames with whitespace.

This useful construct breaks down if ANY-PROGRAM is 'sh', which might
do further parameter splitting based on whitespace.
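A rough way to see the "single parameter" point is to count the arguments
the program receives. This sketch assumes GNU find and xargs and a
temporary directory whose path contains no blanks; 'count' is just a
placeholder that fills $0 for 'sh -c':

```shell
work=$(mktemp -d)
touch "$work/a" "$work/c d"

# NUL-delimited: each file name reaches the program as one argument.
nul_args=$(find "$work" -type f -print0 | xargs -0 sh -c 'echo "$#"' count)

# Newline-delimited with default xargs: the name 'c d' is split at the blank.
split_args=$(find "$work" -type f | xargs sh -c 'echo "$#"' count)
```

With two files present, the first pipeline should report 2 arguments and
the second 3.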

> And ... because we're dealing with the same issue of executables
> creating stream data, why doesn't sed/awk/grep have an option
> to deal with null delimited lines such that "$" would find them.

I'm not sure I understand: sed and grep have "-z" exactly for this purpose.
(also: sort -z , perl -0).
gawk has a slightly different syntax, where you simply set the RS (input
record separator) to NUL:

   find -type f -print0 | gawk -vRS="\0" -vORS="\n" '{ print "file = " $0 }'

But remember that when you use 'sed -z', the output also uses NULs as line-terminators,
so it won't look good on the terminal or in a file.
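To illustrate, here is a small sketch (assuming GNU sed for -z) that feeds
three NUL-terminated records through 'sed -z', counts the NULs that come
out, and converts them to newlines only at the very end:

```shell
# Three NUL-terminated records, shaped like 'find -print0' output.
records() { printf 'a\0b\0c d\0'; }

# The output of 'sed -z' is still NUL-terminated: count the NUL bytes.
nul_count=$(records | sed -z -e "s/^/'/" -e "s/\$/'/" | tr -cd '\0' | wc -c)

# For display, translate the NULs to newlines as the last step.
quoted=$(records | sed -z -e "s/^/'/" -e "s/\$/'/" | tr '\0' '\n')
```

Each record should come out quoted on its own line, with the blank in
'c d' left alone.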

> Having sed recognize \r, \n, \0 as end of line might cause some 
> breakage if you have to deal with data that has embedded nulls.

Instead of thinking in general "data that has embedded nulls",
it'll be easier to consider concrete cases.
Text files do not have embedded nuls (by definition, otherwise they are not
text files). So standard text programs (sed/grep/awk) do not need to deal with NULs
as line separators.

The main use case of having NUL as line separator is precisely with "find -print0".
In this case, either use "xargs -0" and then the actual program doesn't need
to worry about NULs at all, or use the gnu extensions (e.g. 'sed -z' or 'grep -z').

> Had to check:
> find . -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/"
> 
> Doesn't work.  One very long string of null terminated filenames is returned.

It works perfectly:
1. sed without -z treats newlines (\n) as line terminators.
2. 'find -print0' did not generate '\n' character at all.
3. 'sed' read the entire input (i.e. all files separated by NULs),
   treated it as one line, and added quotes at the beginning and the end
   of the entire buffer.
4. NULs were kept as-is, and are printed on your terminal.

Example:
    $ touch a b 'c d'
    $ find -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
      27  2e  2f  61  00  2e  2f  62  00  2e  2f  63  20  64  00  27
       '   .   /   a  \0   .   /   b  \0   .   /   c       d  \0   '

> So we now know that sed does not check for \0 as a line terminator.
> And the sed -z flag produces the same long string.
> 
> find . -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/"

It also produces the correct output:
This time, because of the '-z', sed indeed reads each filename until the NUL,
and adds quotes around each file.
But it also uses NULs as line terminators on the OUTPUT,
so newline characters are not used at all.
Notice that each file is surrounded by quotes, exactly as you've asked:

  $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
    27  2e  2f  61  27  00  27  2e  2f  62  27  00  27  2e  2f  63
     '   .   /   a   '  \0   '   .   /   b   '  \0   '   .   /   c
    20  64  27  00
         d   '  \0

The missing piece is that after you've processed each file using 'sed -z',
if you want to print them to the terminal, you still need to convert NULs to newlines:

  $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | tr '\0' '\n'
  './a'
  './b'
  './c d'

Or, if you wanted to use sed/grep as an intermediate filter between 'find' and 'xargs',
then something like this:

  find [criteria] -print0 | grep -z [REGEX] | xargs -0 ANYPROGRAM
  find [criteria] -print0 | sed -z [REGEX] | xargs -0 ANYPROGRAM
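
A concrete instance of the grep variant, as a sketch only: it assumes GNU
find, grep and xargs, uses made-up file names, and picks 'basename' as a
stand-in for ANYPROGRAM:

```shell
work=$(mktemp -d)
touch "$work/a" "$work/c d" "$work/e f"

# Keep only the names that contain a blank. grep -z reads and writes
# NUL-terminated records, so xargs -0 still sees whole file names.
selected=$(find "$work" -type f -print0 |
    grep -z ' ' |
    xargs -0 -n 1 basename |
    sort)
```

Only 'c d' and 'e f' should be passed on, each as a single argument.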


In most of my examples above, whitespace doesn't actually cause problems -
because sed/grep will not be confused by whitespace and won't break the line
(it is mostly shell argument parsing that will get terribly confused by whitespace,
and also "xargs" with certain parameters).

The real 'kick' is that using NULs allows handling files that have embedded newlines.

Consider the following:

  $ touch a b 'c d' "$(printf 'e\nf')"
  $ ls -log
  total 0
  -rw-r--r-- 1 0 May 12 01:43 a
  -rw-r--r-- 1 0 May 12 01:43 b
  -rw-r--r-- 1 0 May 12 01:43 c d
  -rw-r--r-- 1 0 May 12 01:43 e?f

The last file has an embedded newline, which will mess up a plain 'find | xargs' pipeline:

  ## incorrect output: the 'e\nf' file is broken, 'echo' is executed
  ## wrong number of times with non-existing file names:
  $ find -type f | xargs -I% echo ==%==
  ==./e==
  ==f==
  ==./a==
  ==./b==
  ==./c d==

Using 'xargs -0' will solve it. This output is correct, but perhaps confusing
when displayed on the terminal:

  $ find -type f -print0 | xargs -0 -I% echo ==%==
  ==./e
  f==
  ==./a==
  ==./b==
  ==./c d==

And similarly with 'sed -z':

  $ find -type f -print0 | sed -z -e 's/^/<<</' -e 's/$/>>>/' | tr '\0' '\n'
  <<<./e
  f>>>
  <<<./a>>>
  <<<./b>>>
  <<<./c d>>>



One last tip:
Sometimes you want to find and operate on files based on their content instead
of attributes (e.g. 'grep').

Here too, a file with spaces or newlines will cause troubles:

  $ echo yes  > "$(printf 'hello\nworld')"
  $ ls -log
  total 4
  -rw-r--r-- 1 4 May 12 01:57 hello?world

If you wanted to find all files containing 'yes',
grep alone would print a confusing output:

  $ grep -l yes *
  hello
  world

And using it with "xargs" will fail:

  $ grep -l yes * | xargs -I% echo 'handling file ===%==='
  handling file ===hello===
  handling file ===world===

Grep has a separate option (upper case -Z) to print the matched filenames
with a NUL instead of a newline. This enables correct handling:

  $ grep -lZ yes * | xargs -0 -I% echo 'handling file ===%==='
  handling file ===hello
  world===

And later:

  $ grep -lZ yes * | xargs -0 mv -t /destination



Hope this helps,
regards,
 - assaf







Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Fri, 12 May 2017 09:18:01 GMT) Full text and rfc822 format available.

Message #42 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: Eric Blake <eblake <at> redhat.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 02:17:39 -0700
[Message part 1 (text/plain, inline)]
There is nothing tricky about this sed filter.
I want to render filenames emitted by a program ( not find) in single
quotes so that no special characters are interpreted by the shell:
  ( space, $, etc )

The "mv" example was just another type of filter used by different
(known)  cygwin/unix programs.  The current problem remains with sed.
I remain mystified why the semantics of "$" ( end of line ) was changed,
and still cannot imagine any program that would benefit from such a change.

Yes I understood Eric's explanation that it was to make Linux users
more comfortable.  What was wrong with the previous sed implementation?

Linux users have to deal with Windows files containing crlf line-end
all the time.  What, exactly, is the problem you were trying to solve?
The cygwin definition should work fine on Posix systems.
In all my cross-platform experience, I never had to question sed's
definition of "$".   It just worked.

Sorry for expanding the conversation to flags in other pgms;  that's a
separate discussion.


[Message part 2 (text/html, inline)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Fri, 12 May 2017 09:27:02 GMT) Full text and rfc822 format available.

Message #45 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: Eric Blake <eblake <at> redhat.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 02:26:18 -0700
[Message part 1 (text/plain, inline)]
It is still unexplained how sed correctly finds the end-of-line
when there are no control characters at all.  ( \r, \n )

In the original sedtest.sh script I posted,
   fn

Running that script again ... when there are two additional blank
characters at the end of $fn, produces the desired result.
The single quote follows the "2", even though there are 2 blanks at the end of
the string.

fn="C:\Scan\i .2  "

$ ./sedtest.sh
1. simple string works
 C:\Scan\i .2
'C:\Scan\i .2'
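
One possible reading, sketched with a made-up value rather than the
original path and assuming the default IFS: an unquoted $fn is split into
words by the shell before echo runs, so the trailing blanks are gone by
then, and echo itself appends the \n that sed treats as the end of line.

```shell
fn='a b  '   # hypothetical stand-in: a value ending in two blanks

# Unquoted: the shell splits $fn into words and echo rejoins them with
# single spaces, so the trailing blanks never reach the pipe.
unquoted=$(echo $fn | od -An -tx1 | tr -d ' \n')

# Quoted: the value is passed through byte for byte.
quoted=$(echo "$fn" | od -An -tx1 | tr -d ' \n')
```

Both dumps end in 0a, the newline added by echo; only the quoted form
keeps the two 20 bytes in front of it.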


[Message part 2 (text/html, inline)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Fri, 12 May 2017 19:31:01 GMT) Full text and rfc822 format available.

Message #48 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: Eric Blake <eblake <at> redhat.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 12:30:29 -0700
[Message part 1 (text/plain, inline)]
Hi Assaf and Eric,
Thanks for your remarks.  Very thoughtful and helpful.

1. I hadn't realized sed had a -z option.  Here's how I used it:
   find -print0 | sed -ze "s/^/'/" -e "s/\$/'\n/"

2. Rather than fighting with sed behaviour, it's just easier to use Eric's
    suggestion to strip the \r in a separate stage.  But this doesn't do that.
    It replaces \r with a null character followed by \n

    $ cat t.out | tr -d '\r' | od -xc
     0000000    3a43    535c    6163    5c6e    2069    322e    000a
                   C   :   \   S   c   a   n   \   i       .   2  \n

    And running a stream through d2u will cause the entire pipe to
    stall until eof on the stream.
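
    The 000a word in the dump above may be an artifact of 'od -x' rather
    than of 'tr': -x prints 16-bit words, and an odd number of input bytes
    is padded with a zero byte for display. A byte-wise dump, sketched here
    with the same sample line, should show no NUL:

```shell
# The sample line from t.out, terminated by CR LF.
line='C:\Scan\i .2'

# Strip the CR, then dump one byte at a time instead of in 16-bit words.
bytes=$(printf '%s\r\n' "$line" | tr -d '\r' | od -An -tx1 | tr -d ' \n')

# 12 characters plus the LF leaves an odd byte count.
count=$(printf '%s\r\n' "$line" | tr -d '\r' | wc -c)
```

    The 13 bytes end in 32 0a, that is "2" followed directly by \n.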

3. Eric: the discussion of binary file open confused me.
    Does sed default to binary open?  How would you suggest I
    fix this in user-land?

I don't really understand what  'info sed' is saying because sed
can operate on a stream -or- a file.  It's not just mixing Win programs
and cygwin programs that causes problems.  It is very common to
get files from multiple platforms.

Editing a 'sh' script with notepad will definitely ruin your day:

#!/bin/bash \r\n

vim identifies an edited file as "dos" if it encounters one.

'-b'
'--binary'
     This option is available on every platform, but is only effective
     where the operating system makes a distinction between text files
     and binary files.  When such a distinction is made--as is the case
     for MS-DOS, Windows, Cygwin--text files are composed of lines
     separated by a carriage return _and_ a line feed character, and
     'sed' does not see the ending CR. When this option is specified,
     'sed' will open input files in binary mode, thus not requesting
     this special processing and considering lines to end at a line feed.
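
A sketch of what "lines end at a line feed" means for the quoting filter,
assuming GNU sed on a system that performs no text-mode translation (the
\r escape in the regex is a GNU extension):

```shell
# One CRLF-terminated line, like the t.out sample in this report.
crlf() { printf '%s\r\n' 'C:\Scan\i .2'; }

# LF is the only terminator: the CR stays part of the line, so the
# closing quote lands after it.
raw=$(crlf | sed -e "s/^/'/" -e "s/\$/'/" | od -An -tx1 | tr -d ' \n')

# Dropping a trailing CR first puts the quote right after the "2".
fixed=$(crlf | sed -e 's/\r$//' -e "s/^/'/" -e "s/\$/'/")
```

In the first dump the bytes 0d 27 0a show the quote sitting behind the
carriage return, which is why a terminal draws it at the start of the line.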

Throughout the sed documentation, '\n' is called "new line".
In the --binary description it is correctly called "line feed".  ( CR+LF)

https://en.wikipedia.org/wiki/Newline

Several years ago, I switched from the O'Reilly sed/awk book.
I like this one:  http://www.thegeekstuff.com/sed-awk-101-hacks-ebook

I'm not sure I would have ever picked up on this cygwin change
by reading release notes or info sed.

It doesn't hurt until it bites you.

-Cheers guys;  thanks for being friendly



On Thu, May 11, 2017 at 7:05 PM, Assaf Gordon <assafgordon <at> gmail.com> wrote:

> Hello,
>
> > On May 11, 2017, at 18:39, Dick Dunbar <dick.dunbar <at> gmail.com> wrote:
> >
> > To round out this discussion:
> > I wanted a simple filter to ensure filename paths didn't contain spaces.
>
> There's a nuance here to verify:
> Did you want a filter to ensure none of your files have spaces (e.g. detect
> if some files do have spaces and then fail),
> or
> Did you want a robust way to use the 'mv' command (as below), even in
> the case of files with spaces ?
>
> If you just wanted to detect files with spaces,
> something like this would work:
>     find -type f -print0 | grep -qz '[[:space:]]' && echo have-files-with-spaces
>
> If you wanted to print files that have spaces, something like this would
> work:
>     find -type f -print0 | grep -z '[[:space:]]' | tr '\0' '\n'
>
> > For example:
> >   find /foo -maxdepth 1 -atime +366 -print0 |
> >      xargs -0 sh -c 'mv "$@" /archive' move
>
> I'm not sure what the purpose of 'move' is in the above command.
> But if you wanted to move all the found files to the directory /archive,
> even if they have spaces in them, a more efficient way would be:
>
>     find /foo -maxdepth 1 -atime +366 -print0 | \
>        xargs -0 --no-run-if-empty mv -t /archive
>
> This GNU extension (-t DEST) works great with find/xargs,
> as xargs by default adds the parameters at the end of the command line,
> and "-t" means the destination directory is specified first.
>
> > So why are there different flags to indicate null-terminated lines?
> >   find -print0
> >   xargs -0
> >   sed  -z
> >
> > Seems silly.  To make a non-breaking-code-change,
> > why not add "-z" to the find and xargs command so they are compatible?
>
> Putting aside the naming convention for a moment (remember that each
> program is developed by different people), I'll focus on find/xargs,
> which are part of the same package (findutils) and developed by the
> same people.
>
> These two are designed to work closely together - that's why they have
> "-print0" and "-0".
>
> The whole point of the following construct:
>
>    find [criteria] -print0 | xargs -0 ANY-PROGRAM
>
> Is that 'ANY-PROGRAM' doesn't need to understand NUL-line-endings at all.
> The main reason find and xargs need the NULs is to ensure
> file names are not broken by whitespace or even newlines. But once xargs
> reads the entire filename, it passes each filename as a single parameter
> to ANY-PROGRAM, and so there's no need to worry any more about filenames
> with whitespace.
>
> This useful construct breaks down if ANY-PROGRAM is 'sh', which might
> do further parameter splitting based on whitespace.
>
> > And ... because we're dealing with the same issue of executables
> > creating stream data, why doesn't sed/awk/grep have an option
> > to deal with null delimited lines such that "$" would find them.
>
> I'm not sure I understand: sed and grep have "-z" exactly for this purpose.
> (also: sort -z , perl -0).
> gawk has a slightly different syntax, where you simply set RS (the input
> record separator) to NUL:
>
>    find -type f -print0 | gawk -vRS="\0" -vORS="\n" '{ print "file = " $0 }'
>
> But remember that when you use 'sed -z', the output also uses NULs as
> line-terminators,
> so it won't look good on the terminal or in a file.
>
> > Having sed recognize \r, \n, \0 as end of line might cause some
> > breakage if you have to deal with data that has embedded nulls.
>
> Instead of thinking in general "data that has embedded nulls",
> it'll be easier to consider concrete cases.
> Text files do not have embedded nuls (by definition, otherwise they are not
> text files). So standard text programs (sed/grep/awk) do not need to deal
> with NULs
> as line separators.
>
> The main use case of having NUL as line separator is precisely with "find
> -print0".
> In this case, either use "xargs -0" and then the actual program doesn't
> need
> to worry about NULs at all, or use the gnu extensions (e.g. 'sed -z' or
> 'grep -z').
>
> > Had to check:
> > find . -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/"
> >
> > Doesn't work.  One very long string of null terminated filenames is
> returned.
>
> It works perfectly:
> 1. sed without -z treats newlines (\n) as line terminators.
> 2. 'find -print0' did not generate '\n' character at all.
> 3. 'sed' read the entire input (i.e. all files separated by NULs),
>    treated it as one line, and added quotes at the beginning and the end
>    of the entire buffer.
> 4. NULs were kept as-is, and are printed on your terminal.
>
> Example:
>     $ touch a b 'c d'
>     $ find -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
>       27  2e  2f  61  00  2e  2f  62  00  2e  2f  63  20  64  00  27
>        '   .   /   a  \0   .   /   b  \0   .   /   c       d  \0   '
>
> > So we now know that sed does not check for \0 as a line terminator.
> > And the sed -z flag produces the same long string.
> >
> > find . -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/"
>
> It also produces the correct output:
> This time, because of the '-z', sed indeed reads each filename until the
> NUL,
> and adds quotes around each file.
> But it also uses NULs as line terminators on the OUTPUT,
> so newline characters are not used at all.
> Notice that each file is surrounded by quotes, exactly as you've asked:
>
>   $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
>     27  2e  2f  61  27  00  27  2e  2f  62  27  00  27  2e  2f  63
>      '   .   /   a   '  \0   '   .   /   b   '  \0   '   .   /   c
>     20  64  27  00
>          d   '  \0
>
> The missing piece is that after you've processed each file using 'sed -z',
> if you want to print them to the terminal, you still need to convert NULs
> to newlines:
>
>   $ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | tr '\0' '\n'
>   './a'
>   './b'
>   './c d'
>
> Or, if you wanted to use sed/grep as an intermediate filter between
> 'find' and 'xargs',
> then something like this:
>
>   find [criteria] -print0 | grep -z [REGEX] | xargs -0 ANYPROGRAM
>   find [criteria] -print0 | sed -z [REGEX] | xargs -0 ANYPROGRAM
>
>
> In most of my examples above, whitespace doesn't actually cause problems,
> because sed/grep will not be confused by whitespace and won't break the
> line (it is mostly shell argument parsing that will get terribly confused
> by whitespace, and also "xargs" with certain parameters).
>
> The real 'kick' is that using NULs allows handling files that have
> embedded newlines.
>
> Consider the following:
>
>   $ touch a b 'c d' "$(printf 'e\nf')"
>   $ ls -log
>   total 0
>   -rw-r--r-- 1 0 May 12 01:43 a
>   -rw-r--r-- 1 0 May 12 01:43 b
>   -rw-r--r-- 1 0 May 12 01:43 c d
>   -rw-r--r-- 1 0 May 12 01:43 e?f
>
> The last file has an embedded newline, which will mess up 'find | xargs':
>
>   ## incorrect output: the 'e\nf' file is broken, 'echo' is executed
>   ## wrong number of times with non-existing file names:
>   $ find -type f | xargs -I% echo ==%==
>   ==./e==
>   ==f==
>   ==./a==
>   ==./b==
>   ==./c d==
>
> Using 'xargs -0' will solve it. This output is correct, but perhaps
> confusing
> when displayed on the terminal:
>
>   $ find -type f -print0 | xargs -0 -I% echo ==%==
>   ==./e
>   f==
>   ==./a==
>   ==./b==
>   ==./c d==
>
> And similarly with 'sed -z':
>
>   $ find -type f -print0 | sed -z -e 's/^/<<</' -e 's/$/>>>/' | tr '\0' '\n'
>   <<<./e
>   f>>>
>   <<<./a>>>
>   <<<./b>>>
>   <<<./c d>>>
>
>
>
> One last tip:
> Sometimes you want to find and operate on files based on their content
> instead of attributes (e.g. with 'grep').
>
> Here too, a file with spaces or newlines will cause troubles:
>
>   $ echo yes  > "$(printf 'hello\nworld')"
>   $ ls -log
>   total 4
>   -rw-r--r-- 1 4 May 12 01:57 hello?world
>
> If you wanted to find all files containing 'yes',
> grep alone would print a confusing output:
>
>   $ grep -l yes *
>   hello
>   world
>
> And using it with "xargs" will fail:
>
>   $ grep -l yes * | xargs -I% echo 'handling file ===%==='
>   handling file ===hello===
>   handling file ===world===
>
> Grep has a separate option (upper case -Z) to print the matched filenames
> with a NUL instead of a newline. This enables correct handling:
>
>   $ grep -lZ yes * | xargs -0 -I% echo 'handling file ===%==='
>   handling file ===hello
>   world===
>
> And later:
>
>   $ grep -lZ yes * | xargs -0 mv -t /destination
>
>
>
> Hope this helps,
> regards,
>  - assaf
>
>
>
>
[Message part 2 (text/html, inline)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Fri, 12 May 2017 20:49:01 GMT) Full text and rfc822 format available.

Message #51 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Assaf Gordon <assafgordon <at> gmail.com>, Dick Dunbar <dick.dunbar <at> gmail.com>
Cc: 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 15:47:56 -0500
[Message part 1 (text/plain, inline)]
On 05/11/2017 04:59 PM, Assaf Gordon wrote:
> On 05/11/2017 04:29 PM, Eric Blake wrote:
>> On 05/11/2017 03:13 PM, Assaf Gordon wrote:
>>> If one wants the old sed behavior on cygwin (automatic
>>> handling of CR/LF),
>>> all that's needed is rebuilding sed from upstream source?
>>
>> No. [...]
>> The default upstream behavior has ALWAYS been to handle files in native
>> mode (ie. open("r") - where the choice of text or binary is determined
>> by the file system).  Downstream Cygwin sed USED to have a patch that
>> overrode upstream behavior to do freopen(NULL, "rt", stdin)
> 
> I see. Thanks for explaining.
> 
> So the only systems where 'sed' does automatically strip CR/LF
> are MingW/MSVC/MSDOS builds (and only there the "-b/--binary" option
> makes a difference) ?

Upstream sed doesn't ever actively strip CR by itself. Rather, opening a
file in default mode (that is, open("r")), on a system where that mode
resolves to text mode (Cygwin depending on your mount options, and I
think on mingw/MSVC by default) causes CR to be stripped by the
underlying libc.  Using -b/--binary causes sed to use open("rb") instead
of open("r"), at which point you tell libc to use binary mode no matter
what.

Downstream sed on Cygwin used to add a patch to use open("rt") to force
text mode unless you used 'sed -b'; that downstream patch was removed in
Feb 2017.

> 
> If so,
> should we remove the "#ifdef __CYGWIN__" from sed's source code
> since it now behaves exactly like gnu/linux ?

Cygwin behaves like gnu/linux if you use binary mount points. But Cygwin
still supports text mount points, and therefore 'sed -b' is still useful
on Cygwin, and therefore I don't think the #ifdef __CYGWIN__ should be
removed from sed's source code.  (I do, however, think the #ifdef could
be rewritten to '#if O_BINARY', because a non-zero O_BINARY is a more
reliable indicator of the platforms where binary-vs-text actually
matters, without having to be a long list of specific platforms)


> To summarize, IIUC:
> If someone uses new (post feb-2017) cygwin exclusively -
> everything should "just work" and files have only '\n' line endings.

If you manage your data solely through Cygwin programs, then your data
should only have \n line endings, so sed should "just work".  But if you
intermix cygwin programs with data from other sources (a trivial example
being the Windows command shell, whose builtin 'echo' uses \r\n line
endings), then cygwin's default of treating pipe input as binary coupled
with the native windows' application default of generating output as
text means that sed will act strangely unless you rewrite your pipeline
to filter out the \r from the native app before feeding it to sed.

> 
> Line-Ending problems will occur if someone mixes old/new cygwin
> tools or files (e.g. files created on old cygwin and used with newer
> cygwin programs),

No, cygwin has favored binary files for years now, unless you
specifically configure for a text mount. (Text mounts used to be very
easy to set up by running cygwin's setup.exe, but we removed that
functionality at least 10 years ago because it caused more problems than
it solved, so setting up a text mode mount is now a lot more involved).
So mixing data created with old cygwin with sed from new cygwin is
unlikely to cause problems if you never changed defaults, because the
defaults have been to produce data with \n endings for years now.

> or
> if mixing cygwin/non-cygwin tools.

Correct (I just repeated that above, before reading below).


> Out of curiosity (if anyone knows):
> What does "Windows Subsystem For Linux" do with line-endings?

I have not played with it yet, but my gut feel: \n endings only.  It is
emulating Linux system calls and executing actual Linux userspace
programs (where text mode does not exist).  open("rt") is thus an error
(since glibc does not support it).  But note that Windows Subsystem For
Linux is a _distinct_ subsystem (think of it more like a virtual
machine) - you CANNOT make it directly interact with native windows
programs (can't pipe data from one subsystem to another); they can only
see a common filesystem.  So Cygwin still has a niche (where you have a
program specifically compiled to Windows API using the cygwin1.dll, and
therefore operating in your normal windows subsystem).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Fri, 12 May 2017 21:13:01 GMT) Full text and rfc822 format available.

Message #54 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Dick Dunbar <dick.dunbar <at> gmail.com>, Assaf Gordon <assafgordon <at> gmail.com>
Cc: 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 16:12:38 -0500
[Message part 1 (text/plain, inline)]
On 05/12/2017 02:30 PM, Dick Dunbar wrote:
> Hi Assaf and Eric,
> Thanks for your remarks.  Very thoughtful and helpful.
> 
> 1. I hadn't realized sed had a -z option.  Here's how I used it:
>    find -print0 | sed -ze "s/^/'/" -e "s/\$/'\n/"
> 
> 2. Rather than fighting with sed behaviour, it's just easier to use Eric's
> suggestion
>     to strip the \r in a separate stage.  But this doesn't do that.  It
> replaces \r
>     with a null character followed by \n
> 
>     $ cat t.out | tr -d '\r' | od -xc

od -xc is not the nicest format; I prefer od -tx1z.  And for good reason:

>      0000000    3a43    535c    6163    5c6e    2069    322e    000a
>                    C   :   \   S   c   a   n   \   i       .   2  \n

Note that this includes things like '6163' corresponding to 'c' 'a',
which (if you think about it) looks backwards.  It really means that od
-x defaults to printing 2 bytes at a time, and based on your machine's
endianness, those bytes appear in swapped endian format compared to the
per-character display from -c.

You are misreading the output to assume that the final 000a means that
tr inserted a \0 in place of the \r.  What REALLY happened was that tr
DID delete the \r, but now od has an odd number of bytes, and has to PAD
the final -x output (to a 2-byte boundary) by adding \0 as the padding
for display purposes, and the padding happens to appear in the same
position where you used to see the \r character (but note that the \r\n
appeared as 0a0d, if you omit tr from the pipeline).  But if you use od
correctly, you will see that tr is indeed stripping \r after all.
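
To make the padding effect concrete, here is a minimal sketch, assuming GNU
od and tr and a temporary file standing in for t.out (the variable names are
hypothetical). It dumps the same CRLF-terminated line one byte at a time with
-tx1, before and after tr -d '\r':

```shell
# Hypothetical stand-in for t.out: a single CRLF-terminated line.
sample=$(mktemp)
printf 'C:\\Scan\\i .2\r\n' > "$sample"

# -tx1 prints one byte per column, so there is no 2-byte grouping and no padding.
before=$(od -An -tx1 "$sample")
after=$(tr -d '\r' < "$sample" | od -An -tx1)

# Byte counts: 14 with the CR, 13 once tr has removed it.
bytes_before=$(wc -c < "$sample")
bytes_after=$(tr -d '\r' < "$sample" | wc -c)

echo "before:$before ($bytes_before bytes)"
echo "after: $after ($bytes_after bytes)"
rm -f "$sample"
```

With the CR present the dump ends in 0d 0a; after tr it ends in 0a alone,
and no 00 byte appears anywhere in the stream.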

> 
>     And running a stream through d2u will cause the entire pipe to
>     stall until eof on the stream.

That merely means that d2u is not the best filter. I didn't say it was
the only viable filter (and I'm sure that upstream d2u maintainers would
welcome a patch to make it not stall pipelines - but that's a topic for
that list rather than this one).

> 
> 3. Eric: the discussion of binary file open confused me.
>     Does sed default to binary open?  How would you suggest I
>     fix this in user-land?

sed, and most other cygwin programs designed for text processing,
default to using open("r") semantics (which is text-open if the file
lives on a text mount, and binary-open if the file lives on a binary
mount).  Since pipelines do not live in the file system, there is no
mount point to control whether you want the pipeline to be treated as
text or binary, so cygwin defaults pipelines to behave as if they were
in binary mode.  (At one point, the CYGWIN environment variable had an
option to choose whether all pipelines should be force-text or
force-binary, but I think it got ripped out years ago)

Prior to Feb 2017, SOME cygwin programs (sed included) had
downstream-only patches to FORCE open("rt") semantics on stdin,
including when stdin came from a pipeline.  But forcing this behavior,
while nice for text input coming from a non-cygwin program, was a
data-corruption agent for binary input coming from a cygwin program, and
could not be overridden.  So the decision was to drop the downstream
behavior. Now pipes are treated as binary mode, but YOU can override the
behavior by pre-filtering data before handing it through the pipe to sed.
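
As a minimal sketch of such a pre-filter (assuming tr is an acceptable
filter for your data, and using printf as a stand-in for a native Windows
program that emits \r\n; the names emit, raw and quoted are hypothetical):

```shell
# printf stands in for a non-Cygwin program that writes CRLF line endings.
emit() { printf 'C:\\Scan\\i .2\r\n'; }

# Without a filter, the CR stays in the pattern space and ends up inside the quotes.
raw=$(emit | sed -e "s/^/'/" -e "s/\$/'/")

# Deleting CR first gives sed a plain \n-terminated line.
quoted=$(emit | tr -d '\r' | sed -e "s/^/'/" -e "s/\$/'/")

printf '%s\n' "$quoted"
```

Note that tr -d '\r' removes every CR in the stream, not only those before
a \n, so it is only appropriate when the data should contain no CR at all.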

> 
> I don't really understand what  'info sed' is saying because sed
> can operate on a stream -or- a file.  It's not just mixing Win programs
> and cygwin programs that causes problems.  It is very common to
> get files from multiple platforms.
> 
> Editing a 'sh' script with notepad will definitely ruin your day:
> 
> #!/bin/bash \r\n

Yes, that DOES ruin your day, if you try to execute that script on
Linux, it will fail. So the same script fails on Cygwin, unless you use
cygwin bash's downstream 'igncr' option to tell bash to ignore all \r.
Cygwin's approach is that you should opt in to ignoring \r (default
should be to behave like Linux, and only by doing something explicit can
you make life easier if you are going to be littering your data with \r
that should be ignored).
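
If you would rather not depend on igncr, another explicit opt-in is to strip
the \r before the shell ever sees the script. The following is only a
sketch with hypothetical file names, and it runs the cleaned copy through sh
directly rather than relying on the #! line:

```shell
# Hypothetical script saved with CRLF endings, as Notepad would write it.
crlf_script=$(mktemp)
printf '#!/bin/sh\r\necho hi\r\n' > "$crlf_script"

# Count the carriage returns, then write a copy without them.
cr_count=$(tr -cd '\r' < "$crlf_script" | wc -c)
fixed_script=$(mktemp)
tr -d '\r' < "$crlf_script" > "$fixed_script"

# The cleaned copy runs normally.
out=$(sh "$fixed_script")
echo "removed $cr_count CRs, script printed: $out"
rm -f "$crlf_script" "$fixed_script"
```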

> 
> vim identifies an edited file as "dos" if it encounters one.

That's because vim ALWAYS opens files in binary mode (open("rb") rather
than open("r"), and then reproduces its OWN code to deal with line
endings).  Not every program wants to copy vim's bloat by dealing with
line endings themselves.

https://cygwin.com/cygwin-ug-net/using-textbinary.html is also a useful
resource to read.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Fri, 12 May 2017 21:31:02 GMT) Full text and rfc822 format available.

Message #57 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: Assaf Gordon <assafgordon <at> gmail.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 14:30:52 -0700
[Message part 1 (text/plain, inline)]
On Fri, May 12, 2017 at 2:12 PM, Eric Blake <eblake <at> redhat.com> wrote:

> if you are going to be littering your data with \r
> that should be ignored
>


I'm in the middle of writing another post.   Just want to point out that I
am
not the one who is making the decision to "litter" my data with \r.

Way above my pay grade.

Nor do I have any choice over Mac's third option for line-end chars.
I just have to handle the stuff that is diverted my way  for analysis.

AND I had always depended on cygwin's "least surprise" approach to
handling this stuff.  The compilers have long ago made peace with the
end-of-line-wars ... "if you can  figure it out, so can we"

It's just white space.

Good point about the byte-swapping observation in od output.
I'll have to make an "od" alias to enforce your flags.  I like it.

I posted that because I got really bad results using that example
on my mintty console.   When the lines are diverted to the screen
they print a trailing "#" and did not honor the LF to advance to the next
line.

Perhaps I'll run this again using the 'script' command, even if script does
add ^M at the end of each line.
[Message part 2 (text/html, inline)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Fri, 12 May 2017 21:41:02 GMT) Full text and rfc822 format available.

Message #60 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: Assaf Gordon <assafgordon <at> gmail.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 14:40:01 -0700
[Message part 1 (text/plain, inline)]
> I'm sure that upstream d2u maintainers would
welcome a patch to make it not stall pipelines - but that's a topic for
that list

Or, we could just stay on this list and restore the previous behaviour of
cygwin.
What was so awful about that?   Design decisions that force you to care
are usually not the right ones.  Made for "purity" instead of "usefulness".

The best filter for \r is no filter.

If you want an Opt-In flag to restore previous sed behaviour, I'm  happy to
build a
sed alias to my environment rather than unlearn decades of muscle memory
involved in just getting results.

On Fri, May 12, 2017 at 2:30 PM, Dick Dunbar <dick.dunbar <at> gmail.com> wrote:

>
> On Fri, May 12, 2017 at 2:12 PM, Eric Blake <eblake <at> redhat.com> wrote:
>
>> if you are going to be littering your data with \r
>> that should be ignored
>>
>
>
> I'm in the middle of writing another post.   Just want to point out that I
> am
> not the one who is making the decision to "litter" my data with \r.
>
> Way above my pay grade.
>
> Nor do I have any choice over Mac's third option for line-end chars.
> I just have to handle the stuff that is diverted my way  for analysis.
>
> AND I had always depended on cygwin's "least surprise" approach to
> handling this stuff.  The compilers have long ago made peace with the
> end-of-line-wars ... "if you can  figure it out, so can we"
>
> It's just white space.
>
> Good point about the byte-swapping observation in od output.
> I'll have to make an "od" alias to enforce your flags.  I like it.
>
> I posted that because I got really bad results using that example
> on my mintty console.   When the lines are diverted to the screen
> they print a trailing "#" and did not honor the LF to advance to the next
> line.
>
> Perhaps I'll run this again using the 'script' command, even if script
> does
> add ^M at the end of each line.
>
>
[Message part 2 (text/html, inline)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Fri, 12 May 2017 21:47:01 GMT) Full text and rfc822 format available.

Message #63 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Dick Dunbar <dick.dunbar <at> gmail.com>
Cc: Assaf Gordon <assafgordon <at> gmail.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 16:46:01 -0500
[Message part 1 (text/plain, inline)]
On 05/12/2017 04:40 PM, Dick Dunbar wrote:
>> I'm sure that upstream d2u maintainers would
> welcome a patch to make it not stall pipelines - but that's a topic for
> that list
> 
> Or, we could just stay on this list and restore the previous behaviour of
> cygwin.

But THIS list is not the list that changed cygwin behavior.  You'll want
to take that up with cygwin <at> cygwin.com.

My whole point is that you are asking upstream sed to change due to a
downstream cygwin decision, when it is downstream cygwin that you should
be complaining to.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Fri, 12 May 2017 21:59:02 GMT) Full text and rfc822 format available.

Message #66 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: Assaf Gordon <assafgordon <at> gmail.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 14:58:14 -0700
[Message part 1 (text/plain, inline)]
Oh, I didn't realize this sed wasn't the cygwin choice.

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=25707

Reading through those notes, it appears that the sed changes were
predicated on something called a "text mount".

And you were depending on the OS to " undossify_input".
And DTRT is dependent on the cygwin kernel to perform this service
so that sed/awk/grep/et al. wouldn't have to deal with it.

If so, it appears that cygwin does not have that "strip \r" functionality
and that's why it is failing for me.

How close am I getting to fully  understanding this?

I can imagine there might be quite a lot of "sed consumers" who
will also experience this failure.  And those OS consumers also
have to deal with over-the-wall Windows and Mac files in their
environment.

I never heard of a text/binary mount point that would cause
an operating system to treat text files differently.

Do you have a pointer to some literature that explains that so I can educate
myself?
Long ago and far away, I used to be a kernel developer ... which is not the
same thing as knowing all the complexities of usage ... just how to listen
to customer complaints and make more people happy than are mad at you.

-- Still listening


On Fri, May 12, 2017 at 2:46 PM, Eric Blake <eblake <at> redhat.com> wrote:

> On 05/12/2017 04:40 PM, Dick Dunbar wrote:
> >> I'm sure that upstream d2u maintainers would
> > welcome a patch to make it not stall pipelines - but that's a topic for
> > that list
> >
> > Or, we could just stay on this list and restore the previous behaviour of
> > cygwin.
>
> But THIS list is not the list that changed cygwin behavior.  You'll want
> to take that up with cygwin <at> cygwin.com.
>
> My whole point is that you are asking upstream sed to change due to a
> downstream cygwin decision, when it is downstream cygwin that you should
> be complaining to.
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
>
>
[Message part 2 (text/html, inline)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Sat, 13 May 2017 02:18:02 GMT) Full text and rfc822 format available.

Message #69 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Dick Dunbar <dick.dunbar <at> gmail.com>
Cc: Eric Blake <eblake <at> redhat.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 22:17:09 -0400
Hello,

Replying to few technical topics:

> On May 12, 2017, at 05:17, Dick Dunbar <dick.dunbar <at> gmail.com> wrote:
> 
> I want to render filenames emitted by a program ( not find) in single
> quotes so that no special characters are interpreted by the shell:
>   ( space, $, etc )

This could be a bit more tricky than it seems.

Let's start with an easy case: You know in advance files do not contain
new lines (CR/LF) nor single-quotes.
In that case, the following would work (on Linux):

  find -type f | sed -e "s/^/'/" -e "s/\$/'/"

In your cygwin case, where '\r' might be added at the end-of-line
before the '\n', we can simply discard it:

  find -type f | tr -d '\r' | sed -e "s/^/'/" -e "s/\$/'/"

But there's a problem: single-quote strings in shell can not contain
the single-quote as a character. If you have a file like this:

  touch "a'b"

Then you'll need to specifically escape it (by switching to double-quotes):

  $ touch "a'b" 'c$d' "e f"
  $ find -type f | tr -d '\r' | sed -e "s/'/'\"'\"'/g" -e "s/^/'/" -e "s/\$/'/"
  './e f'
  './a'"'"'b'
  './c$d'

Note that all these examples don't need NUL/-print0/-z because we assume
in advance that no file contains newlines.
I'm also ignoring the extra complications of CRLF vs LF
(and the possibility of a filename actually containing '\r').
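
One way to check that the quoting is correct is to let the shell parse the
result back. This is a sketch only: shquote is a hypothetical helper name
wrapping the sed expressions above, and eval is used purely for the
round-trip check on a string we constructed ourselves:

```shell
# Hypothetical helper: single-quote each input line, escaping embedded single quotes.
shquote() {
  sed -e "s/'/'\"'\"'/g" -e "s/^/'/" -e "s/\$/'/"
}

name="a'b c\$d"                      # contains a quote, a space and a dollar sign
quoted=$(printf '%s\n' "$name" | shquote)

# Let the shell parse the quoted form; it should yield the original name.
eval "roundtrip=$quoted"
printf '%s -> %s\n' "$quoted" "$roundtrip"
```

If roundtrip equals name, the shell treated the space, the '$' and the
embedded quote as plain characters.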

----

At the risk of repeating myself, I'll just mention again that perhaps
it's worth asking *why* you want to protect the file names from special characters?
If the goal is eventually to pass them to some other program,
then consider perhaps your pipeline/script can be reworked
to use 'xargs -0' - which will pass the filenames directly (without shell
involvement) and there will be no problem with special shell characters.

E.g. to invoke a program once per file (The '%' will be replaced with the filename):

  $ find -type f -print0 | xargs -0 -I% echo ==%==
  ==./e f==
  ==./a'b==
  ==./c$d==

----


On May 12, 2017, at 05:26, Dick Dunbar <dick.dunbar <at> gmail.com> wrote:

> It is still unexplained how sed correctly finds the end-of-line
> when there are no control characters at all.  ( \r, \n )

Sed works like so (more-or-less, some technical details omitted for brevity):

1. Sed reads the input until the END-OF-LINE character (\n or NUL).
2. It puts the bytes into something called "pattern space",
   WITHOUT the EOL character.
3. Any operation you perform (e.g. s/$/foo/) is done
   on the pattern-space (which does *not* contain the EOL character).
4. After executing all the sed commands,
   sed prints the content of the pattern space.
   IF the input line had an EOL character (which was removed),
   sed prints the END-OF-LINE character again.

If sed does not encounter the EOL character, it reads until the
end of the input/end-of-file, performs all the sed commands, then prints the content
without adding any EOL characters.
(A side note: input without terminating EOL is not POSIX-compatible.
See here for an interesting discussion about how different sed implementations
deal with lines without EOL: https://bugs.gnu.org/26574 )

To give some concrete examples:

---

printf "aaa\nbbb\nccc\n" | sed 's/[something]//'

   Above, sed uses '\n' as the EOL character. It reads
   3 lines, and performs the operation 's///' on each
   of them (once 'aaa', once 'bbb', once 'ccc').

   The character '\n' (ASCII \x0A) is NEVER stored in the buffer,
   and you can't modify it with 's///'.

---

printf "aaa\n" | sed 's/$/\n/'

  Above, sed reads the line until the '\n' (the content is 'aaa').
  The 's' commands replaces the end of the line with '\n'.
  The buffer (="pattern space" in sed lingo) becomes "aaa\n".
  sed prints it, and ALSO prints another EOL (as it does for every line).
  The result is one additional empty line in the output.


----

printf "aaa" | sed 's/./b/g'

   Above, the input did not contain EOL character ('\n').
   sed reads until the end of the input, performs the operation,
   then prints the output ('bbb') without adding a newline.
   (This is not universal for all sed implementations.)

---

printf "aaa\nbbb\nccc\n" | sed -z 's/$/x/'

   Above, 'sed -z' expects a NUL as EOL character - but there is none
   in the input - so it treats it like the previous example:
   reads the *entire* input, and performs the operation on it.
   The '\n' bytes (ASCII \x0A) have no special meaning in this case:
   sed treats them like any other bytes.
   The output will be: "aaa\nbbb\nccc\nx" .
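
The byte counts of the outputs above can be checked directly. This is a
sketch assuming GNU sed (for -z, for \n in the replacement, and for
preserving a missing final newline); the variable names are only for
illustration:

```shell
# 'aaa' plus the \n appended by s///, plus the EOL sed re-adds: 5 bytes.
n_extra=$(printf 'aaa\n' | sed 's/$/\n/' | wc -c)

# No EOL on input, so GNU sed adds none on output: 3 bytes.
n_noeol=$(printf 'aaa' | sed 's/./b/g' | wc -c)

# With -z and no NUL in the input, the whole input is one "line";
# only the trailing x is added: 12 + 1 = 13 bytes.
n_z=$(printf 'aaa\nbbb\nccc\n' | sed -z 's/$/x/' | wc -c)

echo "$n_extra $n_noeol $n_z"
```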


> On May 12, 2017, at 15:30, Dick Dunbar <dick.dunbar <at> gmail.com> wrote:
> 
> 1. I hadn't realized sed had a -z option.  Here's how I used it:
>    find -print0 | sed -ze "s/^/'/" -e "s/\$/'\n/"


I hope that after the explanation above, you see that this example
won't do what you wanted: the sed command  "s/\$/'\n/"
will replace the end of the buffer (="pattern space") with "'\n",
but AFTER sed prints it, it will ALSO print the EOL character,
which is NUL (because of "-z").

To generalize:
If you use 'sed -z': both input EOL and output EOL will be NUL.
If you don't use "sed -z", both input EOL and output EOL will be '\n'.
You can't easily mix them (i.e. have sed read input EOL as NUL,
but output '\n' EOL).
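
To see this rule in the bytes, here is a small sketch that feeds two
NUL-terminated records through the command quoted above. It assumes GNU sed;
the function name is hypothetical and the input 'a', 'b' is just sample data:

```shell
# Hypothetical demo: each record gets "'\n" appended by s///, and then
# 'sed -z' still terminates the record with NUL on output.
z_quote_bytes() {
    printf 'a\0b\0' \
        | sed -z -e "s/^/'/" -e "s/\$/'\n/" \
        | od -An -tx1 | tr -d ' \n'
}
```

Each record in the hex dump ends in '0a 00': the newline from the
replacement, followed by the NUL that sed adds as output EOL.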

The only common tool I'm familiar with that can
use different EOL characters for input and output is awk, using
something like:

  find -print0 | gawk -vRS="\0" -vORS="\n" '{ print "file = " $0 }'

But I wouldn't recommend it.

Instead, I'd recommend the following:

   find [criteria] -print0 | tr -d '\r' \
       | sed -z 's/SOMETHING//' | tr '\0' '\n'

And a complete command:

  $ touch "a'b" 'c$d' 'e f' "$(printf 'g\nh')"
  $ find -type f -print0 \
         | tr -d '\r' \
         | sed -z -e "s/'/'\"'\"'/g" -e "s/^/'/" -e "s/\$/'/" \
         | tr '\0' '\n'
  './e f'
  './a'"'"'b'
  './g
  h'
  './c$d'


I think the above "should work", but I haven't tested it on cygwin.
(Comments from others are very welcome.)

regards,
 - assaf






Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Sat, 13 May 2017 02:34:02 GMT) Full text and rfc822 format available.

Message #72 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Dick Dunbar <dick.dunbar <at> gmail.com>
Cc: Eric Blake <eblake <at> redhat.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 22:33:44 -0400
> On May 12, 2017, at 22:17, Assaf Gordon <assafgordon <at> gmail.com> wrote:
> 
>  $ touch "a'b" 'c$d' 'e f' "$(printf 'g\nh')"
>  $ find -type f -print0 \
>         | tr -d '\r' \
>         | sed -z -e "s/'/'\"'\"'/g" -e "s/^/'/" -e "s/\$/'/" \
>         | tr '\0' '\n'
>  './e f'
>  './a'"'"'b'
>  './g
>  h'
>  './c$d'

Correcting myself:
there is no need for the "tr -d '\r'" in the above example:
since "find" uses "-print0" - it will NOT print LF or CRLF as line-endings,
and so there's nothing to remove.
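
With that step dropped, the pipeline can be wrapped in a small function.
This is only a sketch, assuming GNU find and GNU sed; the function name and
the directory argument are introduced here for illustration:

```shell
# Hypothetical helper: print every regular file under "$1", single-quoted
# for the shell. Embedded single quotes become '"'"' as in the example above.
quote_found_files() {
    find "$1" -type f -print0 \
        | sed -z -e "s/'/'\"'\"'/g" -e "s/^/'/" -e "s/\$/'/" \
        | tr '\0' '\n'
}
```

Usage would be something like 'quote_found_files .' for the current
directory. The order of the output follows find, so it is not sorted.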

-assaf





Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Sat, 13 May 2017 06:40:01 GMT) Full text and rfc822 format available.

Message #75 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: Eric Blake <eblake <at> redhat.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 23:39:14 -0700
[Message part 1 (text/plain, inline)]
Got it.  Thanks for the examples.  I guess I don't understand why my little
filter worked for so many years on so many platforms that I never gave it
a second thought ... or ever tried to over-think the problem.

I'll go back to MobaXterm environment and retry my code.
If it works ... I'll remove cygwin and never look back.

I'm having what you would call a "crisis of confidence".  :-)

- Cheers

On Fri, May 12, 2017 at 7:33 PM, Assaf Gordon <assafgordon <at> gmail.com> wrote:

>
> > On May 12, 2017, at 22:17, Assaf Gordon <assafgordon <at> gmail.com> wrote:
> >
> >  $ touch "a'b" 'c$d' 'e f' "$(printf 'g\nh')"
> >  $ find -type f -print0 \
> >         | tr -d '\r' \
> >         | sed -z -e "s/'/'\"'\"'/g" -e "s/^/'/" -e "s/\$/'/" \
> >         | tr '\0' '\n'
> >  './e f'
> >  './a'"'"'b'
> >  './g
> >  h'
> >  './c$d'
>
> Correcting myself:
> there is no need for the "tr -d '\r'" in the above example:
> since "find" uses "-print0" - it will NOT print LF or CRLF as line-endings,
> and so there's nothing to remove.
>
> -assaf
>
>
[Message part 2 (text/html, inline)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Mon, 15 May 2017 13:06:01 GMT) Full text and rfc822 format available.

Message #78 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Dick Dunbar <dick.dunbar <at> gmail.com>
Cc: Assaf Gordon <assafgordon <at> gmail.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Sat, 13 May 2017 15:05:59 -0500
[Message part 1 (text/plain, inline)]
On 05/12/2017 04:58 PM, Dick Dunbar wrote:
> Oh, I didn't realize this sed wasn't the cygwin choice.

[please don't top-post on technical lists]

What do you mean by "this sed wasn't the cygwin choice"? Cygwin is using
GNU sed, and has been for many years.  In fact, I currently package sed
for cygwin downstream.

But the downstream choice of whether to add a hack to use open("rt") or
to use the upstream behavior of open("r") [or, in the case of the 4.4-1
package, to remove the hack] is exactly that - something that was
decided downstream by Cygwin people. Upstream has no control over what
additional patches (if any) downstream wants to use or avoid.  This list
is the upstream list. If you want your complaints to be heard by the
cygwin community at large, so other cygwin users can chime in on the
behavior that is best for the Cygwin distribution, then reach out to the
cygwin community, not this list.

> 
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=25707
> 
> Reading through those notes, it appears that the sed changes were
> predicated on something called a "text mount".

Yes, a cygwin text mount is what controls whether open("r") will strip
\r (stripped from a text mount, left intact from a binary mount).

> 
> And you were depending on the OS to " undossify_input".

undossify_input is not in the sed sources.  Now you are pointing to a
grep bug (yes, it is related, but let's be careful on what we are
attributing to sed, vs. what we are attributing to downstream).

For years, the grep project had a function named undossify_input that
tried to manually strip \r from files - except that it didn't do as
advertised. It didn't do anything on text mount files (since \r was
already stripped), and it incorrectly removed \r from binary files
(where the whole point of a binary file is that it is NOT supposed to
have \r stripped).  So, as part of downstream cygwin unifying the
behavior of grep, sed, and awk, I pointed out that we could apply an
upstream patch to grep to greatly simplify the upstream source by
ripping out special-case code for Cygwin that wasn't correct in the
first place.

> And DTRT is dependent on the cygwin kernel to perform this service
> so that sed/awk/grep/et.al wouldn't have to deal with it.

Yes, you generally want the right behavior to be done at a common place,
so that it doesn't have to be duplicated everywhere else.

> 
> If so, it appears that cygwin does not have that "strip \r" functionality
> and that's why it is failing for me.

Huh? Cygwin DOES have the ability to strip \r from files. You get it by
mounting the directory containing the file as a text mount.

Maybe you are asking whether cygwin should have an ability to
automatically strip \r from pipelines where the source end of the pipe
is not a native cygwin process (and therefore more likely to be
producing \r), since pipelines are not a file system that you mount and
therefore can't be given a mount option of text-vs-binary.  I don't know
of such an ability at the present (years ago, you could set a substring
in the $CYGWIN environment variable to force ALL pipelines to be in text
mode, but it was removed years ago because of the problems it used to
cause).  But maybe it is worth proposing a patch to add such a
capability back into cygwin1.dll.  And doing it on JUST pipelines
connected to native processes, rather than all pipelines, will let
cygwin continue to behave sanely on pipelines from other cygwin
processes (where you WANT binary handling).

But such a patch would be to cygwin1.dll (which is NOT maintained here),
so you'd have to propose it downstream to the cygwin list.

> 
> How close am I getting to fully  understanding this?

It's hard for me to say, because I feel like I have been repeating the
same things.  In particular, I've repeated my plea for you to take this
to the cygwin list, and yet here you are still asking on the upstream
sed list.

> 
> I can imagine there might be quite a lot of "sed consumers" who
> will also experience this failure.

There have been one or two threads on the cygwin list in the past three
months from other people surprised that mixing native Windows processes
with cygwin sed/grep/awk has changed behavior, but surprisingly few given
how much other traffic the cygwin list gets.  For example:
https://cygwin.com/ml/cygwin/2017-05/msg00161.html

>  And those OS consumers also
> have to deal with over-the-wall Windows and Mac files in their
> environment.

If you are already dealing with files created by one OS and mounted in
another OS, then you should already be quite familiar with how to filter
your files to have desirable line endings for the system where you plan
to process the file.

Furthermore, Cygwin tries hard to emulate Linux. How would you process a
file containing \r\n that you copied onto Linux? That's exactly the same
way you should process that file on cygwin (at least, with default
binary mounts).  The fact that cygwin used to have a hack where it
filtered \r on your behalf (unlike what Linux would do), and now no
longer has that hack, should not affect you if you had already been
stripping \r yourself.  And the fact that the hack corrupted binary
data, and now cygwin can process binary data without corruption, is one
of the stronger justifications why cygwin maintainers decided to remove
the hack.  There was a long email thread on the subject at the time:
https://cygwin.com/ml/cygwin/2017-02/threads.html#00152
(titled Updated [test]: sed-4.4-1)

and probably several others that I didn't bother to locate while typing
this.
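
As an illustration of stripping \r yourself, the same way you would on
Linux, here is a minimal sketch of the quoting filter from the original
report with one extra expression. It assumes GNU sed (for '\r' in the
regex); the function name is made up, and 'tr -d "\r"' or dos2unix in front
of sed would be alternatives:

```shell
# Hypothetical helper: single-quote each input line, dropping a trailing CR
# first so that CRLF and LF input give the same result.
quote_lines() {
    sed -e 's/\r$//' -e "s/^/'/" -e "s/\$/'/"
}
```

It would be used as 'cat t.out | quote_lines' or 'winpgm | quote_lines'.
Only a CR directly before the line end is removed; a CR in the middle of a
line is left alone.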

> 
> I never heard of a text/binary mount point that would cause
> an operating system to treat text files differently.
> 
> Do you have a pointer to some literature that explains that so I can educate
> myself?

How about these pages of the Cygwin documentation:
https://cygwin.com/cygwin-ug-net/using-textbinary.html
https://cygwin.com/cygwin-ug-net/using.html#mount-table

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#26879; Package sed. (Mon, 15 May 2017 22:56:02 GMT) Full text and rfc822 format available.

Message #81 received at 26879-done <at> debbugs.gnu.org (full text, mbox):

From: Dick Dunbar <dick.dunbar <at> gmail.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: Assaf Gordon <assafgordon <at> gmail.com>, 26879-done <at> debbugs.gnu.org
Subject: Re: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Mon, 15 May 2017 15:55:44 -0700
[Message part 1 (text/plain, inline)]
You've been generous and patient.  Thanks Eric.  Especially appreciate
those url's.
Still avoiding cygwin list because this is universal; Peace on Earth
restored.


1. bash-on-linux-on-windows ... is ubuntu ported distro.
   $ uname -r
      4.4.0-43-Microsoft
   $ echo $TERM
      xterm-256color
   $ sudo apt install dos2unix
   $ winpgm | head -5 > x.out
   $ vi x.out    ## vi identifies it as "dos" file

2. Windows WSL ends each output line with "#" (it's not term related;
   cygwin does too)

    $ alias od="/usr/bin/od -tx1z"     ## I like your choice. I have a similar debugging .h file
    $ od x.out | head -6

    0000000 43 3a 5c 54 6f 6f 6c 73 5c 4d 69 63 72 6f 73 6f  >C:\Tools\Microso<
    0000020 66 74 20 56 69 73 75 61 6c 20 53 74 75 64 69 6f  >ft Visual Studio<
    0000040 5c 32 30 31 37 5c 43 6f 6d 6d 75 6e 69 74 79 5c  >\2017\Community\<
    0000060 57 65 62 5c 45 78 74 65 72 6e 61 6c 5c 6e 6f 64  >Web\External\nod<
    0000100 65 5f 6d 6f 64 75 6c 65 73 5c 65 73 35 2d 65 78  >e_modules\es5-ex<
    0000120 74 5c 61 72 72 61 79 5c 23 0d 0a 43 3a 5c 54 6f  >t\array\#..C:\To<

3. A universal filter for windows might look like this:

   Encapsulated in a sed file filter (because bash/alias quoting is too
   obscure)
   $ cat ~/qt.sed
     #!/bin/sh
     dos2unix | sed -e "s/\#$//" -e "s/^.*/'&'/"

  Also works with Mac files.
   $ cat ~/qtm.sed
     #!/bin/sh
     dos2unix -c mac| sed -e "s/\#$//" -e "s/^.*/'&'/"

   $ chmod +x *.sed

   $ cat ~/.alias
     alias qt='~/qt.sed'
     alias qtm='~/qtm.sed'

   $ source .alias

    $ cat x.out | qt      ## Works on native linux (\n) output too

    $ touch 'phuny #1 file'

    $ ls -1 ph* |dos2unix |sed -e "s/\#$//" -e "s/^.*/'&'/"
      'phuny #1 file'
      'physmem.exe'

   $ ls -1 | qt    # works
   $ ls -1 |qtm  # also works on linux


4. The resolution works on Ubuntu 16.04 (  VirtualBox installation )
     $ uname -r
        4.8.0-51.generic
     $ sudo apt install dos2unix

It's resolved in a straightforward and robust way.  Thanks for the help.
[Message part 2 (text/html, inline)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 13 Jun 2017 11:24:06 GMT) Full text and rfc822 format available.

This bug report was last modified 6 years and 312 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.