GNU bug report logs - #13947
bug report for core-utils command : OD

Package: coreutils;

Reported by: Marc Grondin <marc.grondin <at> oracle.com>

Date: Wed, 13 Mar 2013 20:25:02 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 13947 in the body.
You can then email your comments to 13947 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#13947; Package coreutils. (Wed, 13 Mar 2013 20:25:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Marc Grondin <marc.grondin <at> oracle.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 13 Mar 2013 20:25:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Marc Grondin <marc.grondin <at> oracle.com>
To: <bug-coreutils <at> gnu.org>
Cc: Mark.Jaeger <at> oracle.com
Subject: bug report for core-utils command :  OD
Date: Wed, 13 Mar 2013 13:16:16 -0700 (PDT)
Good Afternoon, 

My client was attempting to run the command : od -c on this xml file (sample only) 
------------------------------------------------------------------------------
<?xml version = '1.0' encoding = 'UTF-8'?>
<top>
   <x>丸</x>
   <y>丸</y>
   <z>𠄌</z>
   <x>?</x>
   <x>?</x>
   <x>?丸</x>
   <x>??丸</x>
</top>
------------------------------------------------------------------------------

note : this system is a : 2.6.18-164.0.0.0.1.el5xen #1 SMP Thu Sep 3 00:34:43 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

He was getting this output : 
------------------------------------------------------------------------------
0000000   <   ?   x   m   l       v   e   r   s   i   o   n       =    
0000020   '   1   .   0   '       e   n   c   o   d   i   n   g       =
0000040       '   U   T   F   -   8   '   ?   >  \n   <   t   o   p   >
0000060  \n               <   x   >   �   �   �   <   /   x   >  \n    
0000100           <   y   >   �   �   � 201   <   /   y   >  \n        
0000120       <   z   >   �   � 204 214   <   /   z   >  \n            
0000140   <   x   >   ?   <   /   x   >  \n               <   x   >   ?
0000160   <   /   x   >  \n               <   x   >   ?   �   �   � 201
0000200   <   /   x   >  \n               <   x   >   ?   ?   �   �   �
0000220 201   <   /   x   >  \n   <   /   t   o   p   >  \n
------------------------------------------------------------------------------

Instead of this : 
------------------------------------------------------------------------------
0000000   <   ?   x   m   l       v   e   r   s   i   o   n       =
0000020   '   1   .   0   '       e   n   c   o   d   i   n   g       =
0000040       '   U   T   F   -   8   '   ?   >  \n   <   t   o   p   >
0000060  \n               <   x   > 344 270 270   <   /   x   >  \n    
0000100           <   y   > 360 257 240 201   <   /   y   >  \n        
0000120       <   z   > 360 240 204 214   <   /   z   >  \n            
0000140   <   x   >   ?   <   /   x   >  \n               <   x   >   ?
0000160   <   /   x   >  \n               <   x   >   ? 360 257 240 201
0000200   <   /   x   >  \n               <   x   >   ?   ? 360 257 240
0000220 201   <   /   x   >  \n   <   /   t   o   p   >  \n
0000235
------------------------------------------------------------------------------

This is all based on the LANG environment variable.  He was using : 
LANG=en_US.iso88591, instead of
LANG=en_US.UTF-8 

------------------------------------------------------------------------------

Question : 
Since the output is based on the ASCII character set, should it not give numerical output in both cases (as it did in scenario #2) 
for a symbol outside the ASCII/extended-ASCII character set ? 
------------------------------------------------------------------------------


Regards, 

Marc Grondin, 

__________________________________
Oracle - Quebec city, Qc.
Senior System Administrator, PDIT
---------------------------------
400-330 St-Vallier, G1K 9C5
418.524.5665 # 1256
=================================




Information forwarded to bug-coreutils <at> gnu.org:
bug#13947; Package coreutils. (Wed, 13 Mar 2013 21:36:02 GMT) Full text and rfc822 format available.

Message #8 received at 13947 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Marc Grondin <marc.grondin <at> oracle.com>
Cc: Mark.Jaeger <at> oracle.com, 13947 <at> debbugs.gnu.org
Subject: Re: bug#13947: bug report for core-utils command :  OD
Date: Wed, 13 Mar 2013 15:34:14 -0600
[Message part 1 (text/plain, inline)]
On 03/13/2013 02:16 PM, Marc Grondin wrote:
> Good Afternoon, 

Hello, and thanks for the report.

> 
> My client was attempting to run the command : od -c on this xml file (sample only) 
> ------------------------------------------------------------------------------
> <?xml version = '1.0' encoding = 'UTF-8'?>
> <top>
>    <x>丸</x>

Here, you are representing a character in UTF-8

> He was getting this output : 
> ------------------------------------------------------------------------------
> 0000000   <   ?   x   m   l       v   e   r   s   i   o   n       =    
> 0000020   '   1   .   0   '       e   n   c   o   d   i   n   g       =
> 0000040       '   U   T   F   -   8   '   ?   >  \n   <   t   o   p   >
> 0000060  \n               <   x   >   �   �   �   <   /   x   >  \n    

and here, you were running od in a different character set:

> This all based on the LANG env.  He was using : 
> LANG=en_US.iso88591, instead of
> LANG=en_US.UTF-8 

In ISO-8859-1, every byte is a character, and those particular bytes
happen to be printable, so od was faithfully replaying the character as
printable, only to then be shown by your UTF-8 terminal as an invalid
UTF-8 sequence.  Mismatching character sets between your program and
your terminal is always a recipe for confusion.
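
The mismatch described above can be reproduced outside od.  What follows is
a minimal Python sketch, assuming the three bytes 344 270 270 (octal) that
the report lists for the first character.  The codec names "utf-8" and
"latin-1" stand in for the two locales, and the helper terminal_view is
hypothetical; none of this comes from od itself.

```python
# The three bytes the report shows (in octal) for the character in <x>.
raw = bytes([0o344, 0o270, 0o270])

# Read as UTF-8, the bytes form a single character (U+4E38).
as_utf8 = raw.decode("utf-8")

# Read as ISO-8859-1 (latin-1), each byte is its own character.
as_latin1 = raw.decode("latin-1")


def terminal_view(byte):
    """Hypothetical helper: what a UTF-8 decoder makes of one isolated byte."""
    return bytes([byte]).decode("utf-8", errors="replace")


# od -c separates the bytes into cells, so a UTF-8 terminal sees each one alone
# and substitutes U+FFFD (the replacement character) for it.
views = [terminal_view(b) for b in raw]

print(len(as_utf8), len(as_latin1), views)
```

Each byte on its own is an incomplete or invalid UTF-8 sequence, which is why
the terminal showed a replacement character in each cell.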

However, you HAVE identified a bug, in our documentation.

> 
> ------------------------------------------------------------------------------
> 
> Question : 
> Since the output is based on the ASCII character set, should it not, in both cases give a numerical output (as it did in scenario #2) 
> for a symbol outside the ascii/extended-ascii character set ? 

Our documentation is lying.  Here's what POSIX says about od -c:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
"Interpret bytes as characters specified by the current setting of the
LC_CTYPE category. Certain non-graphic characters appear as C escapes:
"NUL=\0" , "BS=\b" , "FF=\f" , "NL=\n" , "CR=\r" , "HT=\t" ; others
appear as 3-digit octal numbers."

Nothing in there restricts the output to ASCII only.  The bytes that are
showing up as � are graphic characters in your current choice of
LC_CTYPE, so there is no escaping performed (since escaping is permitted
only on non-graphic characters).  If your terminal was using the same
character set as you ran od under, you would see proper graphical
characters in the ISO-8859-1 set (but then again, you wouldn't see the
nice 丸 character that the UTF-8 was representing).
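
The rule quoted from POSIX can be modelled roughly as follows.  This is a
sketch under stated assumptions, not the coreutils implementation: a Python
codec name stands in for LC_CTYPE, str.isprintable() approximates the C
library's notion of a printable character (the two can disagree, for
example on byte 0xA0), and the names od_c_cell and od_c_cells are
hypothetical.

```python
# The six C escapes that the POSIX text lists for od -c.
C_ESCAPES = {0x00: "\\0", 0x08: "\\b", 0x0C: "\\f",
             0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t"}


def od_c_cell(byte, encoding):
    """Render one byte per the quoted rule (hypothetical helper).

    `encoding` is a Python codec name used as a stand-in for LC_CTYPE.
    """
    if byte in C_ESCAPES:
        return C_ESCAPES[byte]
    try:
        ch = bytes([byte]).decode(encoding)
    except UnicodeDecodeError:
        # The byte is not a complete character by itself: 3-digit octal.
        return format(byte, "03o")
    # Printable characters pass through; everything else becomes octal.
    return ch if ch.isprintable() else format(byte, "03o")


def od_c_cells(data, encoding):
    """Apply od_c_cell to every byte of `data`."""
    return [od_c_cell(b, encoding) for b in data]


line = b"<x>\xe4\xb8\xb8</x>\n"
print(od_c_cells(line, "latin-1"))  # high bytes pass through as characters
print(od_c_cells(line, "ascii"))    # high bytes become 344 270 270
```

Under this model byte 0x81 comes out as 201 even with "latin-1", which is
consistent with the 201 cells in the reported output.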

Coreutils is properly obeying the locale; what is wrong is the info
documentation, which stated:

`-c'
     Output as ASCII characters or backslash escapes.

In reality, that should state something like:
     Output as characters in the current locale, using octal sequences
or backslash escapes for all non-graphic bytes.

Meanwhile, if you want to guarantee ASCII-only output from od, you have
to use a different format, such as -b or -tx1, or use LC_ALL=C on a
system where the C locale does not treat non-ascii bytes as graphical
characters (most glibc systems, including the one you are using, fit
this bill).
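
The reason a purely numeric format is safe can be sketched in a few lines
of Python.  This mirrors the idea behind -b and -tx1 (every byte rendered
as digits), not their exact column layout; the function names octal_bytes
and hex_bytes are hypothetical.

```python
def octal_bytes(data):
    """Render every byte as three octal digits, in the spirit of od -b."""
    return " ".join(format(b, "03o") for b in data)


def hex_bytes(data):
    """Render every byte as two hex digits, in the spirit of od -tx1."""
    return " ".join(format(b, "02x") for b in data)


# The first sample element from the report, encoded as UTF-8.
sample = "<x>\u4e38</x>".encode("utf-8")

print(octal_bytes(sample))
print(hex_bytes(sample))
```

Because the output contains only digits and spaces, it reads the same
whatever character set the terminal uses.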

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#13947; Package coreutils. (Wed, 13 Mar 2013 21:55:02 GMT) Full text and rfc822 format available.

Message #11 received at 13947 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: Mark.Jaeger <at> oracle.com, Marc Grondin <marc.grondin <at> oracle.com>,
	13947 <at> debbugs.gnu.org
Subject: Re: bug#13947: bug report for core-utils command :  OD
Date: Wed, 13 Mar 2013 21:53:39 +0000
On 03/13/2013 09:34 PM, Eric Blake wrote:
> On 03/13/2013 02:16 PM, Marc Grondin wrote:
>> Good Afternoon, 
> 
> Hello, and thanks for the report.
> 
>>
>> My client was attempting to run the command : od -c on this xml file (sample only) 
>> ------------------------------------------------------------------------------
>> <?xml version = '1.0' encoding = 'UTF-8'?>
>> <top>
>>    <x>丸</x>
> 
> Here, you are representing a character in UTF-8
> 
>> He was getting this output : 
>> ------------------------------------------------------------------------------
>> 0000000   <   ?   x   m   l       v   e   r   s   i   o   n       =    
>> 0000020   '   1   .   0   '       e   n   c   o   d   i   n   g       =
>> 0000040       '   U   T   F   -   8   '   ?   >  \n   <   t   o   p   >
>> 0000060  \n               <   x   >   �   �   �   <   /   x   >  \n    
> 
> and here, you were running od in a different character set:
> 
>> This all based on the LANG env.  He was using : 
>> LANG=en_US.iso88591, instead of
>> LANG=en_US.UTF-8 
> 
> In ISO-8859-1, every byte is a character, and those particular bytes
> happen to be printable, so od was faithfully replaying the character as
> printable, only to then be shown by your UTF-8 terminal as an invalid
> UTF-8 sequence.  Mismatching character sets between your program and
> your terminal is always a recipe for confusion.
> 
> However, you HAVE identified a bug, in our documentation.
> 
>>
>> ------------------------------------------------------------------------------
>>
>> Question : 
>> Since the output is based on the ASCII character set, should it not, in both cases give a numerical output (as it did in scenario #2) 
>> for a symbol outside the ascii/extended-ascii character set ? 
> 
> Our documentation is lying.  Here's what POSIX says about od -c:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
> "Interpret bytes as characters specified by the current setting of the
> LC_CTYPE category. Certain non-graphic characters appear as C escapes:
> "NUL=\0" , "BS=\b" , "FF=\f" , "NL=\n" , "CR=\r" , "HT=\t" ; others
> appear as 3-digit octal numbers."
> 
> Nothing in there restricts the output to ASCII only.  The bytes that are
> showing up as � are graphic characters in your current choice of
> LC_CTYPE, so there is no escaping performed (since escaping is permitted
> only on non-graphic characters).  If your terminal was using the same
> character set as you ran od under, you would see proper graphical
> characters in the ISO-8859-1 set (but then again, you wouldn't see the
> nice 丸 character that the UTF-8 was representing).
> 
> Coreutils is properly obeying the locale; what is wrong is the info
> documentation, which stated:
> 
> `-c'
>      Output as ASCII characters or backslash escapes.

I agree. Thanks for the detailed description.

> In reality, that should state something like:

>      Output as characters in the current locale, using octal sequences
> or backslash escapes for all non-graphic bytes.

Note we output spaces, so I'd s/non-graphic/non-printable/.

Also, multi-byte characters are always going to be problematic to display
in a grid like this, so we'll probably continue to do as
we do now for the UTF-8 example above and output octal and dots.
So therefore s/characters/single byte characters/.

> 
> Meanwhile, if you want to guarantee ASCII-only output from od, you have
> to use a different format, such as -b or -tx1, or use LC_ALL=C on a
> system where the C locale does not treat non-ascii bytes as graphical
> characters (most glibc systems, including the one you are using, fit
> this bill).
> 

cheers,
Pádraig.




Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Fri, 22 Mar 2013 15:48:02 GMT) Full text and rfc822 format available.

Notification sent to Marc Grondin <marc.grondin <at> oracle.com>:
bug acknowledged by developer. (Fri, 22 Mar 2013 15:48:03 GMT) Full text and rfc822 format available.

Message #16 received at 13947-done <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: Mark.Jaeger <at> oracle.com, Marc Grondin <marc.grondin <at> oracle.com>,
	13947-done <at> debbugs.gnu.org
Subject: Re: bug#13947: bug report for core-utils command :  OD
Date: Fri, 22 Mar 2013 15:45:49 +0000
[Message part 1 (text/plain, inline)]
On 03/13/2013 09:53 PM, Pádraig Brady wrote:
> On 03/13/2013 09:34 PM, Eric Blake wrote:
>> In reality, that should state something like:
> 
>>      Output as characters in the current locale, using octal sequences
>> or backslash escapes for all non-graphic bytes.
> 
> Note we output spaces, so I'd s/non-graphic/non-printable/.
> 
> Also, multi-byte characters are always going to be problematic to display
> in a grid like this, so we'll probably continue to do as
> we do now for the UTF-8 example above and output octal and dots.
> So therefore s/characters/single byte characters/.

Hopefully the attached clarifies things.

thanks,
Pádraig.
[od-printable.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#13947; Package coreutils. (Fri, 22 Mar 2013 16:06:01 GMT) Full text and rfc822 format available.

Message #19 received at 13947-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: Mark.Jaeger <at> oracle.com, Marc Grondin <marc.grondin <at> oracle.com>,
	13947-done <at> debbugs.gnu.org
Subject: Re: bug#13947: bug report for core-utils command :  OD
Date: Fri, 22 Mar 2013 10:03:47 -0600
[Message part 1 (text/plain, inline)]
On 03/22/2013 09:45 AM, Pádraig Brady wrote:
> Hopefully the attached clarifies things.

> * src/od.c (usage): Mention any printable character is output,
> Not just ASCII.
> * doc/coreutils.texi (od invocation): Further clarify that only
> single byte characters are output (due to the alignment requirement).
> Reported in http://bugs.gnu.org/13947

Yes, this looks good to me.  It could go in as-is, but see my question
below...

> ---
>  doc/coreutils.texi |    6 +++---
>  src/od.c           |    4 ++--
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 

>  @table @samp
>  @item a
>  named character, ignoring high-order bit
>  @item c
> -ASCII character or backslash escape,
> +printable single byte character or backslash escape,

Hmm, we output octal sequences without a backslash; should the info page
be any more verbose that it is one of: a single-byte printable
character, a C backslash escape, or an octal sequence?  Or does that
just clutter things (seeing three octal digits, even without a
backslash, still makes it easy to determine that it can be used as an
escape sequence).

> +++ b/src/od.c
> @@ -339,7 +339,7 @@ suffixes may be . for octal and b for multiply by 512.\n\
>  Traditional format specifications may be intermixed; they accumulate:\n\
>    -a   same as -t a,  select named characters, ignoring high-order bit\n\
>    -b   same as -t o1, select octal bytes\n\
> -  -c   same as -t c,  select ASCII characters or backslash escapes\n\
> +  -c   same as -t c,  select printable characters or backslash escapes\n\

For the --help output, terse is good, so I don't see any improvements to
your change here.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#13947; Package coreutils. (Fri, 22 Mar 2013 16:16:02 GMT) Full text and rfc822 format available.

Message #22 received at 13947 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: Mark.Jaeger <at> oracle.com, Marc Grondin <marc.grondin <at> oracle.com>,
	13947 <at> debbugs.gnu.org
Subject: Re: bug#13947: bug report for core-utils command :  OD
Date: Fri, 22 Mar 2013 16:13:30 +0000
On 03/22/2013 04:03 PM, Eric Blake wrote:
> On 03/22/2013 09:45 AM, Pádraig Brady wrote:
>>  @table @samp
>>  @item a
>>  named character, ignoring high-order bit
>>  @item c
>> -ASCII character or backslash escape,
>> +printable single byte character or backslash escape,
> 
> Hmm, we output octal sequences without a backslash; should the info page
> be any more verbose that it is one of: a single-byte printable
> character, a C backslash escape, or an octal sequence?  Or does that
> just clutter things (seeing three octal digits, even without a
> backslash, still makes it easy to determine that it can be used as an
> escape sequence).

Good point.
I'll make that clarification in the same commit as it at least confirms the
behavior is intended. POSIX is explicit about the three possibilities.

thanks,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#13947; Package coreutils. (Wed, 27 Mar 2013 18:47:02 GMT) Full text and rfc822 format available.

Message #25 received at 13947-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Mark.Jaeger <at> oracle.com
Cc: marc.grondin <at> oracle.com, P <at> draigBrady.com, 13947-done <at> debbugs.gnu.org
Subject: Re: bug#13947: bug report for core-utils command :  OD
Date: Wed, 27 Mar 2013 12:44:08 -0600
[Message part 1 (text/plain, inline)]
On 03/27/2013 12:39 PM, Mark JAEGER wrote:
> Hello Eric,
> 
> The terms "single-byte character" and "single-byte
> printable character" do not sound precise to me.

They are precise - they are characters in the encoding determined by the
current setting of LC_CTYPE.

> 
> A byte is just a byte.  It is NOT a character.
> I.e., it is an octet, or an 8-bit quantity.
> 
> It CAN be interpreted as a character, but only in
> the context of a particular ENCODING.

Yes, but the ENCODING is always known, thanks to the rules on LC_* and
locale handling.

> 
> The help text as it stands now IS precise in talking
> about ASCII, which IS a particular encoding.
> 
> Please don't use the term "single-byte ... character"
> without being precise about what encoding it uses.

The encoding is whatever encoding you asked for.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#13947; Package coreutils. (Wed, 27 Mar 2013 19:25:02 GMT) Full text and rfc822 format available.

Message #28 received at 13947-done <at> debbugs.gnu.org (full text, mbox):

From: Mark JAEGER <Mark.Jaeger <at> oracle.com>
To: eblake <at> redhat.com
Cc: marc.grondin <at> oracle.com, P <at> draigBrady.com, 13947-done <at> debbugs.gnu.org
Subject: Re: bug#13947: bug report for core-utils command :  OD
Date: Wed, 27 Mar 2013 11:39:08 -0700 (PDT)
[Message part 1 (text/plain, inline)]
Hello Eric,

The terms "single-byte character" and "single-byte
printable character" do not sound precise to me.

A byte is just a byte.  It is NOT a character.
I.e., it is an octet, or an 8-bit quantity.

It CAN be interpreted as a character, but only in
the context of a particular ENCODING.

The help text as it stands now IS precise in talking
about ASCII, which IS a particular encoding.

Please don't use the term "single-byte ... character"
without being precise about what encoding it uses.

Regards,

--Mark JAEGER                  phone: 312-651-8329
Sustaining Engineering (formerly DDR)
Server Technologies, Oracle   e-mail: Mark.Jaeger <at> oracle.com


bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 25 Apr 2013 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 11 years and 20 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.