GNU bug report logs - #11197
problems with string ports and unicode


Package: guile

Reported by: Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>

Date: Sat, 7 Apr 2012 20:09:01 UTC

Severity: normal

Done: ludo <at> gnu.org (Ludovic Courtès)

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 11197 in the body.
You can then email your comments to 11197 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-guile <at> gnu.org:
bug#11197; Package guile. (Sat, 07 Apr 2012 20:09:01 GMT)

Acknowledgement sent to Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>:
New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Sat, 07 Apr 2012 20:09:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>
To: bug-guile <at> gnu.org
Subject: problems with string ports and unicode
Date: Sat, 7 Apr 2012 22:07:01 +0200 (CEST)
Hi,

;;;; a very very short example script to describe the problem:

;; open a string port with unicode characters >= 0x0100
(define p (open-input-string "čtyří"))


Put the line into a script and start guile. You will see the output:
=> Backtrace:

That's all, and guile will hang in an eternal loop.

If you enter the line interactively into the REPL, everything works
properly and you can read all characters with (read-char p).



;;;; another very short script, which is possibly the same problem:

;; open a string port and unread a unicode character >= 0x0100
(define p (open-input-string "ibenik"))
(unread-char #\Š p)


Running these two lines as a script generates an error message:
=> ERROR: In procedure unread-char:
=> ERROR: Throw to key `encoding-error' with args
          `("scm_ungetc" "conversion to port encoding failed" 84 #f #\540)'.

If you enter the lines interactively into the REPL, everything works
properly and you can read all characters with (read-char p).


Cheers,
Klaus Stehle


----------------------------
guile --version
guile (GNU Guile) 2.0.5

uname -srm
Linux 2.6.32-5-amd64 x86_64

echo $LANG
de_DE.UTF-8

Information forwarded to bug-guile <at> gnu.org:
bug#11197; Package guile. (Mon, 09 Apr 2012 21:14:02 GMT)

Message #8 received at 11197 <at> debbugs.gnu.org:

From: ludo <at> gnu.org (Ludovic Courtès)
To: Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>
Cc: 11197 <at> debbugs.gnu.org
Subject: Re: bug#11197: problems with string ports and unicode
Date: Mon, 09 Apr 2012 23:12:29 +0200
Hi,

It may be that your string ports are created with a non-Unicode-capable
encoding.  Try something like:

  (define p
    (with-fluids ((%default-port-encoding "UTF-8"))
      (open-input-string "čtyří")))

More details in the manual (info "(guile) String Ports").

How does it work for you?

Ludo’.
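
A minimal sketch of the workaround suggested above, applied to the reporter's second case (the `unread-char` one). The helper names `open-utf8-input-string` and `read-all` are made up for illustration. The sketch assumes the file is read as UTF-8 and that string ports honor `%default-port-encoding` at creation time, as described for Guile 2.0 in this thread.

```scheme
;;; -*- coding: utf-8 -*-

;; Hypothetical helper: create the string port while the default
;; port encoding is bound to UTF-8.
(define (open-utf8-input-string s)
  (with-fluids ((%default-port-encoding "UTF-8"))
    (open-input-string s)))

;; Hypothetical helper: drain a port into a string.
(define (read-all port)
  (let loop ((chars '()))
    (let ((c (read-char port)))
      (if (eof-object? c)
          (list->string (reverse chars))
          (loop (cons c chars))))))

(define p (open-utf8-input-string "ibenik"))
(unread-char #\Š p)          ; push back a character above U+00FF

(define result (read-all p))
```

The encoding is fixed when the port is created, so the `unread-char` call itself does not need to be wrapped in `with-fluids`.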




Information forwarded to bug-guile <at> gnu.org:
bug#11197; Package guile. (Wed, 11 Apr 2012 16:13:02 GMT)

Message #11 received at 11197 <at> debbugs.gnu.org:

From: Mark H Weaver <mhw <at> netris.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 11197 <at> debbugs.gnu.org, Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>
Subject: Re: bug#11197: problems with string ports and unicode
Date: Wed, 11 Apr 2012 12:08:09 -0400
ludo <at> gnu.org (Ludovic Courtès) writes:
> It may be that your string ports are created with a non-Unicode-capable
> encoding.  Try something like:
>
>   (define p
>     (with-fluids ((%default-port-encoding "UTF-8"))
>       (open-input-string "čtyří")))

IMO, this should not be needed.  Port encodings should only be relevant
when reading from ports involving byte strings, such as file ports or
socket ports.  The encoding used by Scheme strings is a purely internal
matter; from the user's perspective, Scheme strings are simply a
sequence of Unicode code points.

What _is_ needed is a file coding declaration near the top of the source
file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
the manual).  I tried that and it still fails for me.

I think this is a genuine bug.

     Mark
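
To illustrate the two points in this message, here is a small sketch of a source file with a coding declaration, viewing the string purely as a sequence of code points. It assumes the file is saved as UTF-8 and that the literal uses precomposed characters; the variable names are illustrative only.

```scheme
;;; -*- coding: utf-8 -*-
;; The coding declaration above sits in a comment near the top of the file.

(define s "čtyří")

;; From the user's point of view the string is a sequence of code
;; points (U+010D, U+0074, U+0079, U+0159, U+00ED), independent of
;; any file or port encoding.
(define code-points (map char->integer (string->list s)))
```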




Information forwarded to bug-guile <at> gnu.org:
bug#11197; Package guile. (Wed, 11 Apr 2012 16:27:01 GMT)

Message #14 received at 11197 <at> debbugs.gnu.org:

From: ludo <at> gnu.org (Ludovic Courtès)
To: Mark H Weaver <mhw <at> netris.org>
Cc: 11197 <at> debbugs.gnu.org, Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>
Subject: Re: bug#11197: problems with string ports and unicode
Date: Wed, 11 Apr 2012 18:25:10 +0200
Hi Mark,

Mark H Weaver <mhw <at> netris.org> skribis:

> ludo <at> gnu.org (Ludovic Courtès) writes:
>> It may be that your string ports are created with a non-Unicode-capable
>> encoding.  Try something like:
>>
>>   (define p
>>     (with-fluids ((%default-port-encoding "UTF-8"))
>>       (open-input-string "čtyří")))
>
> IMO, this should not be needed.  Port encodings should only be relevant
> when reading from ports involving byte strings, such as file ports or
> socket ports.  The encoding used by Scheme strings is a purely internal
> matter; from the user's perspective, Scheme strings are simply a
> sequence of Unicode code points.

Note that “UTF-8” above has nothing to do with Guile’s internal string
representation; it’s just one of the many encodings that can represent
“čtyří”.

> What _is_ needed is a file coding declaration near the top of the source
> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
> the manual).

Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
magically make string ports use that encoding.

> I tried that and it still fails for me.

What fails exactly?

Thanks,
Ludo’.




Information forwarded to bug-guile <at> gnu.org:
bug#11197; Package guile. (Wed, 11 Apr 2012 17:58:02 GMT)

Message #17 received at 11197 <at> debbugs.gnu.org:

From: Mark H Weaver <mhw <at> netris.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 11197 <at> debbugs.gnu.org, Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>
Subject: Re: bug#11197: problems with string ports and unicode
Date: Wed, 11 Apr 2012 13:53:21 -0400
Hi Ludovic,

ludo <at> gnu.org (Ludovic Courtès) writes:
> Mark H Weaver <mhw <at> netris.org> skribis:
>> ludo <at> gnu.org (Ludovic Courtès) writes:
>>> It may be that your string ports are created with a non-Unicode-capable
>>> encoding.  Try something like:
>>>
>>>   (define p
>>>     (with-fluids ((%default-port-encoding "UTF-8"))
>>>       (open-input-string "čtyří")))
>>
>> IMO, this should not be needed.  Port encodings should only be relevant
>> when reading from ports involving byte strings, such as file ports or
>> socket ports.  The encoding used by Scheme strings is a purely internal
>> matter; from the user's perspective, Scheme strings are simply a
>> sequence of Unicode code points.
>
> Note that “UTF-8” above has nothing to do with Guile’s internal string
> representation; it’s just one of the many encodings that can represent
> “čtyří”.

Okay, now I understand.  The problem is that internally, string ports
are implemented by converting the string into a stream of bytes in the
string port's encoding, and then the string port reads those bytes.

Nonetheless, it is very unfortunate that this internal implementation
detail "leaks" out into user code.  SRFI-6 says nothing about port
encodings, and portable code written for SRFI-6 will fail on Guile
unless the string is constrained to whatever the default port encoding
happens to be.

Conceptually, a string port is a textual port, not a binary port.  You
should be able to hand it an arbitrary string and read those characters
from it, as described in SRFI-6, without setting Guile-specific fluid
variables.  Similarly, you should be able to write arbitrary characters
to a string-output-port.

IMO, string ports should use UTF-8 as their initial port encoding, since
we know that UTF-8 can represent any Guile string.  This will allow
portable use of string ports.

I realize that this would change the existing behavior of programs that
use binary I/O on string ports, but as things stand right now, portable
SRFI-6 code is broken on Guile.

What do you think?

>> What _is_ needed is a file coding declaration near the top of the source
>> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
>> the manual).
>
> Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
> magically make string ports use that encoding.
>
>> I tried that and it still fails for me.
>
> What fails exactly?

It fails ungracefully (goes into an infinite loop while trying to print the
backtrace) without the %default-port-encoding setting.  It works when I
add both the %default-port-encoding setting and the coding declaration.

     Thanks,
       Mark
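
The byte-level view described in this message can be sketched with the R6RS bytevector procedures. This only illustrates why UTF-8 is a safe choice (it can encode every code point in the string, whereas ISO-8859-1 stops at U+00FF and so has no č or ř). It is not how `strports.c` performs the conversion, and the variable names are illustrative.

```scheme
(use-modules (rnrs bytevectors))   ; string->utf8, utf8->string

;; "čtyří", built from code points so the sketch does not depend on
;; the encoding of the source file.
(define s
  (list->string (map integer->char '(#x10d #x74 #x79 #x159 #xed))))

;; Five characters become eight bytes in UTF-8: three of them need
;; two bytes each.
(define bytes (string->utf8 s))
(define byte-count (bytevector-length bytes))

;; Decoding the bytes gives back the original string.
(define round-trip (utf8->string bytes))
```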




Information forwarded to bug-guile <at> gnu.org:
bug#11197; Package guile. (Wed, 11 Apr 2012 21:03:02 GMT)

Message #20 received at 11197 <at> debbugs.gnu.org:

From: ludo <at> gnu.org (Ludovic Courtès)
To: Mark H Weaver <mhw <at> netris.org>
Cc: 11197 <at> debbugs.gnu.org, Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>
Subject: Re: bug#11197: problems with string ports and unicode
Date: Wed, 11 Apr 2012 23:01:16 +0200
Hi Mark,

Mark H Weaver <mhw <at> netris.org> skribis:

> Okay, now I understand.  The problem is that internally, string ports
> are implemented by converting the string into a stream of bytes in the
> string port's encoding, and then the string port reads those bytes.

Exactly.

[...]

> Conceptually, a string port is a textual port, not a binary port.

But not in Guile, where there’s no distinction between textual and
binary ports.  One can write code like:

  scheme@(guile-user)> (define (string->utf16 s)
                         (let ((p (with-fluids ((%default-port-encoding "UTF-16BE"))
                                    (open-input-string s))))
                           (get-bytevector-all p)))
  scheme@(guile-user)> (string->utf16 "hello")
  $4 = #vu8(0 104 0 101 0 108 0 108 0 111)
  scheme@(guile-user)> (use-modules (rnrs bytevectors))
  scheme@(guile-user)> (utf16->string $4)
  $5 = "hello"

> You should be able to hand it an arbitrary string and read those
> characters from it, as described in SRFI-6, without setting
> Guile-specific fluid variables.  Similarly, you should be able to
> write arbitrary characters to a string-output-port.

The SRFI-6 issue could be addressed with:

[Message part 2 (text/x-patch, inline)]
diff --git a/module/srfi/srfi-6.scm b/module/srfi/srfi-6.scm
index 098b586..ba946ec 100644
--- a/module/srfi/srfi-6.scm
+++ b/module/srfi/srfi-6.scm
@@ -1,6 +1,6 @@
 ;;; srfi-6.scm --- Basic String Ports
 
-;; 	Copyright (C) 2001, 2002, 2003, 2006 Free Software Foundation, Inc.
+;; 	Copyright (C) 2001, 2002, 2003, 2006, 2012 Free Software Foundation, Inc.
 ;;
 ;; This library is free software; you can redistribute it and/or
 ;; modify it under the terms of the GNU Lesser General Public
@@ -23,10 +23,16 @@
 ;;; Code:
 
 (define-module (srfi srfi-6)
-  #:re-export (open-input-string open-output-string get-output-string))
+  #:export (open-input-string open-output-string)
+  #:re-export (get-output-string))
 
-;; Currently, guile provides these functions by default, so no action
-;; is needed, and this file is just a placeholder.
+(define (open-input-string s)
+  (with-fluids ((%default-port-encoding "UTF-8"))
+    ((@ (guile) open-input-string) s)))
+
+(define (open-output-string)
+  (with-fluids ((%default-port-encoding "UTF-8"))
+    ((@ (guile) open-output-string))))
 
 (cond-expand-provide (current-module) '(srfi-6))
[Message part 3 (text/plain, inline)]
It wouldn’t completely solve the problem.

> IMO, string ports should use UTF-8 as their initial port encoding, since
> we know that UTF-8 can represent any Guile string.  This will allow
> portable use of string ports.

The change was submitted and briefly discussed at
<http://thread.gmane.org/gmane.lisp.guile.devel/9822>.

I think the rationale was mostly backward compatibility (in 1.8 people
could mix Latin-1 textual and binary I/O), consistency with how other
ports behave, and the ability to change the default encoding of string
ports.

> I realize that this would change the existing behavior of programs that
> use binary I/O on string ports, but as things stand right now, portable
> SRFI-6 code is broken on Guile.
>
> What do you think?

In hindsight, UTF-8 does seem like a better default than the locale port
encoding (which is what %default-port-encoding is, by default), but it
does remain useful to specify a different encoding.

>>> What _is_ needed is a file coding declaration near the top of the source
>>> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
>>> the manual).
>>
>> Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
>> magically make string ports use that encoding.
>>
>>> I tried that and it still fails for me.
>>
>> What fails exactly?
>
> It fails ungracefully (goes into an infinite loop while trying to print the
> backtrace) without the %default-port-encoding setting.

Indeed, it’s stuck in a deadlock:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007ffff75e1204 in __lll_lock_wait () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#1  0x00007ffff75dc4d4 in _L_lock_999 () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#2  0x00007ffff75dc2ea in pthread_mutex_lock () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#3  0x00007ffff7b30499 in scm_dynwind_pthread_mutex_lock (mutex=0x7ffff7dd28c0) at threads.c:1962
#4  0x00007ffff7b2bb0e in scm_mkstrport (pos=0x2, str=0x4, modes=327680, caller=<value optimized out>) at strports.c:287
#5  0x00007ffff7aac20b in display_backtrace_body (a=0x7fffffffc1a0) at backtrace.c:487
#6  0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f5d50, argv=0x6fa3b0, nargs=-1) at vm-i-system.c:895
#7  0x00007ffff7ac039e in scm_call_3 (proc=0x7f5d50, arg1=<value optimized out>, arg2=<value optimized out>, arg3=<value optimized out>) at eval.c:500
#8  0x00007ffff7b32504 in scm_internal_catch (tag=<value optimized out>, body=<value optimized out>, body_data=<value optimized out>, handler=<value optimized out>, handler_data=<value optimized out>) at throw.c:222
#9  0x00007ffff7aabbba in scm_display_backtrace_with_highlights (stack=<value optimized out>, port=<value optimized out>, first=<value optimized out>, depth=<value optimized out>, highlights=<value optimized out>)
    at backtrace.c:558
#10 0x00007ffff7ab725e in print_exception_and_backtrace (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:490
#11 pre_unwind_handler (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:534
#12 0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f3ce0, argv=0x6fa300, nargs=-1) at vm-i-system.c:895
#13 0x00007ffff7b4846e in scm_call_with_vm (vm=0x6f61f0, proc=0x7f3ce0, args=<value optimized out>) at vm.c:878
#14 0x00007ffff7b296db in scm_to_stringn (str=0x8dba80, lenp=0x7fffffffc4e8, encoding=<value optimized out>, handler=SCM_FAILED_CONVERSION_ERROR) at strings.c:2102
#15 0x00007ffff7b2bb73 in scm_mkstrport (pos=0x2, str=0x8dba80, modes=196608, caller=<value optimized out>) at strports.c:312
--8<---------------cut here---------------end--------------->8---

This could be fixed by calling ‘scm_new_port_table_entry’ after having
prepared the backing buffer, but the problem is that ‘pt->encoding’ is
needed before.

Thoughts?

Ludo’.
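
The backtrace above shows `scm_mkstrport` being re-entered from the error handler while an outer `scm_mkstrport` call is still in progress, which is where the second lock attempt blocks. The general ordering idea (do the step that can fail before taking the lock) can be sketched in Scheme as follows. This is not the C fix and does not address the `pt->encoding` dependency mentioned above; `table-lock`, `port-table` and `register-string!` are hypothetical stand-ins.

```scheme
(use-modules (ice-9 threads)       ; with-mutex
             (rnrs bytevectors))   ; string->utf8

;; Hypothetical stand-ins for the port table and the lock guarding it.
(define table-lock (make-mutex))
(define port-table '())

(define (register-string! s)
  ;; Step 1: build the backing buffer first. This stands in for the
  ;; encoding conversion that raised the error in the backtrace.
  (let ((bytes (string->utf8 s)))
    ;; Step 2: hold the lock only for the table update, so an error
    ;; handler that creates its own string port never finds it held.
    (with-mutex table-lock
      (set! port-table (cons bytes port-table)))
    bytes))

(register-string! "hello")
```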

Information forwarded to bug-guile <at> gnu.org:
bug#11197; Package guile. (Wed, 20 Jun 2012 21:03:02 GMT)

Message #23 received at 11197 <at> debbugs.gnu.org:

From: ludo <at> gnu.org (Ludovic Courtès)
To: Mark H Weaver <mhw <at> netris.org>
Cc: 11197 <at> debbugs.gnu.org, Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>
Subject: Re: bug#11197: problems with string ports and unicode
Date: Wed, 20 Jun 2012 22:58:39 +0200
Hi,

ludo <at> gnu.org (Ludovic Courtès) skribis:

> @@ -23,10 +23,16 @@
>  ;;; Code:
>  
>  (define-module (srfi srfi-6)
> -  #:re-export (open-input-string open-output-string get-output-string))
> +  #:export (open-input-string open-output-string)
> +  #:re-export (get-output-string))
>  
> -;; Currently, guile provides these functions by default, so no action
> -;; is needed, and this file is just a placeholder.
> +(define (open-input-string s)
> +  (with-fluids ((%default-port-encoding "UTF-8"))
> +    ((@ (guile) open-input-string) s)))
> +
> +(define (open-output-string)
> +  (with-fluids ((%default-port-encoding "UTF-8"))
> +    ((@ (guile) open-output-string))))

I’ve applied it as commit ecb48dccbac6b8fdd969f50a23351ef7f4b91ce5.

Thanks,
Ludo’.
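
With the patch above applied, portable SRFI-6 code should no longer need to touch `%default-port-encoding`. A short usage sketch follows; the variable names are illustrative, and the string is built from code points so that the example does not depend on the source file encoding.

```scheme
(use-modules (srfi srfi-6))   ; open-input-string, open-output-string,
                              ; get-output-string

;; "čtyří", which contains characters above U+00FF.
(define text
  (list->string (map integer->char '(#x10d #x74 #x79 #x159 #xed))))

;; Write the string to a string output port and read it back out.
(define out (open-output-string))
(display text out)
(define copy (get-output-string out))

;; Read the first character from a string input port.
(define in (open-input-string text))
(define first-char (read-char in))
```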




Reply sent to ludo <at> gnu.org (Ludovic Courtès):
You have taken responsibility. (Wed, 20 Jun 2012 21:07:02 GMT)

Notification sent to Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>:
bug acknowledged by developer. (Wed, 20 Jun 2012 21:07:02 GMT)

Message #28 received at 11197-done <at> debbugs.gnu.org:

From: ludo <at> gnu.org (Ludovic Courtès)
To: Mark H Weaver <mhw <at> netris.org>
Cc: 11197-done <at> debbugs.gnu.org, Klaus Stehle <klaus.stehle <at> uni-tuebingen.de>
Subject: Re: bug#11197: problems with string ports and unicode
Date: Wed, 20 Jun 2012 23:03:02 +0200
Hi,

ludo <at> gnu.org (Ludovic Courtès) skribis:

> Indeed, it’s stuck in a deadlock:
>
> (gdb) bt
> #0  0x00007ffff75e1204 in __lll_lock_wait () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
> #1  0x00007ffff75dc4d4 in _L_lock_999 () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
> #2  0x00007ffff75dc2ea in pthread_mutex_lock () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
> #3  0x00007ffff7b30499 in scm_dynwind_pthread_mutex_lock (mutex=0x7ffff7dd28c0) at threads.c:1962
> #4  0x00007ffff7b2bb0e in scm_mkstrport (pos=0x2, str=0x4, modes=327680, caller=<value optimized out>) at strports.c:287
> #5  0x00007ffff7aac20b in display_backtrace_body (a=0x7fffffffc1a0) at backtrace.c:487
> #6  0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f5d50, argv=0x6fa3b0, nargs=-1) at vm-i-system.c:895
> #7  0x00007ffff7ac039e in scm_call_3 (proc=0x7f5d50, arg1=<value optimized out>, arg2=<value optimized out>, arg3=<value optimized out>) at eval.c:500
> #8  0x00007ffff7b32504 in scm_internal_catch (tag=<value optimized out>, body=<value optimized out>, body_data=<value optimized out>, handler=<value optimized out>, handler_data=<value optimized out>) at throw.c:222
> #9  0x00007ffff7aabbba in scm_display_backtrace_with_highlights (stack=<value optimized out>, port=<value optimized out>, first=<value optimized out>, depth=<value optimized out>, highlights=<value optimized out>)
>     at backtrace.c:558
> #10 0x00007ffff7ab725e in print_exception_and_backtrace (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:490
> #11 pre_unwind_handler (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:534
> #12 0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f3ce0, argv=0x6fa300, nargs=-1) at vm-i-system.c:895
> #13 0x00007ffff7b4846e in scm_call_with_vm (vm=0x6f61f0, proc=0x7f3ce0, args=<value optimized out>) at vm.c:878
> #14 0x00007ffff7b296db in scm_to_stringn (str=0x8dba80, lenp=0x7fffffffc4e8, encoding=<value optimized out>, handler=SCM_FAILED_CONVERSION_ERROR) at strings.c:2102
> #15 0x00007ffff7b2bb73 in scm_mkstrport (pos=0x2, str=0x8dba80, modes=196608, caller=<value optimized out>) at strports.c:312
>
> This could be fixed by calling ‘scm_new_port_table_entry’ after having
> prepared the backing buffer, but the problem is that ‘pt->encoding’ is
> needed before.

Fixed in 03fcf93bff9f02a3d12ab86be4e67b996310aad4 (not particularly
elegant, but I couldn’t think of a better way.)  The test in that commit
captures the initial problem.

I’m marking this bug as “done”.  If you would like to discuss string
port encodings, separate binary/textual ports, or any other significant
change, you’re welcome to do so on guile-devel <at> gnu.org, of course.

Thanks!

Ludo’.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 19 Jul 2012 11:24:08 GMT)
