GNU bug report logs - #35920
strftime incorrectly assumes that nstrftime will produce UTF-8

Package: guile

Reported by: Mark H Weaver <mhw <at> netris.org>

Date: Sun, 26 May 2019 20:45:02 UTC

Severity: normal

To reply to this bug, email your comments to 35920 AT debbugs.gnu.org.

Report forwarded to bug-guile <at> gnu.org:
bug#35920; Package guile. (Sun, 26 May 2019 20:45:02 GMT)

Acknowledgement sent to Mark H Weaver <mhw <at> netris.org>:
New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Sun, 26 May 2019 20:45:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Mark H Weaver <mhw <at> netris.org>
To: Christopher Lam <christopher.lck <at> gmail.com>
Cc: bug-guile <at> gnu.org
Subject: strftime incorrectly assumes that nstrftime will produce UTF-8
Date: Sun, 26 May 2019 16:41:57 -0400

Hi Christopher,

Christopher Lam <christopher.lck <at> gmail.com> writes:

> Addendum - wish to confirm if guile bug (guile-2.2 on Windows):
> - set locale to non-Anglo so that (setlocale LC_ALL) returns
> "French_France.1252"
> - call (strftime "%B" 4000000) - that's 4x10^6 -- this should return
> "février 1970"
>
> but the following error arises:
> Throw to key `decoding-error' with args `("scm_from_utf8_stringn" "input
> locale conversion error" 0 #vu8(102 233 118 114 105 101 114 32 49 57 55
> 48))'.
>
> Is this a bug?

Yes.  Guile's 'strftime' procedure currently assumes that the underlying
'nstrftime' C function (from Gnulib) will produce output in UTF-8,
although it almost certainly produces output in the locale encoding.
Indeed, the bytevector #vu8(102 233 118 114 105 101 114 32 49 57 55 48)
represents the characters "février 1970" in Windows-1252 encoding.
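
To illustrate the diagnosis, here is a minimal C sketch that checks whether the reported bytes are well-formed UTF-8 and decodes them as Windows-1252. The helper names `is_valid_utf8` and `cp1252_subset_to_utf8` are hypothetical; they are not Guile or Gnulib functions. The validator only checks lead and continuation byte structure, and the decoder assumes input limited to ASCII and bytes 0xA0..0xFF, the range where Windows-1252 agrees with ISO-8859-1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if the LEN bytes at S have well-formed UTF-8 structure.
   Simplified: overlong forms and surrogates are not rejected.  */
static bool
is_valid_utf8 (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = s[i];
      size_t extra;                 /* number of continuation bytes */
      if (c < 0x80)
        extra = 0;
      else if ((c & 0xE0) == 0xC0)
        extra = 1;
      else if ((c & 0xF0) == 0xE0)
        extra = 2;
      else if ((c & 0xF8) == 0xF0)
        extra = 3;
      else
        return false;               /* stray continuation or invalid lead */
      if (extra > len - i - 1)
        return false;               /* sequence truncated */
      for (size_t k = 1; k <= extra; k++)
        if ((s[i + k] & 0xC0) != 0x80)
          return false;             /* expected a continuation byte */
      i += extra + 1;
    }
  return true;
}

/* Convert LEN Windows-1252 bytes at IN to NUL-terminated UTF-8 in OUT.
   Assumption: only ASCII and 0xA0..0xFF are handled, where the byte
   value equals the Unicode code point.  Returns the number of bytes
   written (excluding the NUL), or -1 on an unsupported byte or when
   OUT is too small.  */
static long
cp1252_subset_to_utf8 (const unsigned char *in, size_t len,
                       char *out, size_t out_size)
{
  size_t o = 0;
  assert (out != NULL);
  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = in[i];
      if (c < 0x80)
        {
          if (o + 1 >= out_size)
            return -1;
          out[o++] = (char) c;
        }
      else if (c >= 0xA0)
        {
          if (o + 2 >= out_size)
            return -1;
          out[o++] = (char) (0xC0 | (c >> 6));
          out[o++] = (char) (0x80 | (c & 0x3F));
        }
      else
        return -1;                  /* 0x80..0x9F needs a lookup table */
    }
  if (o >= out_size)
    return -1;
  out[o] = '\0';
  return (long) o;
}
```

Applied to the twelve bytes from the error message, the validator rejects the input at byte 233 (0xE9), which looks like the lead byte of a three-byte sequence but is followed by the ASCII byte 118 ('v'). The decoder turns 0xE9 into the two-byte UTF-8 sequence 0xC3 0xA9, that is, "é".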

I'm CC'ing this reply to <bug-guile <at> gnu.org>, so that a bug ticket will
be created.  In the future, that's the preferred address for sending bug
reports.

Anyway, thanks for letting us know about this.  I'll work on it soon.

      Mark

Information forwarded to bug-guile <at> gnu.org:
bug#35920; Package guile. (Sun, 26 May 2019 20:56:01 GMT)

Message #8 received at 35920 <at> debbugs.gnu.org:

From: Mark H Weaver <mhw <at> netris.org>
To: Christopher Lam <christopher.lck <at> gmail.com>
Cc: 35920 <at> debbugs.gnu.org
Subject: Re: bug#35920: strftime incorrectly assumes that nstrftime will
 produce UTF-8
Date: Sun, 26 May 2019 16:53:08 -0400

There might also be related problems with 'strptime'.  These problems
date back to when Guile was first extended to support non-ASCII strings.
Here's the relevant commit in 2009 that added non-ASCII support to
'strftime' and 'strptime', but did so imperfectly:
587a33556fdef90025c1b7d4d172af649c8ebba8

       Mark

Information forwarded to bug-guile <at> gnu.org:
bug#35920; Package guile. (Sun, 26 May 2019 21:51:02 GMT)

Message #11 received at 35920 <at> debbugs.gnu.org:

From: Mark H Weaver <mhw <at> netris.org>
To: Christopher Lam <christopher.lck <at> gmail.com>
Cc: 35920 <at> debbugs.gnu.org
Subject: Re: bug#35920: strftime incorrectly assumes that nstrftime will
 produce UTF-8
Date: Sun, 26 May 2019 17:48:27 -0400

Here's a patch that might fix the problem, but I don't have time to test
it right now.

       Mark


--8<---------------cut here---------------start------------->8---
diff --git a/libguile/stime.c b/libguile/stime.c
index b681d7ee3..9a21b61fe 100644
--- a/libguile/stime.c
+++ b/libguile/stime.c
@@ -662,9 +662,9 @@ SCM_DEFINE (scm_strftime, "strftime", 2, 0, 0,
   SCM_VALIDATE_STRING (1, format);
   bdtime2c (stime, &t, SCM_ARG2, FUNC_NAME);
 
-  /* Convert string to UTF-8 so that non-ASCII characters in the
-     format are passed through unchanged.  */
-  fmt = scm_to_utf8_stringn (format, &len);
+  /* Convert the format string to the locale encoding, as the underlying
+     'strftime' C function expects.  */
+  fmt = scm_to_locale_stringn (format, &len);
 
   /* Ugly hack: strftime can return 0 if its buffer is too small,
      but some valid time strings (e.g. "%p") can sometimes produce
@@ -727,7 +727,7 @@ SCM_DEFINE (scm_strftime, "strftime", 2, 0, 0,
 #endif
     }
 
-  result = scm_from_utf8_string (tbuf + 1);
+  result = scm_from_locale_string (tbuf + 1);
   free (tbuf);
   free (myfmt);
 #if HAVE_STRUCT_TM_TM_ZONE
@@ -754,16 +754,16 @@ SCM_DEFINE (scm_strptime, "strptime", 2, 0, 0,
 {
   struct tm t;
   char *fmt, *str, *rest;
-  size_t used_len;
+  SCM used_len;
   long zoff;
 
   SCM_VALIDATE_STRING (1, format);
   SCM_VALIDATE_STRING (2, string);
 
-  /* Convert strings to UTF-8 so that non-ASCII characters are passed
-     through unchanged.  */
-  fmt = scm_to_utf8_string (format);
-  str = scm_to_utf8_string (string);
+  /* Convert strings to the locale encoding, as the underlying
+     'strptime' C function expects.  */
+  fmt = scm_to_locale_string (format);
+  str = scm_to_locale_string (string);
 
   /* initialize the struct tm */
 #define tm_init(field) t.field = 0
@@ -807,14 +807,14 @@ SCM_DEFINE (scm_strptime, "strptime", 2, 0, 0,
   zoff = 0;
 #endif
 
-  /* Compute the number of UTF-8 characters.  */
-  used_len = u8_strnlen ((scm_t_uint8*) str, rest-str);
+  /* Compute the number of characters parsed.  */
+  used_len = scm_string_length (scm_from_locale_stringn (str, rest-str));
   scm_remember_upto_here_2 (format, string);
   free (str);
   free (fmt);
 
   return scm_cons (filltime (&t, zoff, NULL),
-		   scm_from_signed_integer (used_len));
+                   used_len);
 }
 #undef FUNC_NAME
 #endif /* HAVE_STRPTIME */
--8<---------------cut here---------------end--------------->8---

Information forwarded to bug-guile <at> gnu.org:
bug#35920; Package guile. (Mon, 27 May 2019 00:30:02 GMT)

Message #14 received at submit <at> debbugs.gnu.org:

From: Christopher Lam <christopher.lck <at> gmail.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: bug-guile <at> gnu.org
Subject: Re: strftime incorrectly assumes that nstrftime will produce UTF-8
Date: Mon, 27 May 2019 10:04:39 +1000

Thanks! I'm glad to know this. I have adequate fluency in Guile now but
only very basic C, hence some bugs are very opaque to me.

Reply sent to Ludovic Courtès <ludo <at> gnu.org>:
You have taken responsibility. (Sun, 30 Jun 2019 19:52:02 GMT)

Notification sent to Mark H Weaver <mhw <at> netris.org>:
bug acknowledged by developer. (Sun, 30 Jun 2019 19:52:02 GMT)

Message #19 received at 35920-done <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 35920-done <at> debbugs.gnu.org, Christopher Lam <christopher.lck <at> gmail.com>
Subject: Re: bug#35920: strftime incorrectly assumes that nstrftime will
 produce UTF-8
Date: Sun, 30 Jun 2019 21:51:42 +0200

Hi Mark,

Mark H Weaver <mhw <at> netris.org> skribis:

> Here's a patch that might fix the problem, but I don't have time to test
> it right now.

It works! :-)  I wrote tests and pushed it as
ab2fd70ef1e36c6532128b73082809ef3c056556.

I forgot to change the commit author to you before pushing, apologies!

Thanks,
Ludo’.

Information forwarded to bug-guile <at> gnu.org:
bug#35920; Package guile. (Sun, 30 Jun 2019 21:14:01 GMT)

Message #22 received at 35920 <at> debbugs.gnu.org:

From: Mark H Weaver <mhw <at> netris.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 35920 <at> debbugs.gnu.org, Christopher Lam <christopher.lck <at> gmail.com>
Subject: Re: bug#35920: strftime incorrectly assumes that nstrftime will
 produce UTF-8
Date: Sun, 30 Jun 2019 17:12:45 -0400

reopen 35920
thanks

Hi Ludovic,

> Mark H Weaver <mhw <at> netris.org> skribis:
>
>> Here's a patch that might fix the problem, but I don't have time to test
>> it right now.
>
> It works! :-)  I wrote tests and pushed it as
> ab2fd70ef1e36c6532128b73082809ef3c056556.

On my system, I found that my proposed patch caused one of the existing
tests to fail.  The problem is that if the format string includes
characters that are not representable in the current locale encoding, it
will fail.  It seems to me that this could break existing code that
currently works.  User code that uses 'strftime' might never encode the
resulting string in the locale encoding.

I was planning to rewrite the code to scan for the '%' escapes
ourselves, to call 'strftime' for each escape sequence (without
including the surrounding text), and to concatenate the results.
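
The plan described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the code that was written for Guile: `append` and `format_per_directive` are hypothetical helpers, a directive is assumed to be '%' plus an optional 'E' or 'O' modifier plus one conversion character (GNU flags and field widths are ignored), and literal text is copied through as opaque bytes so that it never has to be representable in the locale encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Append N bytes from SRC to OUT at *POS, keeping OUT NUL-terminated.
   Returns 0 on success, -1 if OUT (of size OUT_SIZE) is too small.  */
static int
append (char *out, size_t out_size, size_t *pos, const char *src, size_t n)
{
  if (n >= out_size - *pos)        /* always leave room for the NUL */
    return -1;
  memcpy (out + *pos, src, n);
  *pos += n;
  out[*pos] = '\0';
  return 0;
}

/* Format TM according to FMT, calling strftime once per directive and
   copying the surrounding literal text unchanged.  Returns the length
   of the result, or -1 on a dangling '%' or insufficient space.  */
static long
format_per_directive (char *out, size_t out_size,
                      const char *fmt, const struct tm *tm)
{
  size_t pos = 0;
  assert (out != NULL && out_size > 0);
  out[0] = '\0';

  while (*fmt != '\0')
    {
      if (*fmt != '%')
        {
          /* Literal run up to the next '%' (or the end of FMT).  */
          size_t n = strcspn (fmt, "%");
          if (append (out, out_size, &pos, fmt, n) != 0)
            return -1;
          fmt += n;
          continue;
        }

      /* Directive: '%', optional 'E'/'O' modifier, conversion char.  */
      size_t dlen = 1;
      if (fmt[dlen] == 'E' || fmt[dlen] == 'O')
        dlen++;
      if (fmt[dlen] == '\0')
        return -1;                 /* dangling '%' */
      dlen++;

      if (dlen == 2 && fmt[1] == '%')
        {
          if (append (out, out_size, &pos, "%", 1) != 0)
            return -1;
        }
      else
        {
          char directive[4];       /* at most "%Ex" plus NUL */
          char piece[128];
          memcpy (directive, fmt, dlen);
          directive[dlen] = '\0';
          /* A return of 0 is treated as an empty expansion here; it
             could also mean PIECE was too small.  */
          size_t n = strftime (piece, sizeof piece, directive, tm);
          if (append (out, out_size, &pos, piece, n) != 0)
            return -1;
        }
      fmt += dlen;
    }
  return (long) pos;
}
```

In a real fix each `piece` would be decoded from the locale encoding before concatenation, while the literal runs would stay in whatever encoding the caller's string already uses; the sketch leaves that decoding step out.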

> I forgot to change the commit author to you before pushing, apologies!

No worries.  Thanks for working on it.

      Mark

Did not alter fixed versions and reopened. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 30 Jun 2019 21:14:02 GMT)

Information forwarded to bug-guile <at> gnu.org:
bug#35920; Package guile. (Sun, 30 Jun 2019 22:38:01 GMT)

Message #27 received at 35920 <at> debbugs.gnu.org:

From: John Cowan <cowan <at> ccil.org>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 35920 <at> debbugs.gnu.org, Christopher Lam <christopher.lck <at> gmail.com>,
 Ludovic Courtès <ludo <at> gnu.org>
Subject: Re: bug#35920: strftime incorrectly assumes that nstrftime will
 produce UTF-8
Date: Sun, 30 Jun 2019 18:37:26 -0400

That's a mug's game: I've been there and tried it (not in Scheme). I
recommend writing a strftime in Scheme from scratch.  It's not that hard;
the most annoying thing is getting into the locale files to handle the
locale-sensitive directives (month name, weekday name, AM/PM, and the
ordering of dates).

Information forwarded to bug-guile <at> gnu.org:
bug#35920; Package guile. (Sun, 30 Jun 2019 23:07:02 GMT)

Message #30 received at 35920 <at> debbugs.gnu.org:

From: Mark H Weaver <mhw <at> netris.org>
To: John Cowan <cowan <at> ccil.org>
Cc: 35920 <at> debbugs.gnu.org, Christopher Lam <christopher.lck <at> gmail.com>,
 Ludovic Courtès <ludo <at> gnu.org>
Subject: Re: bug#35920: strftime incorrectly assumes that nstrftime will
 produce UTF-8
Date: Sun, 30 Jun 2019 19:06:28 -0400

Hi John,

John Cowan <cowan <at> ccil.org> writes:

> That's a mug's game: I've been there and tried it (not in Scheme). I
> recommend writing a strftime in Scheme from scratch.  It's not that
> hard; the most annoying thing is getting into the locale files to
> handle the locale-sensitive directives (month name, weekday name,
> AM/PM, and the ordering of dates).

Is there a portable way to find the relevant locale files and interpret
them, on both POSIX and Windows systems?  If so, can you point out the
relevant documentation?

      Thanks,
        Mark

Information forwarded to bug-guile <at> gnu.org:
bug#35920; Package guile. (Mon, 01 Jul 2019 01:29:01 GMT)

Message #33 received at 35920 <at> debbugs.gnu.org:

From: John Cowan <cowan <at> ccil.org>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 35920 <at> debbugs.gnu.org, Christopher Lam <christopher.lck <at> gmail.com>,
 Ludovic Courtès <ludo <at> gnu.org>
Subject: Re: bug#35920: strftime incorrectly assumes that nstrftime will
 produce UTF-8
Date: Sun, 30 Jun 2019 21:28:18 -0400

On Sun, Jun 30, 2019 at 7:06 PM Mark H Weaver <mhw <at> netris.org> wrote:

> Is there a portable way to find the relevant locale files and interpret
> them, on both POSIX and Windows systems?  If so, can you point out the
> relevant documentation?

Portable in the sense that the information can be obtained on both Posix
and Windows, but not with exactly the same code.

On POSIX, you need the nl_langinfo() and nl_langinfo_l() functions from
<langinfo.h>.  These functions are documented at
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/nl_langinfo.html>,
and the constants at
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/langinfo.h.html>.
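
As a rough illustration of the POSIX side, the sketch below looks up a month name with nl_langinfo() and the MON_1 through MON_12 constants from <langinfo.h>. The wrapper name `copy_month_name` is hypothetical, and the code is POSIX-only; it does not build on Windows, which has no <langinfo.h>. The name is copied into a caller-supplied buffer because the pointer returned by nl_langinfo() may be invalidated by later calls.

```c
#include <assert.h>
#include <langinfo.h>
#include <stddef.h>
#include <string.h>

/* Copy the full name of MONTH (1..12) in the current LC_TIME locale
   into OUT, in that locale's own encoding.  Returns 0 on success, or
   -1 if MONTH is out of range or OUT (of size OUT_SIZE) is too small.  */
static int
copy_month_name (int month, char *out, size_t out_size)
{
  /* POSIX does not promise that MON_1..MON_12 are consecutive values,
     so they are listed explicitly.  */
  static const nl_item items[12] = {
    MON_1, MON_2, MON_3, MON_4, MON_5, MON_6,
    MON_7, MON_8, MON_9, MON_10, MON_11, MON_12
  };

  assert (out != NULL);
  if (month < 1 || month > 12)
    return -1;

  const char *name = nl_langinfo (items[month - 1]);
  size_t n = strlen (name);
  if (n >= out_size)
    return -1;
  memcpy (out, name, n + 1);       /* include the terminating NUL */
  return 0;
}
```

A program that wants the user's locale rather than the default "C" locale would call setlocale (LC_ALL, "") first. The returned bytes are in the locale's encoding, which nl_langinfo (CODESET) reports, so they still need decoding before being treated as UTF-8.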

On Windows, you need to call EnumCalendarInfoExEx if you have dropped
support for Vista and earlier versions, or if not, then follow the links
from the page about it.  The function is documented at
<https://docs.microsoft.com/en-us/windows/desktop/api/Winnls/nf-winnls-enumcalendarinfoexex>,
and the constants that specify particular pieces of information at
<https://docs.microsoft.com/en-us/windows/desktop/Intl/calendar-type-information>.
(I have never used these interfaces myself.)

I hope this is helpful.

John Cowan          http://vrici.lojban.org/~cowan        cowan <at> ccil.org
Eric Raymond is the Margaret Mead of the Open Source movement.
          --Bruce Perens, a long time ago

Information forwarded to bug-guile <at> gnu.org:
bug#35920; Package guile. (Tue, 02 Jul 2019 08:59:01 GMT)

Message #36 received at 35920 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 35920 <at> debbugs.gnu.org, Christopher Lam <christopher.lck <at> gmail.com>,
 John Cowan <cowan <at> ccil.org>
Subject: Re: bug#35920: strftime incorrectly assumes that nstrftime will
 produce UTF-8
Date: Tue, 02 Jul 2019 10:58:32 +0200

Hi,

Mark H Weaver <mhw <at> netris.org> skribis:

> John Cowan <cowan <at> ccil.org> writes:
>
>> That's a mug's game: I've been there and tried it (not in Scheme). I
>> recommend writing a strftime in Scheme from scratch.  It's not that
>> hard; the most annoying thing is getting into the locale files to
>> handle the locale-sensitive directives (month name, weekday name,
>> AM/PM, and the ordering of dates).
>
> Is there a portable way to find the relevant locale files and interpret
> them, on both POSIX and Windows systems?  If so, can you point out the
> relevant documentation?

The (ice-9 i18n) module provides bindings to nl_langinfo et al.  The
actual data format is specific to the C library, so I think we cannot
portably go deeper than what (ice-9 i18n) does.

Ludo’.

Information forwarded to bug-guile <at> gnu.org:
bug#35920; Package guile. (Tue, 02 Jul 2019 09:08:01 GMT)

Message #39 received at 35920 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 35920 <at> debbugs.gnu.org, Christopher Lam <christopher.lck <at> gmail.com>
Subject: Re: bug#35920: strftime incorrectly assumes that nstrftime will
 produce UTF-8
Date: Tue, 02 Jul 2019 11:07:01 +0200

Hi Mark,

Mark H Weaver <mhw <at> netris.org> skribis:

>> Mark H Weaver <mhw <at> netris.org> skribis:
>>
>>> Here's a patch that might fix the problem, but I don't have time to test
>>> it right now.
>>
>> It works! :-)  I wrote tests and pushed it as
>> ab2fd70ef1e36c6532128b73082809ef3c056556.
>
> On my system, I found that my proposed patch caused one of the existing
> tests to fail.

Which test?  In commit ab2fd70ef1e36c6532128b73082809ef3c056556 I
modified the test that passes \u0100 to run in a UTF-8 locale, on the
grounds that the previous behavior was fragile: “raw bytes” of the input
string would be preserved, but they could be mixed with things like
month names in the current locale encoding.  The result is rather
unpredictable.

> The problem is that if the format string includes characters that are
> not representable in the current locale encoding, it will fail.  It
> seems to me that this could break existing code that currently works.
> User code that uses 'strftime' might never encode the resulting string
> in the locale encoding.

In theory yes, but I cannot think of a scenario where the previous
behavior would be “useful”, because it’s hard to even describe what it
means.

> I was planning to rewrite the code to scan for the '%' escapes
> ourselves, to call 'strftime' for each escape sequence (without
> including the surrounding text), and to concatenate the results.

I think we should deprecate ‘strftime’ and ‘strptime’: (srfi srfi-19)
provides similar functionality, it uses (ice-9 i18n) for the locale
stuff, and it has a better API.

Perhaps something we can do in 3.0?

Thanks,
Ludo’.

Information forwarded to bug-guile <at> gnu.org:
bug#35920; Package guile. (Tue, 02 Jul 2019 16:52:02 GMT)

Message #42 received at 35920 <at> debbugs.gnu.org:

From: John Cowan <cowan <at> ccil.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 35920 <at> debbugs.gnu.org, Mark H Weaver <mhw <at> netris.org>,
 Christopher Lam <christopher.lck <at> gmail.com>
Subject: Re: bug#35920: strftime incorrectly assumes that nstrftime will
 produce UTF-8
Date: Tue, 2 Jul 2019 12:51:18 -0400

On Tue, Jul 2, 2019 at 5:08 AM Ludovic Courtès <ludo <at> gnu.org> wrote:

> I think we should deprecate ‘strftime’ and ‘strptime’: (srfi srfi-19)
> provides similar functionality, it uses (ice-9 i18n) for the locale
> stuff, and it has a better API.

Just a heads-up.  I don't consider SRFI 19 to have a very good API, and I'm
working on a pre-SRFI for dates and times.  There is an outline of it (very
subject to change) at
<https://bitbucket.org/cowan/r7rs-wg1-infra/src/default/TimeAdvancedCowan.md>.
Note that it does not do localization except for timezones, however, so is
probably not directly relevant.  I'd appreciate review comments at
cowan <at> ccil.org anyway.  Thanks.

John Cowan          http://vrici.lojban.org/~cowan        cowan <at> ccil.org
Is a chair finely made tragic or comic? Is the portrait of Mona Lisa
good if I desire to see it? Is the bust of Sir Philip Crampton lyrical,
epical or dramatic?  If a man hacking in fury at a block of wood make
there an image of a cow, is that image a work of art? If not, why not?
                --Stephen Dedalus

This bug report was last modified 5 years and 296 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.