GNU bug report logs - #35785
'string->uri' fails in sv_SE locale

Package: guile;

Reported by: Einar Largenius <einar.largenius <at> gmail.com>

Date: Fri, 17 May 2019 21:21:01 UTC

Severity: important

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 35785 in the body.
You can then email your comments to 35785 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Fri, 17 May 2019 21:21:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Einar Largenius <einar.largenius <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Fri, 17 May 2019 21:21:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Einar Largenius <einar.largenius <at> gmail.com>
To: bug-guix <at> gnu.org
Subject: guix won't download if locale is set to swedish
Date: Fri, 17 May 2019 22:03:53 +0200
Hello.

I just downloaded guix and installed it. In my config I have this line:

    (locale "sv_SE.utf8")

If I run 'guix pull' I get the error:

    guix pull: error: lstat: Filen eller katalogen finns inte: "ftp://sourceware.org/pub/libffi-3.2.1.tar.gz"

The part in Swedish means "file or directory does not exist".

'LANG= guix pull' works without issue.




Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Sat, 18 May 2019 11:56:01 GMT) Full text and rfc822 format available.

Message #8 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Einar Largenius <einar.largenius <at> gmail.com>
Cc: 35785 <at> debbugs.gnu.org
Subject: Re: bug#35785: guix won't download if locale is set to swedish
Date: Sat, 18 May 2019 13:55:20 +0200
Hello Einar,

Einar Largenius <einar.largenius <at> gmail.com> skribis:

> I just downloaded guix and installed it. In my config I have this line:
>
>     (locale "sv_SE.utf8")
>
> If I run 'guix pull' I get the error:
>
>     guix pull: error: lstat: Filen eller katalogen finns inte: "ftp://sourceware.org/pub/libffi-3.2.1.tar.gz"
>
> The part in Swedish means "file or directory does not exist".

Could you paste the complete output of ‘guix pull -v2’ when running
under that locale?

Thanks,
Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Sun, 19 May 2019 17:46:02 GMT) Full text and rfc822 format available.

Message #11 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Einar Largenius <einar.largenius <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 35785 <at> debbugs.gnu.org
Subject: Re: bug#35785: guix won't download if locale is set to swedish
Date: Sun, 19 May 2019 19:45:11 +0200
> Could you paste the complete output of ‘guix pull -v2’ when running
> under that locale?

Yes sorry. I have not set up email yet on that system so I need to
manually transcribe any output. This should be the complete output:

    Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
    Building from this channel:
      guix   https://git.savannah.gnu.org/git/guix.git  f5557bd
    guix pull: error: lstat: Filen eller katalogen finns inte: "ftp://sourceware.org/pub/libffi-3.2.1.tar.gz"




Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Mon, 20 May 2019 08:21:01 GMT) Full text and rfc822 format available.

Message #14 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Einar Largenius <einar.largenius <at> gmail.com>
Cc: 35785 <at> debbugs.gnu.org
Subject: Re: bug#35785: guix won't download if locale is set to swedish
Date: Mon, 20 May 2019 10:20:37 +0200
Einar Largenius <einar.largenius <at> gmail.com> skribis:

>> Could you paste the complete output of ‘guix pull -v2’ when running
>> under that locale?
>
> Yes sorry. I have not set up email yet on that system so I need to
> manually transcribe any output. This should be the complete output:
>
>     Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
>     Building from this channel:
>       guix   https://git.savannah.gnu.org/git/guix.git  f5557bd
>     guix pull: error: lstat: Filen eller katalogen finns inte: "ftp://sourceware.org/pub/libffi-3.2.1.tar.gz"

I can reproduce it:

--8<---------------cut here---------------start------------->8---
$ export GUIX_LOCPATH=$(guix build glibc-locales)/lib/locale
$ LANGUAGE= LC_ALL=sv_SE.utf8 guix pull -p foo
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
Building from this channel:
  guix      https://git.savannah.gnu.org/git/guix.git	0f469c1
guix pull: error: lstat: Filen eller katalogen finns inte: "ftp://sourceware.org/pub/libffi/libffi-3.2.1.tar.gz"
--8<---------------cut here---------------end--------------->8---

Super weird!

Investigating…

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Mon, 20 May 2019 09:15:02 GMT) Full text and rfc822 format available.

Message #17 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Einar Largenius <einar.largenius <at> gmail.com>
Cc: 35785 <at> debbugs.gnu.org
Subject: ‘string->uri’ is locale-dependent and
 breaks in ‘sv_SE’
Date: Mon, 20 May 2019 11:14:04 +0200
Hi!

So the guts of the problem is that Guile’s ‘string->uri’ procedure
behaves incorrectly under that locale:

--8<---------------cut here---------------start------------->8---
$ export GUIX_LOCPATH=$(guix build glibc-locales)/lib/locale
$ LANGUAGE= LC_ALL=sv_SE.utf8 ./pre-inst-env guile
GNU Guile 2.2.4
Copyright (C) 1995-2017 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> ,use(web uri)
scheme@(guile-user)> (string->uri "ftp://sourceware.org/pub/libffi/libffi-3.2.1.tar.gz")
$1 = #f
--8<---------------cut here---------------end--------------->8---

More specifically, ‘parse-authority’ is failing under that locale,
because of the “w”:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ((@@ (web uri) parse-authority) "//sourceware.org" (const 'fail))
$5 = fail
scheme@(guile-user)> ((@@ (web uri) parse-authority) "//sourcevare.org" (const 'fail))
$6 = #f
$7 = "sourcevare.org"
$8 = #f
--8<---------------cut here---------------end--------------->8---

We can boil it down to this example:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use(ice-9 regex)
scheme@(guile-user)> (string-match "[a-z]" "a")
$10 = #("a" (0 . 1))
scheme@(guile-user)> (string-match "[a-z]" "w")
$11 = #f
--8<---------------cut here---------------end--------------->8---

In short, under the sv_SE.utf8 locale of glibc 2.28, “w” is not
considered part of the ‘a-z’ interval.

Indeed, ‘localedata/locales/sv_SE’ in glibc reads this:

  % The letter w is normally not present in the Swedish alphabet. It
  % exists in some names in Swedish and foreign words, but is accounted
  % for as a variant of 'v'.  Words and names with 'w' are in Swedish
  % ordered alphabetically among the words and names with 'v'. If two
  % words or names are only to be distinguished by 'v' or 'w', 'v' is
  % placed before 'w'.

Using the “lower” regexp class instead of “[a-z]” works:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> (string-match "[[:lower:]]" "w")
$12 = #("w" (0 . 1))
--8<---------------cut here---------------end--------------->8---

However, it’s not clear to me whether the “lower” class is supposed to
be the same for all locales or if we’re just lucky:

  http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

Thoughts?

The workaround until we’ve fixed it is to use another locale, though you
can still set “LC_MESSAGES=sv_SE.utf8” or “LANGUAGE=sv”.

Ludo’.




Changed bug title to ''string->uri' fails in sv_SE locale' from 'guix won't download if locale is set to swedish' Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Mon, 20 May 2019 09:15:02 GMT) Full text and rfc822 format available.

Severity set to 'important' from 'normal' Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Mon, 20 May 2019 09:17:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Mon, 27 May 2019 11:07:01 GMT) Full text and rfc822 format available.

Message #24 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Ricardo Wurmus <rekado <at> elephly.net>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 35785 <at> debbugs.gnu.org, Einar Largenius <einar.largenius <at> gmail.com>
Subject: Re: bug#35785: ‘string->uri’ is
 locale-dependent and breaks in ‘sv_SE’
Date: Mon, 27 May 2019 13:05:29 +0200
Ludovic Courtès <ludo <at> gnu.org> writes:

> Using the “lower” regexp class instead of “[a-z]” works:
>
> --8<---------------cut here---------------start------------->8---
> scheme@(guile-user)> (string-match "[[:lower:]]" "w")
> $12 = #("w" (0 . 1))
> --8<---------------cut here---------------end--------------->8---
>
> However, it’s not clear to me whether the “lower” class is supposed to
> be the same for all locales or if we’re just lucky:
>
>   http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
>
> Thoughts?

The lower class is much larger than [a-z].  If we only wanted to work
around this particular problem we could explicitly spell out the range,
which would be the same in all locales.  (Obviously, that wouldn’t be
pretty.)

But can’t URI parts contain more than those characters?  To circumvent
the question whether the lower class is locale dependent we could
generate an explicit range from a charset.

--
Ricardo





Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Mon, 27 May 2019 13:40:02 GMT) Full text and rfc822 format available.

Message #27 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Timothy Sample <samplet <at> ngyro.com>
To: Ricardo Wurmus <rekado <at> elephly.net>
Cc: 35785 <at> debbugs.gnu.org, Ludovic Courtès <ludo <at> gnu.org>,
 Einar Largenius <einar.largenius <at> gmail.com>
Subject: Re: bug#35785: ‘string->uri’ is
 locale-dependent and breaks in ‘sv_SE’
Date: Mon, 27 May 2019 09:39:03 -0400
Hello,

Ricardo Wurmus <rekado <at> elephly.net> writes:

> Ludovic Courtès <ludo <at> gnu.org> writes:
>
>> Using the “lower” regexp class instead of “[a-z]” works:
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (string-match "[[:lower:]]" "w")
>> $12 = #("w" (0 . 1))
>> --8<---------------cut here---------------end--------------->8---
>>
>> However, it’s not clear to me whether the “lower” class is supposed to
>> be the same for all locales or if we’re just lucky:
>>
>>   http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
>>
>> Thoughts?
>
> The lower class is much larger than [a-z].  If we only wanted to work
> around this particular problem we could explicitly spell out the range,
> which would be the same in all locales.  (Obviously, that wouldn’t be
> pretty.)

I think that explicitly spelling out the range is the right thing to do
here.  The POSIX spec says that character ranges work in the POSIX
locale, but “in other locales, a range expression has unspecified
behavior.”
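
For illustration, here is a minimal sketch of what "spelling out the
range" could look like in Guile.  The names 'letters', 'letter-rx' and
'ascii-lower?' are hypothetical and are not taken from '(web uri)'.  The
only point is that the bracket expression contains no "a-z" range, so
the locale's collation order should no longer matter:

```scheme
;; Hypothetical names, for illustration only.
(define letters "abcdefghijklmnopqrstuvwxyz")

;; Build the bracket expression from the explicit list instead of
;; writing "[a-z]".
(define letter-rx
  (make-regexp (string-append "^[" letters "]$")))

;; True if STR is exactly one ASCII lowercase letter.
(define (ascii-lower? str)
  (and (regexp-exec letter-rx str) #t))

(ascii-lower? "w")   ; "w" is listed explicitly, so this matches
(ascii-lower? "W")   ; uppercase is not in the list
```

The regexp is still compiled by the C library, but it is handed a plain
list of characters rather than a range expression.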

> But can’t URI parts contain more than those characters?

A quick reading of RFC 3986 suggests that the host part of a URI can be
an IP address (version 4 or 6) or a registered name.  It gives the
following rules for registered names:

reg-name      = *( unreserved / pct-encoded / sub-delims )
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
just the ASCII ranges you might expect (except for that “HEXDIG” only
allows uppercase letters).

It looks like Guile is currently a little stricter than this, but pretty
close (if you take the character ranges to mean ASCII ranges).
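
As a rough sketch only, the 'reg-name' rule quoted above could be
written as a Guile regexp with every character listed explicitly.  The
variable names echo those mentioned for the eventual patch, but the
definitions here are assumptions made for illustration; they are not the
patterns used by '(web uri)', which is stricter.  Accepting lowercase
hex digits in percent-escapes is also a choice made in this sketch
(RFC 3986 treats them as equivalent to the uppercase ones).

```scheme
;; Hypothetical definitions following the RFC 3986 grammar quoted above.
(define letters
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
(define digits "0123456789")
(define hex-digits (string-append digits "abcdefABCDEF"))
(define sub-delims "!$&'()*+,;=")

;; reg-name = *( unreserved / pct-encoded / sub-delims )
;; The "-" is placed last in the bracket expression so that it is
;; taken literally instead of forming a range.
(define reg-name-rx
  (make-regexp
   (string-append
    "^([" letters digits "._~" sub-delims "-]"
    "|%[" hex-digits "][" hex-digits "])*$")))

;; True if STR matches the reg-name rule as sketched here.
(define (reg-name? str)
  (and (regexp-exec reg-name-rx str) #t))

(reg-name? "sourceware.org")   ; only letters and "."
(reg-name? "a b")              ; a space is not allowed
```

Note that the rule uses "*", so the empty string is also accepted.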

> To circumvent
> the question whether the lower class is locale dependent we could
> generate an explicit range from a charset.

I think this is the right approach.  Using “[:lower:]” would allow
things outside of the RFC, like ‘é’.  Adding support for
internationalized domain names using Punycode would be cool, but well
outside the scope of this bug.  :)


-- Tim




Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Tue, 28 May 2019 11:18:01 GMT) Full text and rfc822 format available.

Message #30 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: Ricardo Wurmus <rekado <at> elephly.net>, 35785 <at> debbugs.gnu.org,
 Einar Largenius <einar.largenius <at> gmail.com>
Subject: Re: bug#35785: ‘string->uri’ is
 locale-dependent and breaks in ‘sv_SE’
Date: Tue, 28 May 2019 13:17:15 +0200
Hi Timothy,

Timothy Sample <samplet <at> ngyro.com> skribis:

> A quick reading of RFC 3986 suggests that the host part of a URI can be
> an IP address (version 4 or 6) or a registered name.  It gives the
> following rules for registered names:
>
> reg-name      = *( unreserved / pct-encoded / sub-delims )
> unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
> pct-encoded   = "%" HEXDIG HEXDIG
> sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>               / "*" / "+" / "," / ";" / "="
>
> Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
> just the ASCII ranges you might expect (except for that “HEXDIG” only
> allows uppercase letters).

Do you think you could turn that into a patch for Guile?  I’d happily
apply it.  :-)

It looks like both [[:alnum:]] & co. and ranges would be
locale-dependent, so my understanding is that we’ll have to list all the
characters explicitly, right?

Thanks,
Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Mon, 03 Jun 2019 00:40:01 GMT) Full text and rfc822 format available.

Message #33 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Timothy Sample <samplet <at> ngyro.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Ricardo Wurmus <rekado <at> elephly.net>, 35785 <at> debbugs.gnu.org,
 Einar Largenius <einar.largenius <at> gmail.com>
Subject: Re: bug#35785: ‘string->uri’ is
 locale-dependent and breaks in ‘sv_SE’
Date: Sun, 02 Jun 2019 20:39:16 -0400
[Message part 1 (text/plain, inline)]
Hi,

Ludovic Courtès <ludo <at> gnu.org> writes:

> Hi Timothy,
>
> Timothy Sample <samplet <at> ngyro.com> skribis:
>
>> A quick reading of RFC 3986 suggests that the host part of a URI can be
>> an IP address (version 4 or 6) or a registered name.  It gives the
>> following rules for registered names:
>>
>> reg-name      = *( unreserved / pct-encoded / sub-delims )
>> unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
>> pct-encoded   = "%" HEXDIG HEXDIG
>> sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>>               / "*" / "+" / "," / ";" / "="
>>
>> Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
>> just the ASCII ranges you might expect (except for that “HEXDIG” only
>> allows uppercase letters).
>
> Do you think you could turn that into a patch for Guile?  I’d happily
> apply it.  :-)
>
> It looks like both [[:alnum:]] & co. and ranges would be
> locale-dependent, so my understanding is that we’ll have to list all the
> characters explicitly, right?

Here’s a patch for Guile that uses explicit lists of characters in the
‘(web uri)’ module instead of character ranges.  It includes two tests
that are pretty verbose, but seem to do the trick.

I have a bit more background on the problem, mostly coming from a Glibc
bug report: <https://sourceware.org/bugzilla/show_bug.cgi?id=23393>.

It turns out that it is well-known upstream, and avoiding character
ranges is the recommended approach for now.  Some other GNU tools have
adopted what is being called the “Rational Range Interpretation”
<https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>.
AIUI, this means they use the underlying encoding numbers for ranges (I
checked the source, but I’m only mostly sure I read it right).  It looks
like the Glibc folks are unsure how to proceed on this (but are maybe
slightly leaning towards the “rational” approach).

It’s all a pretty big mess, really.  I was hoping there would be some
obvious thing that would fix the problem more generally.  Short of
pulling in the Gnulib regex code or writing something in Scheme, it
looks like Guile is stuck where it is now.

I’m unsure if the changes are considered “trivial” from a copyright
perspective.  It’s pretty close, but I think programmers tend to
underestimate here.  I’ve started the FSF copyright assignment process
either way, since this is likely not my last Guile patch.  :)


-- Tim

[0001-Make-URI-handling-locale-independent.patch (text/x-patch, attachment)]

Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Mon, 03 Jun 2019 13:03:02 GMT) Full text and rfc822 format available.

Message #36 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: Ricardo Wurmus <rekado <at> elephly.net>, 35785 <at> debbugs.gnu.org,
 Einar Largenius <einar.largenius <at> gmail.com>
Subject: Re: bug#35785: ‘string->uri’ is
 locale-dependent and breaks in ‘sv_SE’
Date: Mon, 03 Jun 2019 15:01:45 +0200
Hi Timothy,

Timothy Sample <samplet <at> ngyro.com> skribis:

> Here’s a patch for Guile that uses explicit lists of characters in the
> ‘(web uri)’ module instead of character ranges.  It includes two tests
> that are pretty verbose, but seem to do the trick.
>
> I have a bit more background on the problem, mostly coming from a Glibc
> bug report: <https://sourceware.org/bugzilla/show_bug.cgi?id=23393>.
>
> It turns out that it is well-known upstream, and avoiding character
> ranges is the recommended approach for now.  Some other GNU tools have
> adopted what is being called the “Rational Range Interpretation”
> <https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>.
> AIUI, this means they use the underlying encoding numbers for ranges (I
> checked the source, but I’m only mostly sure I read it right).  It looks
> like the Glibc folks are unsure how to proceed on this (but are maybe
> slightly leaning towards the “rational” approach).

Great that you gleaned good references on this topic!

> It’s all a pretty big mess, really.  I was hoping there would be some
> obvious thing that would fix the problem more generally.  Short of
> pulling in the Gnulib regex code or writing something in Scheme, it
> looks like Guile is stuck where it is now.

Yeah.  The alternative would be to not use regexps in this context, I
guess.
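
To make that alternative concrete, here is a hedged sketch of a check
that does not go through POSIX regexps at all.  It uses an explicit
SRFI-14 character set, which works on code points and so is not
affected by the locale.  The names 'label-chars' and 'label?' are
hypothetical, and the set of allowed characters (letters, digits and
"-") is an assumption made for illustration; it does not enforce every
rule for host labels:

```scheme
;; Hypothetical sketch: allowed label characters as an explicit
;; character set built from a plain string.
(define label-chars
  (string->char-set
   "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-"))

;; True if STR is non-empty and every character is in LABEL-CHARS.
(define (label? str)
  (and (not (string-null? str))
       (string-every label-chars str)))

(label? "sourceware")    ; every character is in the set
(label? "source ware")   ; the space is not in the set
```

A full URI parser written this way would need considerably more code
than the regexp version, which is the trade-off.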

> I’m unsure if the changes are considered “trivial” from a copyright
> perspective.  It’s pretty close, but I think programmers tend to
> underestimate here.  I’ve started the FSF copyright assignment process
> either way, since this is likely not my last Guile patch.  :)

If the process is already underway, I think it’s fine to commit this
patch (I would rather wait if it were longer and/or if we didn’t know
each other already).

> From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
> From: Timothy Sample <samplet <at> ngyro.com>
> Date: Sun, 2 Jun 2019 14:41:20 -0400
> Subject: [PATCH] Make URI handling locale independent.
>
> Fixes <https://bugs.gnu.org/35785>.
>
> * module/web/uri.scm (digits, hex-digits, letters): New variables.
> (ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
> userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
> list each character instead of using character ranges.
> * test-suite/tests/web-uri.test: Add corresponding tests.

[...]

> +  (pass-if "http://www.example.com (sv_SE)"
> +    (dynamic-wind
> +      (lambda () #t)
> +      (lambda ()
> +        (with-locale "sv_SE.utf8"
> +          (reload-module (resolve-module '(web uri)))
> +          (uri=? (string->uri "http://www.example.com")
> +                 #:scheme 'http #:host "www.example.com" #:path "")))

Aren’t ‘reload-module’ calls a leftover that can now be removed (also in
the other test)?

For the sv_SE test, what about taking a host name with a ‘w’, since
that’s the use case that allowed us to uncover this bug?

Apart from that it LGTM, thank you!

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Mon, 03 Jun 2019 14:25:01 GMT) Full text and rfc822 format available.

Message #39 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Timothy Sample <samplet <at> ngyro.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Ricardo Wurmus <rekado <at> elephly.net>, 35785 <at> debbugs.gnu.org,
 Einar Largenius <einar.largenius <at> gmail.com>
Subject: Re: bug#35785: ‘string->uri’ is
 locale-dependent and breaks in ‘sv_SE’
Date: Mon, 03 Jun 2019 10:24:40 -0400
Hi Ludo,

Ludovic Courtès <ludo <at> gnu.org> writes:

> Hi Timothy,
>
> Timothy Sample <samplet <at> ngyro.com> skribis:
>
>> Here’s a patch for Guile that uses explicit lists of characters in the
>> ‘(web uri)’ module instead of character ranges.  It includes two tests
>> that are pretty verbose, but seem to do the trick.
>>
>> I have a bit more background on the problem, mostly coming from a Glibc
>> bug report: <https://sourceware.org/bugzilla/show_bug.cgi?id=23393>.
>>
>> It turns out that it is well-known upstream, and avoiding character
>> ranges is the recommended approach for now.  Some other GNU tools have
>> adopted what is being called the “Rational Range Interpretation”
>> <https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>.
>> AIUI, this means they use the underlying encoding numbers for ranges (I
>> checked the source, but I’m only mostly sure I read it right).  It looks
>> like the Glibc folks are unsure how to proceed on this (but are maybe
>> slightly leaning towards the “rational” approach).
>
> Great that you gleaned good references on this topic!
>
>> It’s all a pretty big mess, really.  I was hoping there would be some
>> obvious thing that would fix the problem more generally.  Short of
>> pulling in the Gnulib regex code or writing something in Scheme, it
>> looks like Guile is stuck where it is now.
>
> Yeah.  The alternative would be to not use regexps in this context, I
> guess.

I meant fixing regexes in other contexts, since I’m sure the URI module
is not the only Guile code ever that assumed “[a-z]” would only match
ASCII lowercase letters.

>> I’m unsure if the changes are considered “trivial” from a copyright
>> perspective.  It’s pretty close, but I think programmers tend to
>> underestimate here.  I’ve started the FSF copyright assignment process
>> either way, since this is likely not my last Guile patch.  :)
>
> If the process is already underway, I think it’s fine to commit this
> patch (I would rather wait if it were longer and/or if we didn’t know
> each other already).

Sounds good!

>> From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
>> From: Timothy Sample <samplet <at> ngyro.com>
>> Date: Sun, 2 Jun 2019 14:41:20 -0400
>> Subject: [PATCH] Make URI handling locale independent.
>>
>> Fixes <https://bugs.gnu.org/35785>.
>>
>> * module/web/uri.scm (digits, hex-digits, letters): New variables.
>> (ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
>> userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
>> list each character instead of using character ranges.
>> * test-suite/tests/web-uri.test: Add corresponding tests.
>
> [...]
>
>> +  (pass-if "http://www.example.com (sv_SE)"
>> +    (dynamic-wind
>> +      (lambda () #t)
>> +      (lambda ()
>> +        (with-locale "sv_SE.utf8"
>> +          (reload-module (resolve-module '(web uri)))
>> +          (uri=? (string->uri "http://www.example.com")
>> +                 #:scheme 'http #:host "www.example.com" #:path "")))
>
> Aren’t ‘reload-module’ calls a leftover that can now be removed (also in
> the other test)?

I needed to reload the modules like that to make the tests fail without
the patch and pass with it.  My understanding is that the bug happens
at regex compile time, which happens when the module is loaded.  If I
don’t reload the module, the old URI code passes the tests, since the
regexes were compiled with a locale that does not trigger the bug.  It’s
a little wacky, sure, but it was the best idea I could come up with.

> For the sv_SE test, what about taking a host name with a ‘w’, since
> that’s the use case that allowed us to uncover this bug?

I thought I was being clever by using a “www” hostname, but apparently
it’s so normalized as to be invisible!  Feel free to change it to
something more obvious like “w.com” or whatever.


-- Tim




Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Tue, 04 Jun 2019 07:44:01 GMT) Full text and rfc822 format available.

Message #42 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: Ricardo Wurmus <rekado <at> elephly.net>, 35785 <at> debbugs.gnu.org,
 Einar Largenius <einar.largenius <at> gmail.com>
Subject: Re: bug#35785: ‘string->uri’ is
 locale-dependent and breaks in ‘sv_SE’
Date: Tue, 04 Jun 2019 09:42:55 +0200
Hello,

Timothy Sample <samplet <at> ngyro.com> skribis:

>>> From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
>>> From: Timothy Sample <samplet <at> ngyro.com>
>>> Date: Sun, 2 Jun 2019 14:41:20 -0400
>>> Subject: [PATCH] Make URI handling locale independent.
>>>
>>> Fixes <https://bugs.gnu.org/35785>.
>>>
>>> * module/web/uri.scm (digits, hex-digits, letters): New variables.
>>> (ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
>>> userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
>>> list each character instead of using character ranges.
>>> * test-suite/tests/web-uri.test: Add corresponding tests.
>>
>> [...]
>>
>>> +  (pass-if "http://www.example.com (sv_SE)"
>>> +    (dynamic-wind
>>> +      (lambda () #t)
>>> +      (lambda ()
>>> +        (with-locale "sv_SE.utf8"
>>> +          (reload-module (resolve-module '(web uri)))
>>> +          (uri=? (string->uri "http://www.example.com")
>>> +                 #:scheme 'http #:host "www.example.com" #:path "")))
>>
>> Aren’t ‘reload-module’ calls a leftover that can now be removed (also in
>> the other test)?
>
> I needed to reload the modules like that to make the tests fail without
> the patch and pass with it.  My understanding is that the bug happens
> at regex compile time, which happens when the module is loaded.  If I
> don’t reload the module, the old URI code passes the tests, since the
> regexes were compiled with a locale that does not trigger the bug.  It’s
> a little wacky, sure, but it was the best idea I could come up with.

Oooh, I see.  Could you add a comment to explain this?  Then we’re done.

>> For the sv_SE test, what about taking a host name with a ‘w’, since
>> that’s the use case that allowed us to uncover this bug?
>
> I thought I was being clever by using a “www” hostname, but apparently
> it’s so normalized as to be invisible!  Feel free to change it to
> something more obvious like “w.com” or whatever.

Silly me, I guess I need new glasses.  :-)

Thanks!

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#35785; Package guix. (Tue, 04 Jun 2019 13:57:02 GMT) Full text and rfc822 format available.

Message #45 received at 35785 <at> debbugs.gnu.org (full text, mbox):

From: Timothy Sample <samplet <at> ngyro.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Ricardo Wurmus <rekado <at> elephly.net>, 35785 <at> debbugs.gnu.org,
 Einar Largenius <einar.largenius <at> gmail.com>
Subject: Re: bug#35785: ‘string->uri’ is
 locale-dependent and breaks in ‘sv_SE’
Date: Tue, 04 Jun 2019 09:56:39 -0400
[Message part 1 (text/plain, inline)]
Hi,

Ludovic Courtès <ludo <at> gnu.org> writes:

> Timothy Sample <samplet <at> ngyro.com> skribis:
>
> [...]
>
>> I needed to reload the modules like that to make the tests fail without
>> the patch and pass with it.  My understanding is that the bug happens
>> at regex compile time, which happens when the module is loaded.  If I
>> don’t reload the module, the old URI code passes the tests, since the
>> regexes were compiled with a locale that does not trigger the bug.  It’s
>> a little wacky, sure, but it was the best idea I could come up with.
>
> Oooh, I see.  Could you add a comment to explain this?  Then we’re done.

Here it is!  I hope it is clear.


-- Tim

[0001-Make-URI-handling-locale-independent.patch (text/x-patch, attachment)]

bug reassigned from package 'guix' to 'guile'. Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Tue, 04 Jun 2019 19:24:01 GMT) Full text and rfc822 format available.

Reply sent to Ludovic Courtès <ludo <at> gnu.org>:
You have taken responsibility. (Tue, 04 Jun 2019 19:27:01 GMT) Full text and rfc822 format available.

Notification sent to Einar Largenius <einar.largenius <at> gmail.com>:
bug acknowledged by developer. (Tue, 04 Jun 2019 19:27:02 GMT) Full text and rfc822 format available.

Message #52 received at 35785-done <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: Ricardo Wurmus <rekado <at> elephly.net>, 35785-done <at> debbugs.gnu.org,
 Einar Largenius <einar.largenius <at> gmail.com>
Subject: Re: bug#35785: ‘string->uri’ is
 locale-dependent and breaks in ‘sv_SE’
Date: Tue, 04 Jun 2019 21:26:25 +0200
Hi!

Timothy Sample <samplet <at> ngyro.com> skribis:

> From 9ac8643e5315d4baaddb93ee246ba8db0b3448ab Mon Sep 17 00:00:00 2001
> From: Timothy Sample <samplet <at> ngyro.com>
> Date: Sun, 2 Jun 2019 14:41:20 -0400
> Subject: [PATCH] Make URI handling locale independent.
>
> Fixes <https://bugs.gnu.org/35785>.
>
> * module/web/uri.scm (digits, hex-digits, letters): New variables.
> (ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
> userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
> list each character instead of using character ranges.
> * test-suite/tests/web-uri.test: Add corresponding tests.

Perfect; pushed to the ‘stable-2.2’ branch as
420c2632bb1f48e492a035c1d216f209734f45e6.

We got a notification from the FSF that they received your copyright
assignment request too, so everything is on track.

Thank you!

Ludo’.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 03 Jul 2019 11:24:06 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 270 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.