GNU bug report logs - #38235
string-foldcase bug for trailing sigma

Package: guile;

Reported by: Andy Wingo <wingo <at> pobox.com>

Date: Sat, 16 Nov 2019 20:42:02 UTC

Severity: normal

To reply to this bug, email your comments to 38235 AT debbugs.gnu.org.


Report forwarded to bug-guile <at> gnu.org:
bug#38235; Package guile. (Sat, 16 Nov 2019 20:42:02 GMT)

Acknowledgement sent to Andy Wingo <wingo <at> pobox.com>:
New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Sat, 16 Nov 2019 20:42:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Andy Wingo <wingo <at> pobox.com>
To: bug-guile <at> gnu.org
Subject: string-foldcase bug for trailing sigma
Date: Sat, 16 Nov 2019 21:41:05 +0100
Given the following example, using (rnrs unicode):

  (string-foldcase "ΜΈΛΟΣ")

The expected result is "μέλοσ"; see R6RS libraries section 1.2.  However
instead Guile's result is "μέλος".  Note that although Σ usually
downcases to σ, at the end of a string it's ς.  This test shows a
limitation of defining string-foldcase as simply (string-downcase
(string-upcase str)).
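
The difference between the two results can be reproduced outside Guile. The following is a minimal sketch in Python, used only as an illustration and not as a description of Guile's implementation. It assumes that Python's `str.lower` applies the Unicode final-sigma rule and that `str.casefold` applies Unicode case folding; the variable names are made up for this example.

```python
# Illustration only: Python string methods stand in for the Scheme procedures.
word = "ΜΈΛΟΣ"

# Upcase, then downcase: the trailing Σ is lowered in context and becomes ς.
round_trip = word.upper().lower()

# Case folding is context-free: Σ folds to σ wherever it appears.
folded = word.casefold()

print(round_trip)  # last letter is ς (U+03C2)
print(folded)      # last letter is σ (U+03C3)
```

The two strings differ only in their last character, which is exactly the discrepancy reported above.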




Information forwarded to bug-guile <at> gnu.org:
bug#38235; Package guile. (Sun, 17 Nov 2019 11:20:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: <tomas <at> tuxteam.de>
To: bug-guile <at> gnu.org
Subject: Re: bug#38235: string-foldcase bug for trailing sigma
Date: Sun, 17 Nov 2019 12:19:18 +0100
On Sat, Nov 16, 2019 at 09:41:05PM +0100, Andy Wingo wrote:
> Given the following example, using (rnrs unicode):
> 
>   (string-foldcase "ΜΈΛΟΣ")

Good catch. I think there's even a worse example: dotless
and dotted I [1]. Here it seems even impossible to do
up- and downcase correctly without knowing the language
context.

Cheers
[1] https://en.wikipedia.org/wiki/%C4%B0
-- tomás

Information forwarded to bug-guile <at> gnu.org:
bug#38235; Package guile. (Sun, 17 Nov 2019 18:14:02 GMT)

Message #11 received at 38235 <at> debbugs.gnu.org:

From: John Cowan <cowan <at> ccil.org>
To: Andy Wingo <wingo <at> pobox.com>, tomas <at> tuxteam.de
Cc: 38235 <at> debbugs.gnu.org
Subject: Re: bug#38235: string-foldcase bug for trailing sigma
Date: Sun, 17 Nov 2019 13:13:42 -0500
On Sat, Nov 16, 2019 at 3:42 PM Andy Wingo <wingo <at> pobox.com> wrote:


> The expected result is "μέλοσ"; see R6RS libraries section 1.2.  However
> instead Guile's result is "μέλος".  Note that although Σ usually
> downcases to σ, at the end of a string it's ς.


More precisely, it downcases to σ if a letter follows and to ς if not
(being at the end of a string is a particular case).  However, this is not
actually always Greekly correct:  the string "ΦΙΛΟΣ." with a period at the
end downcases to "φιλος." if it is the word φίλος 'friend' (without its
proper accent) at the end of a sentence, but to "φιλοσ." if it is an
abbreviation for φιλοσοφία 'philosophy'.  For this reason, R7RS does not
require mapping to ς in this situation as R6RS does.
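
The contextual rule can be observed with a short Python sketch. This is an illustration under the assumption that Python's `str.lower` implements the Unicode final-sigma condition; it says nothing about how Guile behaves, and the variable names are hypothetical.

```python
# Σ followed by another letter lowers to the medial form σ.
medial = "ΦΙΛΟΣΟΦΙΑ".lower()

# Σ followed only by a period is treated as word-final and lowers to ς.
# The algorithm cannot know whether "ΦΙΛΟΣ." is a full word or an
# abbreviation, so the abbreviation reading (with σ) is never produced.
before_period = "ΦΙΛΟΣ.".lower()

print(medial)
print(before_period)
```

A purely mechanical rule therefore always picks the "end of word" reading, which is the ambiguity described above.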

> This test shows a
> limitation of defining string-foldcase as simply (string-downcase
> (string-upcase str)).

As explained in Unicode section 5.18, the foldcase mappings (in
<https://www.unicode.org/Public/UNIDATA/CaseFolding.txt>, the lines with
status C and F) actually create a set of equivalence classes that are
closed under {upper,lower,title}case mapping, and then choose a single
character to represent each class.  This is usually the unique lowercase
character, but not always: in Cherokee it is the uppercase character, and
in the set {Σ, σ, ς} it is σ.
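
The equivalence classes can be made visible by grouping characters under their folded form. The sketch below is in Python purely for illustration; `fold_classes` is a hypothetical helper, and the example assumes a Python whose Unicode database includes the Cherokee lowercase letters (Unicode 8.0 or later).

```python
from collections import defaultdict


def fold_classes(chars):
    """Group characters by the representative that case folding picks."""
    classes = defaultdict(set)
    for ch in chars:
        classes[ch.casefold()].add(ch)
    return dict(classes)


# All three sigmas share one class, represented by σ (U+03C3).
sigma = fold_classes("Σσς")

# U+13A0 CHEROKEE LETTER A and U+AB70 CHEROKEE SMALL LETTER A share a
# class whose representative is the uppercase letter.
cherokee = fold_classes("\u13a0\uab70")

print(sigma)
print(cherokee)
```

Because every member of a class folds to the same representative, comparing folded strings gives a caseless comparison that does not depend on position in the string.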

On Sun, Nov 17, 2019 at 6:20 AM <tomas <at> tuxteam.de> wrote:

> Good catch. I think there's even a worse example: dotless
> and dotted I [1]. Here it seems even impossible to do
> up- and downcase correctly without knowing the language
> context.

Language-specific case mappings are explicitly out of Scheme's remit: they
have to be performed by specialized libraries.  There is an additional
situation in Lithuanian dictionaries (but not running text): an "i" with a
tone accent is represented as "i" + dot above + accent, like this:  "i̇́".
However, this dot above must be dropped when uppercasing, producing
ordinary "Í".
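
What a language-independent mapping does with these cases can be sketched in Python, whose built-in case methods take no locale. This is an illustration of default Unicode behaviour only, not of Guile or of any Scheme library, and the variable names are made up for the example.

```python
dotted_capital = "\u0130"     # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"      # ı, LATIN SMALL LETTER DOTLESS I
lithuanian = "i\u0307\u0301"  # i + COMBINING DOT ABOVE + COMBINING ACUTE ACCENT

# Without language context, both "i" and "ı" uppercase to plain "I",
# so the Turkish dotted/dotless distinction is lost.
upper_i = "i".upper()
upper_dotless = dotless_small.upper()

# İ lowercases to "i" followed by COMBINING DOT ABOVE (two code points).
lower_dotted = dotted_capital.lower()

# The default mapping keeps the dot above; dropping it to obtain "Í"
# would need Lithuanian-specific rules.
upper_lithuanian = lithuanian.upper()

print(upper_i, upper_dotless, len(lower_dotted), len(upper_lithuanian))
```

Getting the Turkish or Lithuanian results requires tailored mappings, which is the job of the specialized libraries mentioned above.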

This bug report was last modified 4 years and 160 days ago.
