GNU bug report logs - #54111
guile bundles (a compiled version of) UnicodeData.txt and binaries

Package: guix

Reported by: Maxime Devos <maximedevos <at> telenet.be>

Date: Tue, 22 Feb 2022 16:43:01 UTC

Severity: minor

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 54111 in the body.
You can then email your comments to 54111 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-guix <at> gnu.org:
bug#54111; Package guix. (Tue, 22 Feb 2022 16:43:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Maxime Devos <maximedevos <at> telenet.be>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Tue, 22 Feb 2022 16:43:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Maxime Devos <maximedevos <at> telenet.be>
To: bug-guix <at> gnu.org
Subject: guile bundles (a compiled version of) UnicodeData.txt and binaries
Date: Tue, 22 Feb 2022 17:42:10 +0100

[Message part 1 (text/plain, inline)]

Hi guix,

Looking at <https://git.savannah.gnu.org/cgit/guile.git/commit/?id=2f9bc7fe61d39658a24a15526b7b88bbd184961b>,
I noticed that Guile bundles a binary variant of UnicodeData.txt in
srfi-14.i.c.  Seems like it should be compiled with
the 'unidata_to_charset.pl' script instead (assuming that there are no
bootstrapping concerns).

Greetings,
Maxime.

[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#54111; Package guix. (Sun, 27 Feb 2022 13:53:01 GMT) Full text and rfc822 format available.

Message #8 received at 54111 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Maxime Devos <maximedevos <at> telenet.be>
Cc: 54111 <at> debbugs.gnu.org
Subject: Re: bug#54111: guile bundles (a compiled version of)
 UnicodeData.txt and binaries
Date: Sun, 27 Feb 2022 14:52:47 +0100

Hi,

Maxime Devos <maximedevos <at> telenet.be> wrote:

> Looking at <https://git.savannah.gnu.org/cgit/guile.git/commit/?id=2f9bc7fe61d39658a24a15526b7b88bbd184961b>,
> I noticed that Guile bundles a binary variant of UnicodeData.txt in
> srfi-14.i.c.  Seems like it should be compiled with
> the 'unidata_to_charset.pl' script instead (assuming that there are no
> bootstrapping concerns).

It would add a dependency on Perl, which is not great (I’m not sure
whether it complicates bootstrapping since Perl is already present early
on, but it’s safer to avoid it.)

We could rewrite ‘unidata_to_charset.pl’ in Scheme, but then Guile would
still need to provide a pre-compiled version of srfi-14.i.c for
bootstrapping purposes.  Or we could rewrite it in Awk, since Guile
already depends on Awk anyway.

Thoughts?

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#54111; Package guix. (Sun, 27 Feb 2022 19:46:02 GMT) Full text and rfc822 format available.

Message #11 received at 54111 <at> debbugs.gnu.org (full text, mbox):

From: Maxime Devos <maximedevos <at> telenet.be>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 54111 <at> debbugs.gnu.org
Subject: Re: bug#54111: guile bundles (a compiled version of)
 UnicodeData.txt and binaries
Date: Sun, 27 Feb 2022 20:45:50 +0100

[Message part 1 (text/plain, inline)]

Ludovic Courtès wrote on Sun 27-02-2022 at 14:52 [+0100]:
> It would add a dependency on Perl, which is not great (I’m not sure
> whether it complicates bootstrapping since Perl is already present early
> on, but it’s safer to avoid it.)
> 
> We could rewrite ‘unidata_to_charset.pl’ in Scheme, but then Guile would
> still need to provide a pre-compiled version of srfi-14.i.c for
> bootstrapping purposes.  Or we could rewrite it in Awk, since Guile
> already depends on Awk anyway.
> 
> Thoughts?

The ‘blob’ seems relatively harmless to the compilation process, so
if there are bootstrapping problems, I think we can leave it in.

However, all this Unicode data is important for some other things (e.g. some
DNS and filesystem things).  So it would be nice to validate that no
attacker with access to the Guile repo stealthily introduced some wrong
information during an otherwise routine update of the Unicode
information.

Hence, the following proposal:

  * Make perl an optional dependency of Guile (upstream) and add an
    '--with-unicode-data=[...]' configure flag or something like that.

    If perl is detected by './configure' and '--with-unicode-data=...'
    is set, then let one of the makefiles run 'unidata_to_charset.pl'
    and compare the 'new' srfi-14.i.c against the old srfi-14.i.c.

    In case of a mismatch, bail out.

    When there's no perl or --with-unicode-data, then just use the
    bundled srfi-14.i.c.

  * Add 'perl' (or 'perl-boot0' because that perl is probably good
    enough?) to the native-inputs of guile.

Actually, the second is already done in 'guile-final'.
Optionally, this can be combined with rewriting it in Scheme
or some other language.

Greetings,
Maxime.

[signature.asc (application/pgp-signature, inline)]
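
A minimal sketch of the "regenerate and compare" check proposed in the
message above, written in Python purely for illustration.  The function
names and the idea of passing two file paths are assumptions made for
this note; nothing like this exists in Guile's build system, and the
thread itself proposes doing the comparison from a makefile.

```python
import difflib
from pathlib import Path
from typing import List


def check_generated(bundled_text: str, regenerated_text: str) -> List[str]:
    """Return a unified diff; an empty list means the two texts match."""
    return list(
        difflib.unified_diff(
            bundled_text.splitlines(),
            regenerated_text.splitlines(),
            fromfile="srfi-14.i.c (bundled)",
            tofile="srfi-14.i.c (regenerated)",
            lineterm="",
        )
    )


def verify(bundled: Path, regenerated: Path) -> None:
    """Bail out, as the proposal suggests, when the two files differ."""
    diff = check_generated(bundled.read_text(), regenerated.read_text())
    if diff:
        raise SystemExit("srfi-14.i.c mismatch:\n" + "\n".join(diff))
```

Reporting the diff rather than a bare "mismatch" would let a reviewer see
which character ranges changed during a Unicode update.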

Information forwarded to bug-guix <at> gnu.org:
bug#54111; Package guix. (Sun, 27 Feb 2022 19:53:02 GMT) Full text and rfc822 format available.

Message #14 received at 54111 <at> debbugs.gnu.org (full text, mbox):

From: Maxime Devos <maximedevos <at> telenet.be>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 54111 <at> debbugs.gnu.org
Subject: Re: bug#54111: guile bundles (a compiled version of)
 UnicodeData.txt and binaries
Date: Sun, 27 Feb 2022 20:52:38 +0100

[Message part 1 (text/plain, inline)]

Maxime Devos wrote on Sun 27-02-2022 at 20:45 [+0100]:
>   * Add 'perl' (or 'perl-boot0' because that perl is probably good
>     enough?) to the native-inputs of guile.
> 
> Actually, the second is already done in 'guile-final'.

Maybe this being done in 'guile-final' and 'guile-3.0-latest' is
sufficient?  Exactly which package does the verification doesn't seem
important, as long as some package does it.

Greetings,
Maxime.

[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#54111; Package guix. (Sun, 27 Feb 2022 23:08:02 GMT) Full text and rfc822 format available.

Message #17 received at 54111 <at> debbugs.gnu.org (full text, mbox):

From: Bengt Richter <bokr <at> bokr.com>
To: Maxime Devos <maximedevos <at> telenet.be>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 54111 <at> debbugs.gnu.org
Subject: Re: bug#54111: guile bundles (a compiled version of) UnicodeData.txt
 and binaries
Date: Mon, 28 Feb 2022 00:07:26 +0100

Hi guix,

On +2022-02-27 20:52:38 +0100, Maxime Devos wrote:
> Maxime Devos wrote on Sun 27-02-2022 at 20:45 [+0100]:
> >   * Add 'perl' (or 'perl-boot0' because that perl is probably good
> >     enough?) to the native-inputs of guile.
> > 
> > Actually, the second is already done in 'guile-final'.
> 
> Maybe this being done in 'guile-final' and 'guile-3.0-latest' is
> sufficient?  Exactly which package does the verification doesn't seem
> important, as long as some package does it.
> 
> Greetings,
> Maxime.

I'm wondering how many lines of perl code
actually would have to be translated to guile
to eliminate this perl dependency.

Does the perl code upstream get changed
too often to make keeping up an acceptable chore?

(I guess I'm assuming the code is like one screenful
with a hot loop accessing a bunch of static tables.
I haven't chased it :)

-- 
Regards,
Bengt Richter

Information forwarded to bug-guix <at> gnu.org:
bug#54111; Package guix. (Mon, 28 Feb 2022 11:46:01 GMT) Full text and rfc822 format available.

Message #20 received at 54111 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Maxime Devos <maximedevos <at> telenet.be>
Cc: 54111 <at> debbugs.gnu.org
Subject: Re: bug#54111: guile bundles (a compiled version of)
 UnicodeData.txt and binaries
Date: Mon, 28 Feb 2022 12:45:45 +0100

Hi,

Maxime Devos <maximedevos <at> telenet.be> wrote:

> Ludovic Courtès wrote on Sun 27-02-2022 at 14:52 [+0100]:

[...]

>> We could rewrite ‘unidata_to_charset.pl’ in Scheme, but then Guile would
>> still need to provide a pre-compiled version of srfi-14.i.c for
>> bootstrapping purposes.  Or we could rewrite it in Awk, since Guile
>> already depends on Awk anyway.
>> 
>> Thoughts?
>
> The ‘blob’ seems relatively harmless to the compilation process, so
> if there are bootstrapping problems, I think we can leave it in.
>
> However, all this Unicode data is important for some other things (e.g. some
> DNS and filesystem things).  So it would be nice to validate that no
> attacker with access to the Guile repo stealthily introduced some wrong
> information during an otherwise routine update of the Unicode
> information.

The threat model is that the repository is trusted (that’s a strong
assumption, but that’s how it is).  You cannot protect against someone
with access to the repository.

We could use ‘guix git authenticate’ to improve on that.

> Hence, the following proposal:
>
>   * Make perl an optional dependency of Guile (upstream) and add an
>     '--with-unicode-data=[...]' configure flag or something like that.
>
>     If perl is detected by './configure' and '--with-unicode-data=...'
>     is set, then let one of the makefiles run 'unidata_to_charset.pl'
>     and compare the 'new' srfi-14.i.c against the old srfi-14.i.c.
>
>     In case of a mismatch, bail out.
>
>     When there's no perl or --with-unicode-data, then just use the
>     bundled srfi-14.i.c.
>
>   * Add 'perl' (or 'perl-boot0' because that perl is probably good
>     enough?) to the native-inputs of guile.
>
> Actually, the second is already done in 'guile-final'.
> Optionally, this can be combined with rewriting it in Scheme
> or some other language.

It might be easier to rewrite it in Awk and build srfi-14.i.c
unconditionally, no?

We can also add ‘--with-unicode-data’, though that’s orthogonal.

Thanks,
Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#54111; Package guix. (Mon, 28 Feb 2022 17:47:02 GMT) Full text and rfc822 format available.

Message #23 received at 54111 <at> debbugs.gnu.org (full text, mbox):

From: Maxime Devos <maximedevos <at> telenet.be>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 54111 <at> debbugs.gnu.org
Subject: Re: bug#54111: guile bundles (a compiled version of)
 UnicodeData.txt and binaries
Date: Mon, 28 Feb 2022 18:46:20 +0100

[Message part 1 (text/plain, inline)]

Ludovic Courtès wrote on Mon 28-02-2022 at 12:45 [+0100]:
> It might be easier to rewrite it in Awk and build srfi-14.i.c
> unconditionally, no?

I don't know any Awk and it seems to be quite different from languages
I know, so for me doing that isn't easier.  But for someone who knows
some Awk, sure!

Greetings,
Maxime.

[signature.asc (application/pgp-signature, inline)]

Severity set to 'minor' from 'normal' Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Mon, 07 Mar 2022 08:47:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#54111; Package guix. (Mon, 14 Mar 2022 18:28:02 GMT) Full text and rfc822 format available.

Message #28 received at 54111 <at> debbugs.gnu.org (full text, mbox):

From: Timothy Sample <samplet <at> ngyro.com>
To: Maxime Devos <maximedevos <at> telenet.be>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 54111 <at> debbugs.gnu.org
Subject: Re: bug#54111: guile bundles (a compiled version of)
 UnicodeData.txt and binaries
Date: Mon, 14 Mar 2022 12:27:14 -0600

[Message part 1 (text/plain, inline)]

Hi Maxime,

Maxime Devos <maximedevos <at> telenet.be> writes:

> Ludovic Courtès wrote on Mon 28-02-2022 at 12:45 [+0100]:
>
>> It might be easier to rewrite it in Awk and build srfi-14.i.c
>> unconditionally, no?
>
> I don't know any Awk and it seems to be quite different from languages
> I know, so for me doing that isn't easier.  But for someone who knows
> some Awk, sure!

Well, I don’t consider myself an Awk person, but I had to implement it
for Gash-Utils, so I know it well enough!  This may not be the most
idiomatic Awk program, but to my eyes it is no less readable than the
Perl version.

Note that this Awk script needs to be invoked using something like:

    $ awk -f unidata_to_charset.awk < UnicodeData.txt > srfi-14.i.c

That is, the Perl version had the file names hard-coded, but the Awk
version reads from stdin and writes to stdout.  Also, the Awk version
does not shell out to 'indent' to post-process the file.  That was
basically a no-op in the Perl version, so I removed it.

There are a few differences in how the script is structured, and I had
to convert all the hex literals to decimal, but the logical behaviour
should be exactly the same.  I preserved all the comments and predicates
exactly from the Perl version.  There are probably some differences in
error handling, but the input data is so simple that it shouldn’t
matter.

It runs with “gawk --posix”.  If I run “gawk --lint”, I get warnings,
but I’m pretty sure they are spurious (they may even be Gawk bugs, but I
would have to double check the relevant specs and docs).  If the lint
warnings are a problem, you can append the empty string to the argument
of the ‘hex’ function to make them go away.  Also, (as a bonus) as of
commit 62c56f9 the Gash-Utils version of Awk can run this script!  :)

Of course, to use this script as part of the Guile build, someone™ will
have to double check that we can legally redistribute the Unicode data
file (probably okay, but always good to check), and update the build
rules to generate the C file.  I can’t guarantee that I’ll get to it....

-- Tim
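
For readers who do not know Awk, the hex-to-decimal conversion mentioned
in the message above can be sketched in Python.  This mirrors the
digit-by-digit approach of the script's 'hex' function (upper-case digits
only, errors on empty or invalid input); the name 'parse_hex' is made up
for this note.

```python
HEX_DIGITS = "0123456789ABCDEF"


def parse_hex(s: str) -> int:
    """Parse an upper-case hexadecimal string one digit at a time."""
    if not s:
        raise ValueError("cannot parse empty string as hexadecimal")
    result = 0
    for ch in s:
        value = HEX_DIGITS.find(ch)  # -1 when CH is not a hex digit
        if value < 0:
            raise ValueError(f"invalid hexadecimal character: {ch}")
        result = result * 16 + value
    return result
```

It also shows where the decimal constants in the predicates come from:
U+2000 is 8192 and U+2FFF is 12287.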

[unidata_to_charset.awk (text/plain, inline)]

# unidata_to_charset.awk --- Compute SRFI-14 charsets from UnicodeData.txt
#
# Copyright (C) 2009, 2010, 2022 Free Software Foundation, Inc.
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 3 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

# Utilities
###########

# Print MESSAGE to standard error, and exit with STATUS.
function die(status, message) {
    print "unidata_to_charset.awk:", message | "cat 1>&2";
    exit_status = status;
    exit exit_status;
}

# Parse the string S as a hexadecimal number.  Note that R, C, B, and I
# are local variables that need not be set by callers (declaring I here
# keeps the loop counter from clobbering the global I used elsewhere).
# Some Awk implementations (e.g. Gawk) have an 'strtonum' function that
# we could use, but it is not part of POSIX.
function hex(s, r, c, b, i) {
    if (length(s) == 0) {
        die(1, "Cannot parse empty string as hexadecimal.");
    }
    r = 0;
    for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1);
        b = 0;
        if      (c == "0") { b =  0; }
        else if (c == "1") { b =  1; }
        else if (c == "2") { b =  2; }
        else if (c == "3") { b =  3; }
        else if (c == "4") { b =  4; }
        else if (c == "5") { b =  5; }
        else if (c == "6") { b =  6; }
        else if (c == "7") { b =  7; }
        else if (c == "8") { b =  8; }
        else if (c == "9") { b =  9; }
        else if (c == "A") { b = 10; }
        else if (c == "B") { b = 11; }
        else if (c == "C") { b = 12; }
        else if (c == "D") { b = 13; }
        else if (c == "E") { b = 14; }
        else if (c == "F") { b = 15; }
        else { die(1, "Invalid hexadecimal character: " c); }
        r *= 16;
        r += b;
    }
    return r;
}

# Program initialization
########################

BEGIN {
    # The columns are separated by semicolons.
    FS = ";";

    # This will help us handle errors.
    exit_status = 0;

    # List of charsets.
    all_charsets_count = 0;
    all_charsets[all_charsets_count++] = "lower_case";
    all_charsets[all_charsets_count++] = "upper_case";
    all_charsets[all_charsets_count++] = "title_case";
    all_charsets[all_charsets_count++] = "letter";
    all_charsets[all_charsets_count++] = "digit";
    all_charsets[all_charsets_count++] = "hex_digit";
    all_charsets[all_charsets_count++] = "letter_plus_digit";
    all_charsets[all_charsets_count++] = "graphic";
    all_charsets[all_charsets_count++] = "whitespace";
    all_charsets[all_charsets_count++] = "printing";
    all_charsets[all_charsets_count++] = "iso_control";
    all_charsets[all_charsets_count++] = "punctuation";
    all_charsets[all_charsets_count++] = "symbol";
    all_charsets[all_charsets_count++] = "blank";
    all_charsets[all_charsets_count++] = "ascii";
    all_charsets[all_charsets_count++] = "empty";
    all_charsets[all_charsets_count++] = "designated";

    # Initialize charset state table.
    for (i in all_charsets) {
        cs = all_charsets[i];
        state[cs, "start"] = -1;
        state[cs, "end"] = -1;
        state[cs, "count"] = 0;
    }
}

# Record initialization
#######################

# In this block we give names to each field, and do some basic
# initialization.
{
    codepoint = hex($1);
    name = $2;
    category = $3;
    uppercase = $13;
    lowercase = $14;

    codepoint_end = codepoint;
    charset_index = 0;
    for (i in charsets) {
        delete charsets[i];
    }
}

# Some pairs of lines in UnicodeData.txt delimit ranges of
# characters.
name ~ /First>$/ {
    getline;
    last_name = name;
    sub(/First>$/, "Last>", last_name);
    if (last_name != $2) {
        die(1, "Invalid range in Unicode data.");
    }
    codepoint_end = hex($1);
}

# Character set predicates
##########################

## The lower_case character set
###############################

# For Unicode, we follow Java's specification: a character is
# lowercase if
#    * it is not in the range [U+2000,U+2FFF] ([8192,12287]), and
#    * the Unicode attribute table does not give a lowercase mapping
#      for it, and
#    * at least one of the following is true:
#          o the Unicode attribute table gives a mapping to uppercase
#            for the character, or
#          o the name for the character in the Unicode attribute table
#            contains the words "SMALL LETTER" or "SMALL LIGATURE".

(codepoint < 8192 || codepoint > 12287) &&
lowercase == "" &&
(uppercase != "" || name ~ /(SMALL LETTER|SMALL LIGATURE)/) {
    charsets[charset_index++] = "lower_case";
}

## The upper_case character set
###############################

# For Unicode, we follow Java's specification: a character is
# uppercase if
#    * it is not in the range [U+2000,U+2FFF] ([8192,12287]), and
#    * the Unicode attribute table does not give an uppercase mapping
#      for it (this excludes titlecase characters), and
#    * at least one of the following is true:
#          o the Unicode attribute table gives a mapping to lowercase
#            for the character, or
#          o the name for the character in the Unicode attribute table
#            contains the words "CAPITAL LETTER" or "CAPITAL LIGATURE".

(codepoint < 8192 || codepoint > 12287) &&
uppercase == "" &&
(lowercase != "" || name ~ /(CAPITAL LETTER|CAPITAL LIGATURE)/) {
    charsets[charset_index++] = "upper_case";
}

## The title_case character set
###############################

# A character is titlecase if it has the category Lt in the character
# attribute database.

category == "Lt" {
    charsets[charset_index++] = "title_case";
}

## The letter character set
###########################

# A letter is any character with one of the letter categories (Lu, Ll,
# Lt, Lm, Lo) in the Unicode character database.

category == "Lu" ||
category == "Ll" ||
category == "Lt" ||
category == "Lm" ||
category == "Lo" {
    charsets[charset_index++] = "letter";
    charsets[charset_index++] = "letter_plus_digit";
}

## The digit character set
##########################

# A character is a digit if it has the category Nd in the character
# attribute database. In Latin-1 and ASCII, the only such characters
# are 0123456789. In Unicode, there are other digit characters in
# other code blocks, such as Gujarati digits and Tibetan digits.

category == "Nd" {
    charsets[charset_index++] = "digit";
    charsets[charset_index++] = "letter_plus_digit";
}

## The hex_digit character set
##############################

# The only hex digits are 0123456789abcdefABCDEF.

(codepoint >= 48 && codepoint <= 57) ||
(codepoint >= 65 && codepoint <= 70) ||
(codepoint >= 97 && codepoint <= 102) {
    charsets[charset_index++] = "hex_digit";
}

## The graphic character set
############################

# Characters that would 'use ink' when printed

category ~ /L|M|N|P|S/ {
    charsets[charset_index++] = "graphic";
    charsets[charset_index++] = "printing";
}

## The whitespace character set
###############################

# A whitespace character is either
#    * a character with one of the space, line, or paragraph separator
#      categories (Zs, Zl or Zp) of the Unicode character database.
#    * U+0009 (09) Horizontal tabulation (\t control-I)
#    * U+000A (10) Line feed (\n control-J)
#    * U+000B (11) Vertical tabulation (\v control-K)
#    * U+000C (12) Form feed (\f control-L)
#    * U+000D (13) Carriage return (\r control-M)

category ~ /Zs|Zl|Zp/ ||
(codepoint >= 9 && codepoint <= 13) {
    charsets[charset_index++] = "whitespace";
    charsets[charset_index++] = "printing";
}

## The iso_control character set
################################

# The ISO control characters are the Unicode/Latin-1 characters in the
# ranges [U+0000,U+001F] ([0,31]) and [U+007F,U+009F] ([127,159]).

(codepoint >= 0 && codepoint <= 31) ||
(codepoint >= 127 && codepoint <= 159) {
    charsets[charset_index++] = "iso_control";
}

## The punctuation character set
################################

# A punctuation character is any character that has one of the
# punctuation categories in the Unicode character database (Pc, Pd,
# Ps, Pe, Pi, Pf, or Po.)

# Note that srfi-14 gives conflicting requirements!!  It claims that
# only the Unicode punctuation is necessary, but explicitly calls out
# the soft hyphen character (U+00AD) as punctuation.  Current versions
# of Unicode consider U+00AD to be a formatting character, not
# punctuation.

category ~ /P/ {
    charsets[charset_index++] = "punctuation";
}

## The symbol character set
###########################

# A symbol is any character that has one of the symbol categories in
# the Unicode character database (Sm, Sc, Sk, or So).

category ~ /S/ {
    charsets[charset_index++] = "symbol";
}

## The blank character set
##########################

# Blank chars are horizontal whitespace.  A blank character is either
#    * a character with the space separator category (Zs) in the
#      Unicode character database.
#    * U+0009 (9) Horizontal tabulation (\t control-I)

category ~ /Zs/ || codepoint == 9 {
    charsets[charset_index++] = "blank";
}

## The ascii character set
##########################

codepoint <= 127 {
    charsets[charset_index++] = "ascii";
}

## The designated character set
###############################

category !~ /Cs/ {
    charsets[charset_index++] = "designated";
}

## Other character sets
#######################

# Note that the "letter_plus_digit" and "printing" character sets, which
# are unions of other character sets, are included in the patterns
# matching their constituent parts (i.e., the "letter_plus_digit"
# character set is included as part of the "letter" and "digit"
# patterns).
#
# Also, the "empty" character set is computed by doing precisely nothing!

# Keeping track of state
########################

# Update the state for each charset.
{
    for (i in charsets) {
        cs = charsets[i];
        if (state[cs, "start"] == -1) {
            state[cs, "start"] = codepoint;
            state[cs, "end"] = codepoint_end;
        } else if (state[cs, "end"] + 1 == codepoint) {
            state[cs, "end"] = codepoint_end;
        } else {
            count = state[cs, "count"];
            state[cs, "count"]++;
            state[cs, "ranges", count, 0] = state[cs, "start"];
            state[cs, "ranges", count, 1] = state[cs, "end"];
            state[cs, "start"] = codepoint;
            state[cs, "end"] = codepoint_end;
        }
    }
}

# Printing and error handling
#############################

END {
    # Normally, an exit statement runs all the 'END' blocks before
    # actually exiting.  We use the 'exit_status' variable to short
    # circuit the rest of the 'END' block by reissuing the exit
    # statement.
    if (exit_status != 0) {
        exit exit_status;
    }

    # Write a bit of a header.
    print("/* srfi-14.i.c -- standard SRFI-14 character set data */");
    print("");
    print("/* This file is #include'd by srfi-14.c.  */");
    print("");
    print("/* This file was generated from");
    print("   http://unicode.org/Public/UNIDATA/UnicodeData.txt");
    print("   with the unidata_to_charset.awk script.  */");
    print("");

    for (i = 0; i < all_charsets_count; i++) {
        cs = all_charsets[i];

        # Extra logic to ensure that the last range is included.
        if (state[cs, "start"] != -1) {
            count = state[cs, "count"];
            state[cs, "count"]++;
            state[cs, "ranges", count, 0] = state[cs, "start"];
            state[cs, "ranges", count, 1] = state[cs, "end"];
        }

        count = state[cs, "count"];

        print("static const scm_t_char_range cs_" cs "_ranges[] = {");
        for (j = 0; j < count; j++) {
            rstart = state[cs, "ranges", j, 0];
            rend = state[cs, "ranges", j, 1];
            if (j + 1 < count) {
                printf("  {0x%04x, 0x%04x},\n", rstart, rend);
            } else {
                printf("  {0x%04x, 0x%04x}\n", rstart, rend);
            }
        }
        print("};");
        print("");

        count = state[cs, "count"];
        printf("static const size_t cs_%s_len = %d;\n", cs, count);
        if (i + 1 < all_charsets_count) {
            print("");
        }
    }
}

# And we're done.
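
The "Keeping track of state" and printing steps of the script above can
be sketched in Python as follows.  This is only an illustration of the
range-coalescing idea, not part of the patch: 'coalesce' and
'format_ranges' are hypothetical names, and the input is assumed to be
(start, end) code point spans in ascending order, as they appear in
UnicodeData.txt.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int]


def coalesce(spans: Iterable[Range]) -> List[Range]:
    """Merge ascending (start, end) spans that are directly adjacent."""
    ranges: List[Range] = []
    for start, end in spans:
        if ranges and ranges[-1][1] + 1 == start:
            # The span continues the current range: extend its end.
            ranges[-1] = (ranges[-1][0], end)
        else:
            # There is a gap: start a new range.
            ranges.append((start, end))
    return ranges


def format_ranges(name: str, ranges: List[Range]) -> str:
    """Render ranges in the same shape as the generated C table."""
    lines = [f"static const scm_t_char_range cs_{name}_ranges[] = {{"]
    for index, (lo, hi) in enumerate(ranges):
        comma = "," if index + 1 < len(ranges) else ""
        lines.append(f"  {{0x{lo:04x}, 0x{hi:04x}}}{comma}")
    lines += ["};", "", f"static const size_t cs_{name}_len = {len(ranges)};"]
    return "\n".join(lines) + "\n"
```

For example, the spans for '0' to '9' and 'A' to 'F' collapse into two
ranges, because code points 58 to 64 are not hex digits.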

Information forwarded to bug-guix <at> gnu.org:
bug#54111; Package guix. (Wed, 16 Mar 2022 10:49:01 GMT) Full text and rfc822 format available.

Message #31 received at 54111 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: Maxime Devos <maximedevos <at> telenet.be>, 54111 <at> debbugs.gnu.org
Subject: Re: bug#54111: guile bundles (a compiled version of)
 UnicodeData.txt and binaries
Date: Wed, 16 Mar 2022 11:47:56 +0100

Hi Tim,

Timothy Sample <samplet <at> ngyro.com> wrote:

> Well, I don’t consider myself an Awk person, but I had to implement it
> for Gash-Utils, so I know it well enough!  This may not be the most
> idiomatic Awk program, but to my eyes it is no less readable than the
> Perl version.

You rock!

[...]

> It runs with “gawk --posix”.  If I run “gawk --lint”, I get warnings,
> but I’m pretty sure they are spurious (they may even be Gawk bugs, but I
> would have to double check the relevant specs and docs).  If the lint
> warnings are a problem, you can append the empty string to the argument
> of the ‘hex’ function to make them go away.  Also, (as a bonus) as of
> commit 62c56f9 the Gash-Utils version of Awk can run this script!  :)

Incredible.  :-)

> Of course, to use this script as part of the Guile build, someone™ will
> have to double check that we can legally redistribute the Unicode data
> file (probably okay, but always good to check), and update the build
> rules to generate the C file.  I can’t guarantee that I’ll get to it....

I’ll check with Andy if he’s fine with this option.  Would you like to
turn it into a patch against Guile?  If not, I could do that.

> # unidata_to_charset.awk --- Compute SRFI-14 charsets from UnicodeData.txt
> #
> # Copyright (C) 2009, 2010, 2022 Free Software Foundation, Inc.

Is this correct?  (Maybe yes because it’s a translation of the original
Perl script, right?)

Thanks a lot!

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#54111; Package guix. (Wed, 16 Mar 2022 23:43:02 GMT) Full text and rfc822 format available.

Message #34 received at 54111 <at> debbugs.gnu.org (full text, mbox):

From: Timothy Sample <samplet <at> ngyro.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Maxime Devos <maximedevos <at> telenet.be>, 54111 <at> debbugs.gnu.org
Subject: Re: bug#54111: guile bundles (a compiled version of)
 UnicodeData.txt and binaries
Date: Wed, 16 Mar 2022 17:42:13 -0600

Hi Ludo,

Ludovic Courtès <ludo <at> gnu.org> writes:

> Timothy Sample <samplet <at> ngyro.com> wrote:
>
>> Of course, to use this script as part of the Guile build, someone™ will
>> have to double check that we can legally redistribute the Unicode data
>> file (probably okay, but always good to check), and update the build
>> rules to generate the C file.  I can’t guarantee that I’ll get to it....
>
> I’ll check with Andy if he’s fine with this option.  Would you like to
> turn it into a patch against Guile?  If not, I could do that.

I’ll do it.  It always feels good to submit a patch!

>> # unidata_to_charset.awk --- Compute SRFI-14 charsets from UnicodeData.txt
>> #
>> # Copyright (C) 2009, 2010, 2022 Free Software Foundation, Inc.
>
> Is this correct?  (Maybe yes because it’s a translation of the original
> Perl script, right?)

That’s my understanding.  This is technically a modification of the
original work, so the old copyright years are still relevant.


-- Tim

Information forwarded to bug-guix <at> gnu.org:
bug#54111; Package guix. (Sat, 19 Mar 2022 18:21:01 GMT) Full text and rfc822 format available.

Message #37 received at 54111 <at> debbugs.gnu.org (full text, mbox):

From: Timothy Sample <samplet <at> ngyro.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Maxime Devos <maximedevos <at> telenet.be>, 54111 <at> debbugs.gnu.org
Subject: Re: bug#54111: guile bundles (a compiled version of)
 UnicodeData.txt and binaries
Date: Sat, 19 Mar 2022 12:20:17 -0600

[Message part 1 (text/plain, inline)]

Hi again,

Timothy Sample <samplet <at> ngyro.com> writes:

> Ludovic Courtès <ludo <at> gnu.org> writes:
>
>> Timothy Sample <samplet <at> ngyro.com> wrote:
>>
>>> Of course, to use this script as part of the Guile build, someone™ will
>>> have to double check that we can legally redistribute the Unicode data
>>> file (probably okay, but always good to check), and update the build
>>> rules to generate the C file.  I can’t guarantee that I’ll get to it....
>>
>> I’ll check with Andy if he’s fine with this option.  Would you like to
>> turn it into a patch against Guile?  If not, I could do that.
>
> I’ll do it.  It always feels good to submit a patch!

I’ve attached two patches, the second of which is gzipped (the
UnicodeData.txt file is nearly 2M).

The first patch replaces the Perl script with the Awk script.  The Awk
script produces an identical ‘srfi-14.i.c’, except for changing “.pl” to
“.awk” in a comment.

The second patch removes ‘srfi-14.i.c’, adds ‘UnicodeData.txt’, and
teaches the build machinery how to generate the former from the latter.
I did my best with the Makefile, but I’m still a noob when it comes to
Automake conventions.  This is the part that warrants the most review!
Finally, I added support for comments to the Awk script so that I could
put the Unicode license text in the data file itself.  This is probably
the simplest way to dispatch our legal obligations to Unicode, Inc. (and
follow the guidelines of the FSF).  For all the details, see
<https://www.unicode.org/copyright.html> and
<https://www.gnu.org/licenses/license-list.html#Unicode>.

-- Tim

[0001-Reimplement-unidata_to_charset.pl-in-Awk.patch (text/x-patch, attachment)]

[0002-Create-srfi-14.i.c-during-build.patch.gz (application/octet-stream, attachment)]
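
The patches are not shown inline, so how the Awk script skips comments is
not visible in this log.  As a rough Python sketch of the idea, assuming
(this is a guess, not taken from the patch) that comment lines in the
data file start with '#':

```python
from typing import Iterable, Iterator, List


def data_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield semicolon-separated fields, skipping '#' comments and blanks."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # license text or blank line, not a data record
        yield stripped.split(";")
```

With Python's zero-based indexing, fields 0, 1, 2, 12, and 13 correspond
to the Awk script's $1, $2, $3, $13, and $14.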

Reply sent to Ludovic Courtès <ludo <at> gnu.org>:
You have taken responsibility. (Thu, 24 Mar 2022 13:34:02 GMT) Full text and rfc822 format available.

Notification sent to Maxime Devos <maximedevos <at> telenet.be>:
bug acknowledged by developer. (Thu, 24 Mar 2022 13:34:03 GMT) Full text and rfc822 format available.

Message #42 received at 54111-done <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: Maxime Devos <maximedevos <at> telenet.be>, 54111-done <at> debbugs.gnu.org
Subject: Re: bug#54111: guile bundles (a compiled version of)
 UnicodeData.txt and binaries
Date: Thu, 24 Mar 2022 14:33:38 +0100

Hello,

Timothy Sample <samplet <at> ngyro.com> wrote:

> I’ve attached two patches, the second of which is gzipped (the
> UnicodeData.txt file is nearly 2M).
>
> The first patch replaces the Perl script with the Awk script.  The Awk
> script produces an identical ‘srfi-14.i.c’, except for changing “.pl” to
> “.awk” in a comment.
>
> The second patch removes ‘srfi-14.i.c’, adds ‘UnicodeData.txt’, and
> teaches the build machinery how to generate the former from the latter.
> I did my best with the Makefile, but I’m still a noob when it comes to
> Automake conventions.  This is the part that warrants the most review!
> Finally, I added support for comments to the Awk script so that I could
> put the Unicode license text in the data file itself.  This is probably
> the simplest way to dispatch our legal obligations to Unicode, Inc. (and
> follow the guidelines of the FSF).  For all the details, see
> <https://www.unicode.org/copyright.html> and
> <https://www.gnu.org/licenses/license-list.html#Unicode>.

This all looks good to me.

Pushed in Guile as commit 9f8e05e513399985021643c34217f45d65c66392,
thank you!

Ludo’.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 22 Apr 2022 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 1 year and 363 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
