mirror of
https://forge.sourceware.org/marek/gcc.git
synced 2026-02-22 03:47:02 -05:00
The following patch updates GCC from Unicode 16.0.0 to 17.0.0.

I've followed what the README says and also updated one script from glibc,
but that needed another Unicode file, HangulSyllableType.txt, so I'm adding
it as well.
I've added one new test to named-universal-char-escape-1.c for a randomly
chosen character from the new CJK block.

Note, the Unicode 17.0.0 authors forgot to adjust the 4-8 table. I've filed
bug reports about that, but the UnicodeData.txt changes for the range ends
and the new range seem to match e.g. what is in the glyph tables, so the
patch follows UnicodeData.txt and not the 4-8 table here.

Another thing was that makeuname2c.cc didn't correctly handle the case where
the size of the generated string table modulo 77 was 76 or 77; in that case
it forgot to emit a semicolon after the string literal, so the output failed
to compile.

And as can be seen in the emoji-data.txt diff, some properties like
Extended_Pictographic have been removed from certain characters, e.g. from
the Mahjong tile characters except U+1F004, and one libstdc++ test was
testing that property exactly on U+1F000. Dunno why that was changed, but
U+1F004 is the only colored one among tons of black and white ones.

2025-10-08  Jakub Jelinek  <jakub@redhat.com>

contrib/
	* unicode/README: Add HangulSyllableType.txt file to the list as
	newest utf8_gen.py from glibc now needs it.  Adjust git commit hash
	and change unicode 16 version to 17.
	* unicode/from_glibc/utf8_gen.py: Updated from glibc.
	* unicode/DerivedCoreProperties.txt: Updated from Unicode 17.0.0.
	* unicode/emoji-data.txt: Likewise.
	* unicode/PropList.txt: Likewise.
	* unicode/GraphemeBreakProperty.txt: Likewise.
	* unicode/DerivedNormalizationProps.txt: Likewise.
	* unicode/NameAliases.txt: Likewise.
	* unicode/UnicodeData.txt: Likewise.
	* unicode/EastAsianWidth.txt: Likewise.
	* unicode/DerivedGeneralCategory.txt: Likewise.
	* unicode/HangulSyllableType.txt: New file.
gcc/testsuite/
	* c-c++-common/cpp/named-universal-char-escape-1.c: Add test for
	\N{CJK UNIFIED IDEOGRAPH-3340E}.
libcpp/
	* makeucnid.cc (write_copyright): Adjust copyright year.
	* makeuname2c.cc (generated_ranges): Adjust end points for a couple
	of ranges based on UnicodeData.txt Last changes and add a whole new
	CJK UNIFIED IDEOGRAPH- entry.  None of these changes are in the 4-8
	table, but clearly it has just been forgotten.
	(write_copyright): Adjust copyright year.
	(write_dict): Fix up condition when to print semicolon.
	* generated_cpp_wcwidth.h: Regenerate.
	* ucnid.h: Regenerate.
	* uname2c.h: Regenerate.
libstdc++-v3/
	* include/bits/unicode-data.h: Regenerate.
	* testsuite/ext/unicode/properties.cc: Test __is_extended_pictographic
	on U+1F004 rather than U+1F000.
84 lines
3.7 KiB
Plaintext
This directory contains a mechanism for GCC to have its own internal
implementation of wcwidth functionality (cpp_wcwidth () in libcpp/charset.cc),
as well as a mechanism to update the information about codepoints permitted in
identifiers, which is encoded in libcpp/ucnid.h, and the mapping between
Unicode names and codepoints, which is encoded in libcpp/uname2c.h.

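As a rough illustration of the three kinds of data mentioned above (display
width, identifier eligibility, and name-to-codepoint mapping), here is a
minimal sketch using Python's standard unicodedata module.  It relies on
Python's own tables and rules, not on GCC's generated headers, and the helper
names display_width, ok_in_identifier and name_to_codepoint are hypothetical.

```python
import unicodedata


def display_width(ch: str) -> int:
    """Approximate column width: 0 for combining marks, 2 for
    East Asian Wide/Fullwidth characters, 1 otherwise.
    This is a simplification, not the logic of cpp_wcwidth ()."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def ok_in_identifier(ch: str) -> bool:
    """Use Python's identifier rule as a stand-in for the kind of
    question ucnid.h answers (the actual C/C++ rules differ)."""
    return ("a" + ch).isidentifier()


def name_to_codepoint(name: str) -> int:
    """Map a Unicode character name to its code point, the kind of
    lookup that \\N{...} escapes need."""
    return ord(unicodedata.lookup(name))


if __name__ == "__main__":
    print(display_width("A"), display_width("\u4e2d"))
    print(ok_in_identifier("\u00e9"), ok_in_identifier("-"))
    print(hex(name_to_codepoint("GREEK SMALL LETTER ALPHA")))
```

The point is only to show what each table is for; the values GCC uses come
from the files and tools described below.
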
The idea is to produce the necessary lookup tables
(../../libcpp/{ucnid.h,uname2c.h,generated_cpp_wcwidth.h}) in a reproducible
way, starting from the following files that are distributed by the Unicode
Consortium:

  ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
  ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt
  ftp://ftp.unicode.org/Public/UNIDATA/PropList.txt
  ftp://ftp.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
  ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
  ftp://ftp.unicode.org/Public/UNIDATA/NameAliases.txt
  ftp://ftp.unicode.org/Public/UNIDATA/HangulSyllableType.txt

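UnicodeData.txt is a semicolon-separated file in which large blocks are
written as a pair of lines named "<..., First>" and "<..., Last>".  When
checking the generated_ranges table in libcpp/makeuname2c.cc against a new
Unicode release, it can help to list those pairs.  The following is a minimal
sketch; the function name first_last_ranges is hypothetical and the SAMPLE
lines are written in the file's format for illustration rather than read from
the real file.

```python
def first_last_ranges(lines):
    """Yield (name, first, last) for each <..., First>/<..., Last> pair."""
    pending = None
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        codepoint, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            # Strip the leading "<" and the trailing ", First>".
            pending = (name[1:-len(", First>")], codepoint)
        elif name.endswith(", Last>") and pending is not None:
            yield pending[0], pending[1], codepoint
            pending = None


# Illustrative sample in UnicodeData.txt format.
SAMPLE = [
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
    "AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
]

if __name__ == "__main__":
    for name, first, last in first_last_ranges(SAMPLE):
        print(f"{name}: {first:04X}..{last:04X}")
```

To run it on the real data, pass open("UnicodeData.txt", encoding="utf-8")
instead of SAMPLE.
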
Three additional files are needed for lookup tables in libstdc++:

  ftp://ftp.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt
  ftp://ftp.unicode.org/Public/UNIDATA/emoji/emoji-data.txt
  ftp://ftp.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt

All these files have been added to source control in this directory;
please see unicode-license.txt for the relevant copyright information.

In order to keep in sync with glibc's wcwidth as much as possible, it is
desirable for the logic that processes the Unicode data to be the same as
glibc's.  To that end, we also put in this directory, in the from_glibc/
directory, the glibc python code that implements their logic.  This code was
copied verbatim from glibc, and it can be updated at any time from the glibc
source code repository.  The files copied from that repository are:

  localedata/unicode-gen/unicode_utils.py
  localedata/unicode-gen/utf8_gen.py

And the most recent versions added to GCC are from glibc git commit:
  2642002380aafb71a1d3b569b6d7ebeab3284816

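One way to take the two files verbatim at a given commit is "git show
COMMIT:PATH" run in a local glibc checkout.  The sketch below assumes such a
checkout exists; the path passed as glibc_checkout and the helper names
git_show_argv and copy_from_glibc are hypothetical, and nothing is executed
unless copy_from_glibc is called explicitly.

```python
import os
import subprocess

GLIBC_COMMIT = "2642002380aafb71a1d3b569b6d7ebeab3284816"
GLIBC_FILES = [
    "localedata/unicode-gen/unicode_utils.py",
    "localedata/unicode-gen/utf8_gen.py",
]


def git_show_argv(glibc_checkout, commit, path):
    """Build the git command that prints PATH as of COMMIT."""
    return ["git", "-C", glibc_checkout, "show", f"{commit}:{path}"]


def copy_from_glibc(glibc_checkout, dest_dir="from_glibc", commit=GLIBC_COMMIT):
    """Write each file from the glibc checkout into dest_dir unchanged."""
    for path in GLIBC_FILES:
        result = subprocess.run(
            git_show_argv(glibc_checkout, commit, path),
            check=True,
            capture_output=True,
        )
        target = os.path.join(dest_dir, os.path.basename(path))
        with open(target, "wb") as out:
            out.write(result.stdout)
```

After copying from a newer commit, remember to record that commit hash in this
README.
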
The script gen_wcwidth.py found here contains the GCC-specific code to
map glibc's output to the lookup tables we require.  This script should not
need to change, unless there are structural changes to the Unicode data files
or to the glibc code.  Similarly, makeucnid.cc in ../../libcpp contains the
logic to produce ucnid.h, and makeuname2c.cc the logic to produce uname2c.h.

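To give an idea of how a compact width table can be queried, here is a minimal
sketch of a range-based lookup with binary search.  It assumes a layout of
sorted range end points with a parallel list of widths; this README does not
state that generated_cpp_wcwidth.h uses exactly this layout, and the tiny
table below is made up for illustration and is not real Unicode data.

```python
from bisect import bisect_left

# Hypothetical miniature table: RANGE_ENDS[i] is the last code point of
# range i, and WIDTHS[i] is the width shared by every code point in it.
RANGE_ENDS = [0x10FF, 0x115F, 0x2E7F, 0x303E, 0x10FFFF]
WIDTHS = [1, 2, 1, 2, 1]


def lookup_width(codepoint: int) -> int:
    """Return the width of the range that contains codepoint."""
    if codepoint < 0 or codepoint > RANGE_ENDS[-1]:
        raise ValueError("code point out of range")
    # Index of the first range whose end point is >= codepoint.
    return WIDTHS[bisect_left(RANGE_ENDS, codepoint)]


if __name__ == "__main__":
    for cp in (0x41, 0x1100, 0x1160):
        print(f"U+{cp:04X} -> {lookup_width(cp)}")
```

Storing only range ends keeps the table small, because long runs of code
points with the same width collapse into a single entry.
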
The procedure to update GCC's Unicode support is the following:

1. Update the seven Unicode data files from the first list of URLs above.

2. Update the two glibc files in from_glibc/ from glibc's git.  Update
   the commit number above in this README.

3. Run ./gen_wcwidth.py X.Y > ../../libcpp/generated_cpp_wcwidth.h
   (where X.Y is the version of the Unicode standard corresponding to the
   Unicode data files being used, most recently, 17.0.0).

4. Update Unicode Copyright years in libcpp/makeucnid.cc and in
   libcpp/makeuname2c.cc up to the year in which the Unicode
   standard has been released.

5. Compile makeucnid, e.g. with:
     g++ -O2 ../../libcpp/makeucnid.cc -o ../../libcpp/makeucnid

6. Generate ucnid.h as follows:
     ../../libcpp/makeucnid ../../libcpp/ucnid.tab UnicodeData.txt \
       DerivedNormalizationProps.txt DerivedCoreProperties.txt \
       > ../../libcpp/ucnid.h

7. Read the corresponding Unicode standard and update the generated_ranges
   table in libcpp/makeuname2c.cc accordingly (look at Table 4-8; for
   Unicode 17.0.0 that table had not been adjusted, so the range end points
   and the new CJK UNIFIED IDEOGRAPH- range were taken from UnicodeData.txt
   instead).

8. Compile makeuname2c, e.g. with:
     g++ -O2 ../../libcpp/makeuname2c.cc -o ../../libcpp/makeuname2c

9. Generate uname2c.h as follows:
     ../../libcpp/makeuname2c UnicodeData.txt NameAliases.txt \
       > ../../libcpp/uname2c.h

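The mechanical steps above (3, 5, 6, 8 and 9) can be collected in a small
helper so they are run in the same order each time.  This is only a sketch:
the function names regeneration_commands and as_shell are hypothetical, the
commands are the ones quoted in the steps, and the manual steps (1, 2, 4 and
7) still have to be done by hand.  It is meant to be run from this directory.

```python
import shlex

LIBCPP = "../../libcpp"


def regeneration_commands(unicode_version):
    """Return (argv, stdout_path) pairs for steps 3, 5, 6, 8 and 9.
    stdout_path is None when the command's output is not redirected."""
    return [
        (["./gen_wcwidth.py", unicode_version],
         f"{LIBCPP}/generated_cpp_wcwidth.h"),
        (["g++", "-O2", f"{LIBCPP}/makeucnid.cc", "-o", f"{LIBCPP}/makeucnid"],
         None),
        ([f"{LIBCPP}/makeucnid", f"{LIBCPP}/ucnid.tab", "UnicodeData.txt",
          "DerivedNormalizationProps.txt", "DerivedCoreProperties.txt"],
         f"{LIBCPP}/ucnid.h"),
        (["g++", "-O2", f"{LIBCPP}/makeuname2c.cc", "-o",
          f"{LIBCPP}/makeuname2c"],
         None),
        ([f"{LIBCPP}/makeuname2c", "UnicodeData.txt", "NameAliases.txt"],
         f"{LIBCPP}/uname2c.h"),
    ]


def as_shell(argv, stdout_path):
    """Render one command as a shell line, for review before running it."""
    line = shlex.join(argv)
    return f"{line} > {stdout_path}" if stdout_path else line


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv, stdout_path in regeneration_commands("17.0.0"):
        print(as_shell(argv, stdout_path))
```

Printing the commands rather than executing them keeps the helper safe to run
before the data files and the generated_ranges table have been updated.
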
See gen_libstdcxx_unicode_data.py for instructions on updating the lookup
tables in libstdc++.