9 Commits

Author SHA1 Message Date
Jakub Jelinek
0c0847158c Update to Unicode 17.0.0
The following patch updates GCC from Unicode 16.0.0 to 17.0.0.

I've followed the README and also updated one script from glibc, but
that script now needs another Unicode file, HangulSyllableType.txt,
so I'm adding it.
I've added one new test to named-universal-char-escape-1.c for
a randomly chosen character from the new CJK block.
Note, the Unicode 17.0.0 authors forgot to adjust the 4-8 table; I've
filed bug reports about that.  The UnicodeData.txt changes to the range
ends and the new range seem to match e.g. what is in the glyph tables,
so the patch follows UnicodeData.txt and not the 4-8 table here.

Another thing was that makeuname2c.cc mishandled the case where the
size of the generated string table modulo 77 was 76 or 0 (that is, the
last line of the table held 76 or 77 characters): it forgot to emit
a semicolon after the string literal, so the generated header failed
to compile.
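
The off-by-a-few condition described above can be illustrated with a small
sketch.  This is not the code from makeuname2c.cc; emit_dict and its
parameters are hypothetical names, and the sketch assumes the table holds
no characters that would need escaping inside a string literal.  It splits
a table into adjacent string literals of at most `width` characters and
puts the semicolon on the chunk that reaches the end of the table,
whatever the length of that chunk is.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not the actual write_dict): emit `dict` as adjacent
// string literals, at most `width` characters per line, and terminate the
// last literal with a semicolon.
std::string
emit_dict (const std::string &dict, std::size_t width = 77)
{
  std::string out;
  for (std::size_t i = 0; i < dict.size (); i += width)
    {
      out += '"';
      out += dict.substr (i, width);
      out += '"';
      // The last chunk is the one whose end reaches the end of the table.
      // Testing i + width >= size covers every last-chunk length from
      // 1 up to width, including the full-width case.
      if (i + width >= dict.size ())
        out += ';';
      out += '\n';
    }
  return out;
}
```

With a width of 3, both a 5-character table (short last chunk) and
a 6-character table (full last chunk) end in a line carrying the semicolon.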

And as can be seen in the emoji-data.txt diff, some properties like
Extended_Pictographic have been removed from certain characters, e.g.
from the Mahjong tile characters except U+1F004, and one libstdc++
test was checking that property on exactly U+1F000.  I don't know why
that was changed, but U+1F004 is the only colored one among tons of
black and white ones.

2025-10-08  Jakub Jelinek  <jakub@redhat.com>

contrib/
	* unicode/README: Add HangulSyllableType.txt file to the
	list as newest utf8_gen.py from glibc now needs it.  Adjust
	git commit hash and change unicode 16 version to 17.
	* unicode/from_glibc/utf8_gen.py: Updated from glibc.
	* unicode/DerivedCoreProperties.txt: Updated from Unicode 17.0.0.
	* unicode/emoji-data.txt: Likewise.
	* unicode/PropList.txt: Likewise.
	* unicode/GraphemeBreakProperty.txt: Likewise.
	* unicode/DerivedNormalizationProps.txt: Likewise.
	* unicode/NameAliases.txt: Likewise.
	* unicode/UnicodeData.txt: Likewise.
	* unicode/EastAsianWidth.txt: Likewise.
	* unicode/DerivedGeneralCategory.txt: Likewise.
	* unicode/HangulSyllableType.txt: New file.
gcc/testsuite/
	* c-c++-common/cpp/named-universal-char-escape-1.c: Add test for
	\N{CJK UNIFIED IDEOGRAPH-3340E}.
libcpp/
	* makeucnid.cc (write_copyright): Adjust copyright year.
	* makeuname2c.cc (generated_ranges): Adjust end points for a couple
	of ranges based on UnicodeData.txt Last changes and add a whole new
	CJK UNIFIED IDEOGRAPH- entry.  None of these changes are in the 4-8
	table, but clearly it has just been forgotten.
	(write_copyright): Adjust copyright year.
	(write_dict): Fix up condition when to print semicolon.
	* generated_cpp_wcwidth.h: Regenerate.
	* ucnid.h: Regenerate.
	* uname2c.h: Regenerate.
libstdc++-v3/
	* include/bits/unicode-data.h: Regenerate.
	* testsuite/ext/unicode/properties.cc: Test __is_extended_pictographic
	on U+1F004 rather than U+1F000.
2025-10-08 18:02:39 +02:00
Jakub Jelinek
29bc14c750 Update copyright years. 2025-01-02 12:17:04 +01:00
Jakub Jelinek
d0e8f58b81 contrib, libcpp, libstdc++: Update to Unicode 16.0
It is autumn again and there is a new Unicode version 16.0.

The following patch updates our Unicode stuff in contrib, libcpp and
libstdc++ from that Unicode version.

2024-10-08  Jakub Jelinek  <jakub@redhat.com>

contrib/
	* unicode/README: Update glibc git commit hash, replace
	Unicode 15 or 15.1 versions with 16.
	* unicode/gen_libstdcxx_unicode_data.py: Use 160000 instead of
	150100 in _GLIBCXX_GET_UNICODE_DATA test.
	* unicode/from_glibc/utf8_gen.py: Updated from glibc
	064c708c78cc2a6b5802dce73108fc0c1c6bfc80 commit.
	* unicode/DerivedCoreProperties.txt: Updated from Unicode 16.0.
	* unicode/emoji-data.txt: Likewise.
	* unicode/PropList.txt: Likewise.
	* unicode/GraphemeBreakProperty.txt: Likewise.
	* unicode/DerivedNormalizationProps.txt: Likewise.
	* unicode/NameAliases.txt: Likewise.
	* unicode/UnicodeData.txt: Likewise.
	* unicode/EastAsianWidth.txt: Likewise.
gcc/testsuite/
	* c-c++-common/cpp/named-universal-char-escape-1.c: Add tests
	for some Unicode 16.0 characters, both normal and generated.
libcpp/
	* makeucnid.cc (write_copyright): Update Unicode Copyright years.
	* makeuname2c.cc (generated_ranges): Adjust Unicode version from 15.1
	to 16.0.  Add EGYPTIAN HIEROGLYPH- generated range, adjust indexes in
	following entries.
	(write_copyright): Update Unicode Copyright years.
	* generated_cpp_wcwidth.h: Regenerated.
	* ucnid.h: Regenerated.
	* uname2c.h: Regenerated.
libstdc++-v3/
	* include/bits/unicode.h (std::__unicode::__v15_1_0): Rename inline
	namespace to ...
	(std::__unicode::__v16_0_0): ... this.
	(_GLIBCXX_GET_UNICODE_DATA): Change from 150100 to 160000.
	* include/bits/unicode-data.h: Regenerated.
	* testsuite/ext/unicode/properties.cc: Check for _Gcb_SpacingMark
	on U+11F03 rather than U+1D16D as the latter lost SpacingMark property
	in Unicode 16.0.
2024-10-08 10:01:47 +02:00
Jakub Jelinek
a945c346f5 Update copyright years. 2024-01-03 12:19:35 +01:00
Jakub Jelinek
d64b7c82da libcpp, contrib: Update to Unicode 15.1
The following patch (in plaintext just a pseudo-patch in which the
overly large parts of the wget-downloaded or regenerated files are
replaced with ...; the full patch is attached compressed) updates to
Unicode 15.1 from the 15.0 we had last year.  Apparently Unicode forgot
to add a new range to the 4-8 table we are using, but from the other
files it is clear what should have been added; I've filed a bug report
against Unicode.

2023-11-14  Jakub Jelinek  <jakub@redhat.com>

contrib/
	* unicode/README: Adjust glibc git commit hash, number of Unicode
	data files to be updated and latest Unicode version.
	* unicode/from_glibc/utf8_gen.py: Update from glibc.
	* unicode/UnicodeData.txt: Update from Unicode 15.1.
	* unicode/EastAsianWidth.txt: Likewise.
	* unicode/DerivedNormalizationProps.txt: Likewise.
	* unicode/NameAliases.txt: Likewise.
	* unicode/DerivedCoreProperties.txt: Likewise.
	* unicode/PropList.txt: Likewise.
libcpp/
	* makeucnid.cc (write_copyright): Update copyright year.
	* makeuname2c.cc (write_copyright): Likewise.
	(struct generated): Update latest Unicode version.
	(generated_ranges): Add 2ebf0-2ee5d CJK UNIFIED IDEOGRAPH
	range which was forgotten to be added to 4-8 table, but
	clearly is expected to be there from the 15.1 additions.
	* ucnid.h: Regenerated.
	* uname2c.h: Regenerated.
	* generated_cpp_wcwidth.h: Regenerated.
2023-11-14 18:32:37 +01:00
Jakub Jelinek
99bae6ee66 libcpp: Update Unicode copyright years
I've noticed I forgot to update copyright years when updating from
Unicode 15.0.0 (and makeucnid.cc had it hopelessly obsolete).

2023-03-16  Jakub Jelinek  <jakub@redhat.com>

	* makeucnid.cc (write_copyright): Update Unicode copyright years
	up to 2022.
	* makeuname2c.cc (write_copyright): Likewise.
	* ucnid.h: Regenerated.
	* uname2c.h: Regenerated.
2023-03-16 10:19:04 +01:00
Jakub Jelinek
83ffe9cde7 Update copyright years. 2023-01-16 11:52:17 +01:00
Jakub Jelinek
2662d537b0 libcpp: Update to Unicode 15
The following pseudo-patch regenerates the libcpp tables with Unicode 15.0.0,
which added 4489 new characters.

As mentioned previously, this isn't just a matter of running the
two libcpp/make*.cc programs on the new Unicode files; one also needs
to manually update a table inside of makeuname2c.cc according to
a table in the Unicode text (which is partially reflected in the text
files, though not 100% accurately in e.g. Unicode 14.0.0; in 15.0.0
it is accurate).
I've also added a randomly chosen subset of those 4489 new
characters to a testcase.
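
To make the role of that manually maintained table concrete, here is
a minimal sketch of a range table driving Unicode's NR2 rule (name =
prefix followed by the code point in uppercase hex).  The struct layout
and the names generated_range, ranges and generated_name are assumptions
for illustration and need not match makeuname2c.cc; the single range shown
is the 2ebf0-2ee5d CJK UNIFIED IDEOGRAPH range quoted in the Unicode 15.1
entry of this history.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical layout; the real table in makeuname2c.cc may differ.
struct generated_range
{
  const char *prefix;      // Name prefix shared by the whole range.
  std::uint32_t low, high; // Inclusive code point bounds.
};

static const generated_range ranges[] = {
  // Range taken from the Unicode 15.1 ChangeLog entry in this log.
  { "CJK UNIFIED IDEOGRAPH-", 0x2ebf0, 0x2ee5d },
};

// Return the NR2-style generated name for CP, or an empty string if CP
// is not covered by any entry of the table.
std::string
generated_name (std::uint32_t cp)
{
  for (const generated_range &r : ranges)
    if (cp >= r.low && cp <= r.high)
      {
        char hex[16];
        // NR2 appends the code point as uppercase hex, at least 4 digits.
        std::snprintf (hex, sizeof hex, "%04X", static_cast<unsigned> (cp));
        return std::string (r.prefix) + hex;
      }
  return std::string ();
}
```

If a range end in such a table lags behind UnicodeData.txt, names for the
newly added code points are simply not found, which is why the table has
to be kept in sync by hand.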

2022-11-04  Jakub Jelinek  <jakub@redhat.com>

gcc/testsuite/
	* c-c++-common/cpp/named-universal-char-escape-1.c: Add tests for some
	characters newly added in Unicode 15.0.0.
libcpp/
	* makeuname2c.cc (struct generated): Update from Unicode 15.0.0
	table 4-8.
	* ucnid.h: Regenerated for Unicode 15.0.0.
	* uname2c.h: Likewise.
2022-11-04 18:18:42 +01:00
Jakub Jelinek
eb4879ab90 c++: Implement C++23 P2071R2 - Named universal character escapes [PR106648]
The following patch implements the
C++23 P2071R2 - Named universal character escapes
paper to support \N{LATIN SMALL LETTER E} etc.
I've used Unicode 14.0; there are 144803 character name properties
(including the ones generated by the Unicode NR1 and NR2 rules)
and correction/control/alternate aliases.  Together with zero
terminators that would be 3884745 bytes, which is clearly unacceptable
for libcpp.
This patch instead contains a generator which from the UnicodeData.txt
and NameAliases.txt files emits a space-optimized radix tree (208765
bytes long for 14.0), a single string literal dictionary (59418 bytes),
the maximum name length (currently 88 chars) and two small helper arrays
for the NR1/NR2 name generation.
The radix tree needs 2 to 9 bytes per node, the exact format is
described in the generator program.  There could be ways to shrink
the dictionary size somewhat at the expense of slightly slower lookups.
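
As background for the NR1 rule mentioned above, the following sketch
derives a Hangul syllable name arithmetically instead of storing it.  It
follows the decomposition constants and jamo short names published in the
Unicode Standard; the array and function names are assumptions and this is
not how charset.cc or the generated helper arrays are laid out.

```cpp
#include <cstdint>
#include <string>

// Jamo short names from the Unicode Standard (leading consonants, vowels,
// trailing consonants).  Array names are hypothetical.
static const char *const jamo_l[19] = {
  "G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
  "SS", "", "J", "JJ", "C", "K", "T", "P", "H"
};
static const char *const jamo_v[21] = {
  "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
  "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"
};
static const char *const jamo_t[28] = {
  "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB", "LS",
  "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C", "K", "T",
  "P", "H"
};

// Return the NR1 name of a precomposed Hangul syllable, or an empty
// string if CP lies outside U+AC00..U+D7A3.
std::string
hangul_syllable_name (std::uint32_t cp)
{
  const std::uint32_t s_base = 0xAC00, t_count = 28;
  const std::uint32_t n_count = 21 * t_count; // 588 syllables per L jamo
  const std::uint32_t s_count = 19 * n_count; // 11172 syllables in total
  if (cp < s_base || cp >= s_base + s_count)
    return std::string ();
  const std::uint32_t s = cp - s_base;
  return std::string ("HANGUL SYLLABLE ")
         + jamo_l[s / n_count]
         + jamo_v[(s % n_count) / t_count]
         + jamo_t[s % t_count];
}
```

For example, U+D4DB decomposes into indexes 17, 16 and 15, giving
HANGUL SYLLABLE PWILH, so none of the 11172 syllable names needs to be
stored in the dictionary.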

Currently the patch implements strict matching (that is what is needed
to actually implement the feature on valid code) and loose matching
per the Unicode UAX44-LM2 algorithm to provide hints (that algorithm
essentially ignores spaces, underscores and those hyphens that sit
between two alphanumeric characters (with one exception for the hyphen)
and matches case insensitively).
The attachment contains a WIP patch that shows how to also implement
spellcheck.{h,cc} style discovery of misspellings, but I'll need to talk
to David Malcolm about it, as spellcheck.{h,cc} is in the gcc/ subdir
(so the WIP incremental patch instead prints all the names to stderr).
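
The loose matching described above can be sketched roughly as follows.
This is a simplified illustration, not _cpp_uname2c_uax44_lm2: the names
loose_key and loose_match are hypothetical, the input is assumed to be
ASCII, and the one documented exception (the hyphen in HANGUL JUNGSEONG
O-E) is deliberately not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Reduce NAME to a comparison key: drop spaces and underscores, drop
// hyphens that sit directly between two alphanumeric characters, and
// fold everything else to upper case.
std::string
loose_key (const std::string &name)
{
  std::string key;
  for (std::size_t i = 0; i < name.size (); ++i)
    {
      const unsigned char c = static_cast<unsigned char> (name[i]);
      if (c == ' ' || c == '_')
        continue;
      if (c == '-' && i > 0 && i + 1 < name.size ()
          && std::isalnum (static_cast<unsigned char> (name[i - 1]))
          && std::isalnum (static_cast<unsigned char> (name[i + 1])))
        continue; // Medial hyphen: ignored for loose matching.
      key += static_cast<char> (std::toupper (c));
    }
  return key;
}

// Two spellings match loosely when their keys are identical.
bool
loose_match (const std::string &a, const std::string &b)
{
  return loose_key (a) == loose_key (b);
}
```

Under this sketch "latin_small_letter_e" matches LATIN SMALL LETTER E,
while TIBETAN LETTER -A stays distinct from TIBETAN LETTER A because its
hyphen follows a space and is therefore not medial.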

2022-08-26  Jakub Jelinek  <jakub@redhat.com>

	PR c++/106648
libcpp/
	* charset.cc: Implement C++23 P2071R2 - Named universal character
	escapes.  Include uname2c.h.
	(hangul_syllables, hangul_count): New variables.
	(struct uname2c_data): New type.
	(_cpp_uname2c, _cpp_uname2c_uax44_lm2): New functions.
	(_cpp_valid_ucn): Use them.  Handle named universal character escapes.
	(convert_ucn): Adjust comment.
	(convert_escape): Call convert_ucn even for \N.
	(_cpp_interpret_identifier): Handle named universal character escapes.
	* lex.cc (get_bidi_ucn): Fix up function comment formatting.
	(get_bidi_named): New function.
	(forms_identifier_p, lex_string): Handle named universal character
	escapes.
	* makeuname2c.cc: New file.  Small parts copied from makeucnid.cc.
	* uname2c.h: New generated file.
gcc/c-family/
	* c-cppbuiltin.cc (c_cpp_builtins): Predefine
	__cpp_named_character_escapes to 202207L.
gcc/testsuite/
	* c-c++-common/cpp/named-universal-char-escape-1.c: New test.
	* c-c++-common/cpp/named-universal-char-escape-2.c: New test.
	* c-c++-common/cpp/named-universal-char-escape-3.c: New test.
	* c-c++-common/cpp/named-universal-char-escape-4.c: New test.
	* c-c++-common/Wbidi-chars-25.c: New test.
	* gcc.dg/cpp/named-universal-char-escape-1.c: New test.
	* gcc.dg/cpp/named-universal-char-escape-2.c: New test.
	* g++.dg/cpp/named-universal-char-escape-1.C: New test.
	* g++.dg/cpp/named-universal-char-escape-2.C: New test.
	* g++.dg/cpp23/feat-cxx2b.C: Test __cpp_named_character_escapes.
2022-08-26 09:27:39 +02:00
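
A small usage sketch of the feature introduced by the commit above,
guarded by the __cpp_named_character_escapes macro that its ChangeLog
predefines to 202207L.  The function name small_e is made up for the
example; the fallback branch is there so the snippet also builds with
compilers or language modes that lack the feature.

```cpp
// Returns the letter "e"; spelled with a named universal character escape
// when the compiler advertises support through the feature-test macro.
constexpr char
small_e ()
{
#if defined (__cpp_named_character_escapes) \
    && __cpp_named_character_escapes >= 202207L
  return '\N{LATIN SMALL LETTER E}';
#else
  return 'e';
#endif
}

// LATIN SMALL LETTER E is U+0065, i.e. plain ASCII 'e'.
static_assert (small_e () == 'e', "named escape must denote U+0065");
```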