David Malcolm b050653c4c contrib: add unicode/utf8-dump.py
This script may be useful when debugging issues relating to Unicode
encoding (e.g. when investigating source files with bidirectional control
characters).

It dumps a UTF-8 file as a list of numbered lines (mimicking GCC's
diagnostic output format), interleaved with lines per character showing
the Unicode codepoints, the UTF-8 encoding bytes, the name of the
character, and, where printable, the characters themselves.
The lines are printed in logical order, which may help the reader to grok
the relationship between visual and logical ordering in bi-di files.

For example:

$ cat test.c
int གྷ;
const char *אבג = "ALEF-BET-GIMEL";

$ ./contrib/unicode/utf8-dump.py test.c
   1 | int གྷ;
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0F43  0xe0 0xbd 0x83                       TIBETAN LETTER GHA གྷ
     |   U+003B            0x3b                                SEMICOLON ;
     |   U+000A            0x0a                           LINE FEED (LF) (control character)
   2 | const char *אבג = "ALEF-BET-GIMEL";
     |   U+0063            0x63                     LATIN SMALL LETTER C c
     |   U+006F            0x6f                     LATIN SMALL LETTER O o
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0073            0x73                     LATIN SMALL LETTER S s
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0063            0x63                     LATIN SMALL LETTER C c
     |   U+0068            0x68                     LATIN SMALL LETTER H h
     |   U+0061            0x61                     LATIN SMALL LETTER A a
     |   U+0072            0x72                     LATIN SMALL LETTER R r
     |   U+0020            0x20                                    SPACE (separator)
     |   U+002A            0x2a                                 ASTERISK *
     |   U+05D0       0xd7 0x90                       HEBREW LETTER ALEF א
     |   U+05D1       0xd7 0x91                        HEBREW LETTER BET ב
     |   U+05D2       0xd7 0x92                      HEBREW LETTER GIMEL ג
     |   U+0020            0x20                                    SPACE (separator)
     |   U+003D            0x3d                              EQUALS SIGN =
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0022            0x22                           QUOTATION MARK "
     |   U+0041            0x41                   LATIN CAPITAL LETTER A A
     |   U+004C            0x4c                   LATIN CAPITAL LETTER L L
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+0046            0x46                   LATIN CAPITAL LETTER F F
     |   U+002D            0x2d                             HYPHEN-MINUS -
     |   U+0042            0x42                   LATIN CAPITAL LETTER B B
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+0054            0x54                   LATIN CAPITAL LETTER T T
     |   U+002D            0x2d                             HYPHEN-MINUS -
     |   U+0047            0x47                   LATIN CAPITAL LETTER G G
     |   U+0049            0x49                   LATIN CAPITAL LETTER I I
     |   U+004D            0x4d                   LATIN CAPITAL LETTER M M
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+004C            0x4c                   LATIN CAPITAL LETTER L L
     |   U+0022            0x22                           QUOTATION MARK "
     |   U+003B            0x3b                                SEMICOLON ;
     |   U+000A            0x0a                           LINE FEED (LF) (control character)
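
The per-character rows above can be approximated with Python's standard
unicodedata module.  What follows is a minimal sketch, not the contents of
utf8-dump.py: the function names describe_char and dump_utf8, the column
widths, and the "(no name)" placeholder are assumptions made for
illustration.  In particular, unicodedata has no name for control characters
such as U+000A, so this sketch prints a placeholder where the output above
shows "LINE FEED (LF)".

```python
import io
import unicodedata


def describe_char(ch):
    """Format one row: codepoint, UTF-8 bytes, name, and the glyph if printable."""
    codepoint = "U+%04X" % ord(ch)
    utf8_bytes = " ".join("0x%02x" % b for b in ch.encode("utf-8"))
    # unicodedata has no names for control characters such as U+000A,
    # so fall back to a placeholder instead of raising ValueError.
    name = unicodedata.name(ch, "(no name)")
    category = unicodedata.category(ch)
    if category.startswith("C"):
        # Control and format characters (this includes bidi controls).
        shown = "(control character)"
    elif category.startswith("Z"):
        shown = "(separator)"
    else:
        shown = ch
    # Column widths are illustrative, not those of the real script.
    return "     |   %s %15s %40s %s" % (codepoint, utf8_bytes, name, shown)


def dump_utf8(text):
    """Yield each numbered line, then one row per character in logical order."""
    # io.StringIO splits on "\n" only and keeps the terminator on each line.
    for lineno, line in enumerate(io.StringIO(text), start=1):
        yield "%4i | %s" % (lineno, line.rstrip("\n"))
        for ch in line:
            yield describe_char(ch)
```

Calling dump_utf8 on the decoded contents of a file and printing each
yielded string gives output in the same general shape as the example.
Because the rows follow the order of characters in the string, they appear
in logical order regardless of how a terminal renders right-to-left text.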

Tested with Python 3.8

contrib/ChangeLog:
	* unicode/utf8-dump.py: New file.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
2021-11-01 11:52:28 -04:00

This directory contains the GNU Compiler Collection (GCC).

The GNU Compiler Collection is free software.  See the files whose
names start with COPYING for copying permission.  The manuals, and
some of the runtime libraries, are under different terms; see the
individual source files for details.

The directory INSTALL contains copies of the installation information
as HTML and plain text.  The source of this information is
gcc/doc/install.texi.  The installation information includes details
of what is included in the GCC sources and what files GCC installs.

See the file gcc/doc/gcc.texi (together with other files that it
includes) for usage and porting information.  An online readable
version of the manual is in the files gcc/doc/gcc.info*.

See http://gcc.gnu.org/bugs/ for how to report bugs usefully.

Copyright years on GCC source files may be listed using range
notation, e.g., 1987-2012, indicating that every year in the range,
inclusive, is a copyrightable year that could otherwise be listed
individually.