
Unicode


Case study: porting chardet to Python 3 - Dive Into Python 3. Strings - Dive Into Python 3. Files - Dive Into Python 3.

PEP 383 -- Non-decodable Bytes in System Character Interfaces. File names, environment variables, and command-line arguments are defined as character data in POSIX; the C APIs, however, allow passing arbitrary bytes, whether or not they conform to a particular encoding.

Non-decodable Bytes in System Character Interfaces

This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in a way that allows the original byte string to be recreated. The C char type is commonly used to represent both character data and bytes. Certain POSIX interfaces are specified and widely understood as operating on character data; however, the system call interfaces make no assumption about the encoding of these data and pass them on as-is. With Python 3, character strings use a Unicode-based internal representation, making it difficult to ignore the encoding of byte strings in the way that the C interfaces can.

Unicode In Python, Completely Demystified. UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128). Never seen this exception? Seen it and sort of fixed it? This is a confusing error. If you've never seen this before but want to write Python code, this talk is for you. If you've seen this before and have no idea how to solve it, this talk is for you. This is a really confusing error if you don't know what Python is trying to do for you; this talk aims to clarify.

The truth about strings in Python. The magic of Unicode. How to work with Unicode in Python 2: fundamental concept, example code. Glimpse at Unicode in Python 3. Ask lots of questions. Corrections?
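Both ideas above can be tried directly in Python 3. The following is a minimal sketch, not taken from the PEP or the talk: it uses Python's standard `surrogateescape` error handler, which is how PEP 383 is exposed in the language, and the sample name and all variable names are chosen here purely for illustration.

```python
# Sample input (an assumption for illustration): a name containing a
# non-ASCII letter, encoded as UTF-8. U+0107 becomes the bytes 0xC4 0x87.
raw = "Ivan Krsti\u0107".encode("utf-8")

# Strict ASCII decoding fails at the first byte outside the ASCII range.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    strict_error = exc  # keep a reference; "exc" is unbound after the block
    print(exc)
    # prints: 'ascii' codec can't decode byte 0xc4 in position 10:
    #         ordinal not in range(128)

# With the surrogateescape handler, each undecodable byte 0xNN is mapped
# to the lone surrogate U+DCNN instead of raising an error.
escaped = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the
# original bytes, so the byte string is recreated exactly.
restored = escaped.encode("ascii", errors="surrogateescape")
print(restored == raw)  # prints: True
```

The escaped string is not meant for display; its value is that it survives a round trip through `str` without losing the original bytes, which is what functions such as `os.fsdecode` and `os.fsencode` rely on for file names.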

Handle non-English languages. Use 3rd party modules. Accept arbitrary text input. You will love Unicode. You will hate Unicode.

[form input] => [Python] => [HTML]: accepts input as text, writes text to an html file. [read from DB] => [Python] => [write to DB]: accepts input as text, writes text to the database. [text files] => [Python] => [stdout]. Ivan Krstić: a string of bytes!

UTF-8 and Unicode FAQ. By Markus Kuhn. This text is a very comprehensive one-stop information resource on how you can use Unicode/UTF-8 on POSIX systems (Linux, Unix).

UTF-8 and Unicode FAQ

You will find here both introductory information for every user and detailed references for the experienced developer. Unicode now replaces ASCII, ISO 8859 and EUC at all levels. It not only enables users to handle practically any script and language used on this planet, it also supports a comprehensive set of mathematical and technical symbols to simplify scientific information exchange. With the UTF-8 encoding, Unicode can be used in a convenient and backwards-compatible way in environments that were designed entirely around ASCII, like Unix.

Contents: What are UCS and ISO 10646? The international standard ISO 10646 defines the Universal Character Set (UCS).
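The backwards compatibility of UTF-8 with ASCII mentioned above can be checked with a short Python sketch. The sample strings and the helper name `utf8_lengths` are assumptions made for this illustration and do not come from the FAQ.

```python
# Illustrative sample strings (assumptions, not from the FAQ).
ascii_text = "Unix"
mixed_text = "na\u00efve \u20ac"  # contains U+00EF and U+20AC

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)  # prints: True


def utf8_lengths(text):
    """Return the number of UTF-8 bytes used by each character of text."""
    return [len(ch.encode("utf-8")) for ch in text]


# ASCII characters take one byte each; other characters take two or more.
print(utf8_lengths(mixed_text))  # prints: [1, 1, 2, 1, 1, 1, 3]

# Every byte of a non-ASCII character has its high bit set, so such bytes
# can never be mistaken for ASCII characters like "/" or NUL.
high_bit_only = all(
    byte >= 0x80
    for ch in mixed_text
    if ord(ch) > 0x7F
    for byte in ch.encode("utf-8")
)
print(high_bit_only)  # prints: True
```

This property is why tools written around ASCII, such as those that split paths on `/`, generally keep working when handed UTF-8 data.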

UCS contains the characters required to represent practically all known languages. ISO 10646 originally defined a 31-bit character set. The full reference for the UCS standard is Level 1.

Utilities: Description and Index. Evertype. Unicode Character Ranges. Code Charts.

Specials: Controls: C0, C1; Layout Controls; Invisible Operators; Specials; Tags; Variation Selectors; Variation Selectors Supplement. Private Use: Private Use Area; Supplementary Private Use Area-A; Supplementary Private Use Area-B. Surrogates: High Surrogates; Low Surrogates. Noncharacters in Charts: Noncharacters in blocks; Range in Arabic Presentation Forms-A; Range in Specials; Noncharacters at end of ...

Code Charts

BMP, Plane 1, Plane 2, Plane 3, Plane 4, Plane 5, Plane 6, Plane 7, Plane 8, Plane 9, Plane 10, Plane 11, Plane 12, Plane 13, Plane 14, Plane 15, Plane 16. Unicode Consortium.
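The plane list above reflects how the Unicode code space is divided: 17 planes of 65,536 code points each, with the BMP as plane 0. As a rough sketch, the plane of a character can be computed from its code point; the function name `plane_of` is hypothetical and chosen here for illustration.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0 for the BMP, up to 16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    # Each plane covers 0x10000 code points, so the plane number is the
    # code point with its low 16 bits shifted away.
    return code_point >> 16


print(plane_of(ord("A")))   # prints: 0  (BMP)
print(plane_of(0x1F600))    # prints: 1  (Plane 1)
print(plane_of(0x10FFFF))   # prints: 16 (last code point, Plane 16)
```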