Unicode
PEP 383 -- Non-decodable Bytes in System Character Interfaces
File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs, however, allow passing arbitrary bytes, whether or not these conform to a certain encoding. This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in a way that allows recreation of the original byte string. The C char type is commonly used to represent both character data and bytes. Certain POSIX interfaces are specified and widely understood as operating on character data; however, the system call interfaces make no assumption about the encoding of these data and pass them on as-is. With Python 3, character strings use a Unicode-based internal representation, making it difficult to ignore the encoding of byte strings in the same way that the C interfaces can.
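The embedding described above is exposed in Python 3 as the `surrogateescape` error handler. The following is a minimal sketch, assuming UTF-8 as the codec; the file name is made up for illustration. Each byte that cannot be decoded is mapped to a lone surrogate code point, and encoding with the same handler restores the original bytes.

```python
# Hypothetical file name encoded as Latin-1; the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# Decode with surrogateescape: the undecodable byte 0xE9 becomes U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into the raw byte.
restored = name.encode("utf-8", errors="surrogateescape")

print(ascii(name))
print(restored == raw)
```

A strict encode of `name` (without the handler) raises `UnicodeEncodeError`, which is how such strings are kept from leaking into output that must be valid UTF-8.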
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)
Never seen this exception? Seen it and sort of fixed it? If you've never seen this before but want to write Python code, this talk is for you. If you've seen this before and have no idea how to solve it, this talk is for you. This is a really confusing error if you don't know what Python is trying to do for you; this talk aims to clarify.
The truth about strings in Python. The magic of Unicode. How to work with Unicode in Python 2: fundamental concept, example code. Glimpse at Unicode in Python 3. Ask lots of questions. Corrections?
Handle non-English languages. Use 3rd party modules. Accept arbitrary text input. You will love Unicode. You will hate Unicode.
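The error quoted above can be reproduced in a few lines. This is a sketch in Python 3 rather than the Python 2 the talk covers, and the byte string is an illustrative assumption: ten ASCII bytes followed by the UTF-8 encoding of U+0107, whose first byte is 0xc4. The usual remedy is to decode bytes with the encoding they were actually written in, work with text inside the program, and encode again on output.

```python
# Illustrative input: ten ASCII bytes, then U+0107 encoded in UTF-8 (0xC4 0x87).
data = "Ivan Krsti\u0107".encode("utf-8")

# Decoding as ASCII fails at the first byte above 0x7F.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    bad_offset = exc.start  # index of the offending byte

# Decoding with the correct codec succeeds and round-trips cleanly.
text = data.decode("utf-8")
round_trip = text.encode("utf-8")

print(error_message)
print(bad_offset, len(text), round_trip == data)
```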
By Markus Kuhn. This text is a comprehensive one-stop information resource on how to use Unicode/UTF-8 on POSIX systems (Linux, Unix). It contains both introductory information for every user and detailed references for the experienced developer.
UnicodeSet
UnicodeSets use regular-expression syntax to allow for arbitrary set operations (Union, Intersection, Difference) on sets of Unicode characters. The base sets can be specified explicitly, such as [a-m w-z], or using Unicode Properties like [[:script=arabic:]&[:decompositiontype=canonical:]]. The latter set gets the Arabic script characters that have a canonical decomposition. The properties can be specified either with Perl-style notation ( \p{script=arabic} ) or with POSIX-style notation ( [:script=arabic:] ).
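The intersection in the example can be approximated with ordinary Python sets and the standard `unicodedata` module. This is only a rough sketch and not the UnicodeSet syntax itself: the standard library has no script property, so a character name starting with "ARABIC " is used here as a stand-in (an assumption that is not equivalent to script=arabic), and the search is limited to the range U+0600..U+06FF. The helper names are hypothetical.

```python
import unicodedata

def is_arabic_named(cp):
    # Assumption: the character name prefix stands in for the script property.
    return unicodedata.name(chr(cp), "").startswith("ARABIC ")

def has_canonical_decomposition(cp):
    # Compatibility decompositions carry a tag such as "<compat>"; canonical ones do not.
    decomposition = unicodedata.decomposition(chr(cp))
    return bool(decomposition) and not decomposition.startswith("<")

arabic_range = range(0x0600, 0x0700)
arabic = {cp for cp in arabic_range if is_arabic_named(cp)}
canonical = {cp for cp in arabic_range if has_canonical_decomposition(cp)}

# Set intersection plays the role of the & operator in the UnicodeSet pattern.
both = arabic & canonical

for cp in sorted(both):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

Union and difference map onto `|` and `-` on the same sets.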
This Web page contains lists of common special entity codes needed in HTML to generate special characters such as ñ, ¢, ÷ and other characters.
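As a small illustration of how such entity codes relate to characters, the sketch below uses Python's standard `html` and `html.entities` modules to convert between the named entities and the three characters mentioned above. The sample string is made up for the example.

```python
import html
from html.entities import codepoint2name

# Named entities to characters.
sample = "&ntilde; &cent; &divide;"
decoded = html.unescape(sample)

# Characters back to named entities, where a name exists for the code point.
def to_entities(text):
    parts = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        # Keep plain ASCII as-is; replace other characters that have a known name.
        parts.append(f"&{name};" if name and ord(ch) > 127 else ch)
    return "".join(parts)

encoded = to_entities(decoded)

print(decoded)
print(encoded)
```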