Unicode

PEP 383 -- Non-decodable Bytes in System Character Interfaces

http://www.python.org/dev/peps/pep-0383/

File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs, however, allow passing arbitrary bytes, whether or not these conform to a certain encoding. This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in a way that allows the original byte string to be recreated.

The C char type is commonly used to represent both character data and bytes. Certain POSIX interfaces are specified and widely understood as operating on character data; however, the system call interfaces make no assumptions about the encoding of this data and pass it on as-is. With Python 3, character strings use a Unicode-based internal representation, which makes it difficult to ignore the encoding of byte strings in the way that the C interfaces can.
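The embedding described in the PEP is exposed in Python 3 as the "surrogateescape" error handler. The following is a minimal sketch of the round trip; the file name is a made-up example, and UTF-8 is assumed as the encoding (on a real system the file system encoding may differ).

```python
# Hypothetical file name: Latin-1 bytes that are not valid UTF-8.
raw = b"caf\xe9.txt"

# Each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN.
decoded = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler recreates the original bytes exactly.
restored = decoded.encode("utf-8", errors="surrogateescape")

# Without the handler, the embedded surrogate cannot be encoded.
try:
    decoded.encode("utf-8")
    strict_encoding_failed = False
except UnicodeEncodeError:
    strict_encoding_failed = True
```

On POSIX systems, os.fsdecode() and os.fsencode() apply this handler with the file system encoding, so the explicit codec name above is only needed for illustration.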

Unicode In Python, Completely Demystified

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)

Never seen this exception? Seen it and sort of fixed it? If you've never seen this before but want to write Python code, this talk is for you. If you've seen this before and have no idea how to solve it, this talk is for you. This is a really confusing error if you don't know what Python is trying to do for you; this talk aims to clarify.

Topics: the truth about strings in Python; the magic of Unicode; how to work with Unicode in Python 2 (fundamental concept, example code); a glimpse at Unicode in Python 3. Ask lots of questions. Corrections?

Handle non-English languages; use 3rd party modules; accept arbitrary text input. You will love Unicode. You will hate Unicode.

http://farmdev.com/talks/unicode/
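The exception quoted in the talk can be reproduced in a few lines. This is a sketch in Python 3 syntax, not code from the talk; the sample bytes and the choice of Latin-1 as their real encoding are assumptions made for illustration.

```python
# Hypothetical input: ten ASCII bytes followed by 0xC4 at position 10.
data = b"Python is \xc4"

try:
    data.decode("ascii")  # ASCII only covers byte values 0 to 127
except UnicodeDecodeError as exc:
    error = exc           # keep a reference; 'exc' is cleared after the block
    message = str(exc)

# The fix is to decode with the encoding the bytes were actually written in.
# Latin-1 is assumed here, where 0xC4 is the letter "Ä".
text = data.decode("latin-1")
```

The exception object carries the failing position in error.start, which is often quicker to inspect than the message string.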

UTF-8 and Unicode FAQ

by Markus Kuhn. This text is a very comprehensive one-stop information resource on how you can use Unicode/UTF-8 on POSIX systems (Linux, Unix). You will find here both introductory information for every user and detailed references for the experienced developer. http://www.cl.cam.ac.uk/~mgk25/unicode.html

http://unicode.org/cldr/utility/index.jsp

Utilities: Description and Index

UnicodeSet: UnicodeSets use regular-expression syntax to allow for arbitrary set operations (union, intersection, difference) on sets of Unicode characters. The base sets can be specified explicitly, such as [a-m w-z], or using Unicode properties like [[:script=arabic:]&[:decompositiontype=canonical:]]. The latter set contains the Arabic script characters that have a canonical decomposition. The properties can be specified either with Perl-style notation ( \p{script=arabic} ) or with POSIX-style notation ( [:script=arabic:] ).
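A rough approximation of that intersection can be sketched with Python's standard unicodedata module. Python's standard library has no script property, so the sketch assumes that a character name starting with "ARABIC" is a good enough stand-in for script=arabic, and it only scans the basic Arabic block (U+0600 to U+06FF); both helper functions are hypothetical names, and the result is not guaranteed to match the UnicodeSet utility.

```python
import unicodedata


def has_canonical_decomposition(ch):
    # Compatibility decompositions carry a tag such as "<isolated>";
    # canonical ones are a plain list of code points.
    decomp = unicodedata.decomposition(ch)
    return bool(decomp) and not decomp.startswith("<")


def looks_arabic(ch):
    # Assumption: the character name is used in place of the script property.
    return unicodedata.name(ch, "").startswith("ARABIC")


# Intersection of the two sets, restricted to the basic Arabic block.
arabic_canonical = {
    chr(cp)
    for cp in range(0x0600, 0x0700)
    if looks_arabic(chr(cp)) and has_canonical_decomposition(chr(cp))
}
```

For example, U+0622 (ALEF WITH MADDA ABOVE) is in the resulting set because it decomposes canonically into U+0627 U+0653, while plain U+0627 (ALEF) is not.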
http://www.unicode.org/charts/

Code Charts


HTML Accent Entity Codes

http://symbolcodes.tlt.psu.edu/web/codehtml.html This Web page contains lists of common special entity codes needed in HTML to generate special characters such as ñ, ¢, ÷ and other characters.
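
Entity codes of this kind can be converted in both directions with Python's standard library. The following is a small sketch using the three characters mentioned above; it is not taken from the linked page.

```python
import html
from html.entities import name2codepoint

# Named entities to characters.
decoded = html.unescape("&ntilde; &cent; &divide;")

# Look up the code point behind a single entity name.
ntilde_codepoint = name2codepoint["ntilde"]

# Characters to numeric entities: every non-ASCII character becomes &#NNN;.
numeric = "ñ ¢ ÷".encode("ascii", errors="xmlcharrefreplace").decode("ascii")
```

Numeric references such as &#241; work regardless of the page encoding, which is why they are a common fallback when a named entity is not available.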