Unicode

PEP 383 -- Non-decodable Bytes in System Character Interfaces

http://www.python.org/dev/peps/pep-0383/ File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs, however, allow passing arbitrary bytes, whether these conform to a certain encoding or not. This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in a way that allows recreation of the original byte string. The C char type is commonly used to represent both character data and bytes. Certain POSIX interfaces are specified and widely understood as operating on character data; the system call interfaces, however, make no assumption about the encoding of these data and pass them on as-is. With Python 3, character strings use a Unicode-based internal representation, making it difficult to ignore the encoding of byte strings in the same way that the C interfaces can.
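The round trip the PEP describes can be sketched with the "surrogateescape" error handler that Python 3 provides. This is a minimal illustration, not the PEP's own example; the file name `raw_name` is made up for the demonstration.

```python
# A hypothetical file name containing a byte (0xFF) that is never valid UTF-8.
raw_name = b"report-\xff.txt"

# Decoding with "surrogateescape" maps each undecodable byte 0xNN to the
# lone surrogate U+DCNN instead of raising UnicodeDecodeError.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns those surrogates back into the
# original bytes, so the byte string is recreated exactly.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")

print(ascii(text_name))
print(round_tripped == raw_name)
```

The interfaces in the `os` module, such as `os.fsdecode` and `os.fsencode`, apply the same handler on POSIX systems, so ordinary code rarely needs to call it directly.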

Unicode In Python, Completely Demystified

Pretend you opened this in a desktop text editor (nothing fancy like vi) and you saved it in UTF-8 format. This might not have been the default.

>>> ivan_uni
u'Ivan Krsti\u0107'
>>> f = open('/tmp/ivan.txt', 'w')
>>> f.write(ivan_uni)
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0107' in position 10: ordinal not in range(128)
>>> ivan_uni
u'Ivan Krsti\u0107'
>>> f = open('/tmp/ivan.txt', 'w')
>>> import sys
>>> f.write(ivan_uni.encode(
...     sys.getdefaultencoding()))
...
Traceback (most recent call last):
...

http://farmdev.com/talks/unicode/
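The session above is Python 2. As a rough Python 3 counterpart, assuming the goal is simply to write the name to a UTF-8 file, the encoding can be passed to `open` explicitly. The temporary directory is an assumption made so the sketch runs anywhere; the talk itself uses `/tmp/ivan.txt`.

```python
import os
import tempfile

ivan_uni = "Ivan Krsti\u0107"

# ASCII cannot represent U+0107, so encoding to ASCII still fails.
ascii_error = None
try:
    ivan_uni.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc

# Naming the encoding when opening the file avoids relying on a default.
path = os.path.join(tempfile.mkdtemp(), "ivan.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(ivan_uni)

# Reading the file back in binary mode shows the UTF-8 bytes on disk.
with open(path, "rb") as f:
    raw = f.read()

print(ascii_error)
print(raw)
```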
By Markus Kuhn. This text is a comprehensive one-stop information resource on how to use Unicode/UTF-8 on POSIX systems (Linux, Unix). It contains both introductory information for every user and detailed references for the experienced developer. http://www.cl.cam.ac.uk/~mgk25/unicode.html

UTF-8 and Unicode FAQ

http://unicode.org/cldr/utility/index.jsp

Utilities: Description and Index

You'll then see the modified pattern. It will often be much larger, but any reasonable regex engine will compile character classes reasonably. Below that, you'll see a sample of how the expression works, using it to find substrings of the sample text and underline them. If you click on any property value in either of these two windows, like 4.0.0.0 for Age, you'll see the characters with that property in the UnicodeSet Demo window.
http://www.evertype.com/

Evertype

Lingua Franca Nova (LFN) is an auxiliary language with a simple, creole-like, and logical grammar. It was created by Dr C. George Boeree of Shippensburg University, Pennsylvania, beginning in 1965. Inspired by the historical Lingua Franca used around the Mediterranean, it takes its vocabulary from Catalan, Spanish, French, Italian, and Portuguese. In 1998, LFN was published on the Internet, and its speakers have continued to develop and improve the language in the years since.
http://www.unicode.org/charts/

Code Charts

BMP, Plane 1, Plane 2, Plane 3, Plane 4, Plane 5, Plane 6, Plane 7, Plane 8, Plane 9, Plane 10, Plane 11, Plane 12, Plane 13, Plane 14, Plane 15, Plane 16

To get a list of code charts for a character, enter its code in the search box at the top. To access a chart for a given block, click on its entry in the table. The charts are PDF files, and some of them may be very large.
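The plane list above follows from how code points are numbered: each plane holds 65,536 code points, so the plane number is the code point divided by 0x10000. A small sketch, where the helper name `plane_of` is an assumption made for this example:

```python
import sys


def plane_of(char: str) -> int:
    """Return the Unicode plane (0 = BMP, 1..16 = supplementary) of one character."""
    # Each plane spans 0x10000 code points, so shifting right by 16 bits
    # yields the plane number.
    return ord(char) >> 16


bmp_plane = plane_of("A")                          # U+0041
emoji_plane = plane_of("\U0001F600")               # U+1F600
cjk_ext_plane = plane_of("\U00020000")             # U+20000
last_plane = plane_of(chr(sys.maxunicode))         # U+10FFFF

print(bmp_plane, emoji_plane, cjk_ext_plane, last_plane)
```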
http://www.unicode.org/

Unicode Consortium

Welcome! The Unicode Consortium enables people around the world to use computers in any language. Our freely-available specifications and data form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of our mission is to educate and engage academic and scientific communities, and the general public.

Encoding Tutorial: Unicode

Although multiple encoding standards have been developed and implemented for multiple scripts, ideally there would be one encoding scheme that covers all the scripts in the world in a standard fashion. Unicode (www.unicode.org) is a global encoding scheme which seeks to include all characters in all scripts in one system. Unicode 4 includes most current national scripts and many CJK characters, but the most recent standards may not be incorporated into all software packages. The most recent operating systems support Unicode, although not all software does. Font and software support for Unicode is still being developed, but you can see some Unicode test pages at: http://tlt.its.psu.edu/suggestions/international/web/encoding/07unicode.html
http://tlt.its.psu.edu/suggestions/international/web/encoding/02ascii.html ASCII stands for "American Standard Code for Information Interchange"; it was the first attempt to provide a character exchange standard. When it was invented in the 1960s, computing limitations restricted the set to 2^7, or 128, characters. For more information on how ASCII was developed, you can read this article at CNN.com.
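The 128-character limit can be checked directly. This is only an illustrative sketch; the variable names are chosen for the example.

```python
# Seven bits give 2**7 = 128 possible values, numbered 0 through 127.
ascii_size = 2 ** 7

# Every one of those values decodes as ASCII.
all_ascii = bytes(range(ascii_size)).decode("ascii")

# A plain letter fits in the set; an accented letter does not.
letter_fits = "A".isascii()
accent_fits = "\u00f1".isascii()  # U+00F1 is the letter n with tilde

print(ascii_size, len(all_ascii), letter_fits, accent_fits)
```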

Penn State Computing with Foreign Symbols

http://tlt.its.psu.edu/suggestions/international/web/encoding/03eightbit.html To increase the number of characters encoded, vendors doubled the range of ASCII to 256 (2^8) characters. This became known as "8-bit encoding". The usual structure is: Crucially, each combination of a letter plus a different accent forms a separate character or code point. For instance, á, â, à, Á, Â, and À are assigned six different numbers in 8-bit encoding. Unfortunately, not all vendors used the same 8-bit encoding.
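Both points can be seen with Python's built-in codecs. The sketch below assumes ISO 8859-1 (Latin-1) and the IBM PC code page 437 as two examples of vendor 8-bit encodings; the tutorial does not name specific ones here.

```python
# The six accented letters from the text, written as escapes for clarity.
accented = ["\u00e1", "\u00e2", "\u00e0", "\u00c1", "\u00c2", "\u00c0"]

# In Latin-1 each letter-plus-accent combination is its own single byte.
latin1_bytes = [ch.encode("latin-1") for ch in accented]

# The same byte means different characters in different 8-bit encodings.
byte = b"\xe1"
as_latin1 = byte.decode("latin-1")  # a with acute accent
as_cp437 = byte.decode("cp437")     # sharp s in the IBM PC code page

print(latin1_bytes)
print(as_latin1, as_cp437)
```

This mismatch between vendors is the problem that a single shared encoding such as Unicode is meant to remove.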

Penn State Computing with Foreign Symbols

Encoding on the Internet (Penn State)

How browsers interpret foreign-language Web sites depends largely on how text is numerically encoded on the Internet. Understanding a little about encoding can help you develop foreign-language Web sites properly. Much of the material in this tutorial was pulled from the following references.
This Web page contains lists of common special entity codes needed in HTML to generate special characters such as ñ, ¢, ÷ and other characters. Full instructions are in the "Using the Codes" section, followed by lists organized by character type. Information on NOTE: If you are composing Web pages in an HTML editor such as Dreamweaver or Microsoft Expression Web, the programs may generate the characters based on what is typed in (check the HTML to be sure).
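As a rough illustration of how such entity codes map to characters, Python's standard `html` module can convert in both directions. The helper `to_entity` is a hypothetical name for this sketch, and the named entities it knows are limited to the HTML 4 set in `html.entities`.

```python
import html
from html.entities import codepoint2name

# Named and numeric entity codes for the characters mentioned above.
encoded = "&ntilde; &cent; &divide; &#241; &#xF1;"
decoded = html.unescape(encoded)


def to_entity(char: str) -> str:
    """Return a named entity if HTML 4 defines one, else a numeric code."""
    name = codepoint2name.get(ord(char))
    return f"&{name};" if name else f"&#{ord(char)};"


named = to_entity("\u00f1")    # n with tilde has a named entity
numeric = to_entity("\u0107")  # c with acute has none in the HTML 4 table

print(decoded)
print(named, numeric)
```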

HTML Accent Entity Codes