Encoding

Web pages authored using hypertext markup language (HTML) may contain multilingual text represented with the Unicode universal character set.

Unicode and HTML

Key to the relationship between Unicode and HTML is the relationship between the "document character set", which defines the set of characters that may be present in an HTML document and assigns numbers to them, and the "external character encoding" or "charset", which is used to encode a given document as a sequence of bytes. In RFC 1866, the initial HTML 2.0 standard, the document character set was defined as ISO-8859-1. It was extended to ISO 10646 (which is basically equivalent to Unicode) by RFC 2070.
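The distinction can be illustrated with a minimal sketch, assuming Python 3 and its standard codecs (the variable names are illustrative and do not come from the original text). A numeric character reference always names a position in the document character set, while the bytes stored for the same character depend on the external character encoding.

```python
import html

# A numeric character reference is resolved against the document character
# set (ISO 10646 / Unicode), regardless of how the file itself is encoded.
euro = html.unescape("&#8364;")   # 8364 decimal is U+20AC, the euro sign
code_point = ord(euro)

# The same character becomes different byte sequences under different
# external character encodings ("charsets").
as_utf8 = euro.encode("utf-8")
as_cp1252 = euro.encode("cp1252")
as_latin9 = euro.encode("iso-8859-15")

print(hex(code_point), as_utf8, as_cp1252, as_latin9)
```

The character's number stays the same in every case; only the byte representation changes with the chosen charset.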

The document character set does not vary between documents of different languages or created on different platforms. The external character encoding is chosen by the author of the document (or the software the author uses to create the document) and determines how the bytes used to store and/or transmit the document map to characters from the document character set.

The so-called MS Windows character set, or Windows Latin 1, contains, in addition to the ISO Latin 1 (ISO 8859-1) characters, some special characters such as the em dash, the trademark symbol, and asymmetric quote characters.
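A small sketch of the difference between the two character sets, assuming Python 3, where the codec named "cp1252" corresponds to Windows Latin 1 and "latin-1" to ISO 8859-1 (the sample bytes are made up for illustration):

```python
# Bytes 0x80-0x9F carry printable characters in Windows Latin 1 (cp1252),
# but are C1 control codes in ISO 8859-1 (latin-1).
raw = b"Acme\x99 \x97 \x93quoted\x94"

as_windows = raw.decode("cp1252")   # trademark sign, em dash, curly quotes
as_iso = raw.decode("latin-1")      # the same bytes become control characters

print(as_windows)
print(repr(as_iso))
```

Decoding the same bytes with the wrong assumption silently yields control characters rather than an error, which is why a reader on another platform may see blanks or garbage.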

On the use of some MS Windows characters in HTML

A Web author who works in a Windows environment may not realize that using such characters creates problems for some users. Typically, if an author naively types a trademark symbol, a browser running on Unix or some other non-Windows system may display a blank instead of the trademark symbol, or something worse. This document explains this problem in some detail and outlines various solutions. The following characters are still somewhat risky in HTML documents: The same applies to the euro sign, as well as to Z and z with caron, with the additional note that since they are additions to the original MS Windows character set, they have caused even more problems than the others.
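One possible workaround, shown here only as a sketch and not necessarily one of the solutions the excerpted document recommends, is to write such characters as numeric character references so that the file itself contains only ASCII. The helper name `to_ascii_html` is hypothetical; the example assumes Python 3 and its built-in `xmlcharrefreplace` error handler.

```python
def to_ascii_html(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference.

    Note: this does not escape markup characters such as & or <.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Trademark sign, euro sign, and Z/z with caron, written as escapes so the
# source of this example stays pure ASCII.
sample = "Acme\u2122 \u20ac \u017d\u017e"
print(to_ascii_html(sample))
```

Because the references name Unicode code points, they do not depend on which encoding the page is served in, although very old browsers may still fail to display the referenced characters.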

The nature of the problems: there is nothing wrong with the characters discussed here.

This document tries to clarify the concepts of character repertoire, character code, and character encoding especially in the Internet context.

A tutorial on character code issues

It specifically avoids the term character set, which is confusingly used to denote repertoire or code or encoding. ASCII, ISO 646, ISO 8859 (ISO Latin, especially ISO Latin 1), Windows character set, ISO 10646, UCS, and Unicode, UTF-8, UTF-7, MIME, and QP are used as examples. This document in itself does not contain solutions to practical problems with character codes (but see section Further reading). Rather, it gives background information needed for understanding what solutions there might be, what the different solutions do - and what's really the problem in the first place.
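The three concepts can be told apart with a short sketch, assuming Python 3 and its standard library codecs (the variable names are illustrative). A single character from the repertoire has one number in the character code, but several possible encodings into octets.

```python
import quopri

char = "\u00e4"     # repertoire member: LATIN SMALL LETTER A WITH DIAERESIS
code = ord(char)    # its number in the Unicode character code: 228 (0xE4)

# Character encodings: different ways of turning that number into octets.
latin1 = char.encode("iso-8859-1")   # one octet
utf8 = char.encode("utf-8")          # two octets
utf7 = char.encode("utf-7")          # ASCII-only representation

# Quoted-printable (QP) is a transfer encoding applied on top of the octets.
qp = quopri.encodestring(latin1)

print(code, latin1, utf8, utf7, qp)
```

The sketch only shows that the layers are separate; it does not address which encoding to choose in practice.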

If you are looking for some quick help in using a large character repertoire in HTML authoring, see the document Using national and special characters in HTML. An octet is a small unit of data with a numerical value between 0 and 255, inclusive.
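As a minimal sketch of what an octet is, assuming Python 3, where a `bytes` object is a sequence of such values (the names below are illustrative):

```python
# Encoding text yields octets; each one is an integer from 0 to 255.
octets = "A\u00ff".encode("iso-8859-1")
values = list(octets)

# A value outside the 0-255 range cannot be stored in a single octet.
try:
    bytes([256])
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(values, out_of_range_rejected)
```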