Character encoding - How to safely compare UTF-8 to ISO 8859-1 (latin1) in PHP.
PHP - htmlentities, htmlspecialchars, and "invalid multibyte sequence".
PHP - How to json_encode array with French accents.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
By Joel Spolsky, Wednesday, October 08, 2003

Ever wonder about that mysterious Content-Type tag?
You know, the one you're supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? ???"? I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. But it won't. So I have an announcement to make: if you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine. And one more thing: In this article I'll fill you in on exactly what every working programmer should know.
A Historical Perspective

The easiest way to understand this stuff is to go chronologically. And all was good, assuming you were an English speaker.

What are the character encodings UTF-8 and ISO-8859-1 rules.
PHP - Json_encode Charset problem.
Have you met ï»¿? Say hello to my BOM.

Byte order mark

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, the BOM should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.[1] Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in.
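To make the byte-order point concrete, here is a minimal Python sketch (Python is used only for illustration; the helper name `bom_bytes` is hypothetical) that encodes U+FEFF in several Unicode encoding forms. The same code point yields different byte sequences, which is what lets a reader infer both the encoding and the byte order from the first few bytes.

```python
import codecs

BOM = "\ufeff"  # the byte order mark, code point U+FEFF


def bom_bytes() -> dict:
    """Return the bytes that U+FEFF becomes in each encoding form."""
    return {
        "UTF-8": BOM.encode("utf-8"),
        "UTF-16 big-endian": BOM.encode("utf-16-be"),
        "UTF-16 little-endian": BOM.encode("utf-16-le"),
        "UTF-32 big-endian": BOM.encode("utf-32-be"),
        "UTF-32 little-endian": BOM.encode("utf-32-le"),
    }


if __name__ == "__main__":
    for name, raw in bom_bytes().items():
        # bytes.hex() with a separator needs Python 3.8 or later
        print(f"{name:22} {raw.hex(' ')}")
    # The standard library exposes the same byte strings as constants.
    assert bom_bytes()["UTF-16 big-endian"] == codecs.BOM_UTF16_BE
```

In UTF-16 the two bytes are simply swapped between the big-endian and little-endian forms (`fe ff` versus `ff fe`). UTF-8 has no byte-order ambiguity, so its three-byte form `ef bb bf` acts only as an encoding signature.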
The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself.

Usage

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (inhibits line-breaking between word-glyphs).

ISO/IEC 8859-1

ISO-8859-1 is the IANA preferred name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429. The following other aliases are registered for ISO-8859-1: iso-ir-100, csISOLatin1, latin1, l1, IBM819, CP819.

Coverage

Each character is encoded as a single eight-bit code value.
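As a rough illustration of the single-byte property, and of why raw bytes in ISO-8859-1 and UTF-8 cannot be compared directly, the following Python sketch encodes the same word both ways. It is written in Python rather than PHP, and the helper name `same_text` is an assumption made for this example. The underlying idea is to decode both inputs to text before comparing.

```python
def same_text(utf8_bytes: bytes, latin1_bytes: bytes) -> bool:
    """Compare two byte strings as text by decoding each with its own encoding."""
    return utf8_bytes.decode("utf-8") == latin1_bytes.decode("iso-8859-1")


word = "café"
as_latin1 = word.encode("iso-8859-1")  # one byte per character: 4 bytes
as_utf8 = word.encode("utf-8")         # 'é' takes two bytes: 5 bytes

if __name__ == "__main__":
    print(len(as_latin1), len(as_utf8))   # 4 5
    print(as_latin1 == as_utf8)           # False: the raw bytes differ
    print(same_text(as_utf8, as_latin1))  # True: the decoded text is equal
    # Decoding UTF-8 bytes as ISO-8859-1 never fails, it just yields mojibake.
    print(as_utf8.decode("iso-8859-1"))   # cafÃ©
```

Because every byte value is valid in ISO-8859-1, decoding with the wrong encoding produces garbage silently instead of raising an error, so the encoding of each input has to be known from context.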
These code values can be used in almost any data interchange system to communicate in many European languages, with a few exceptions due to missing characters.

Quotation marks

For some of these languages the correct typographical quotation marks are missing, as only « », " ", and ' ' are included.

History

ISO 8859-1 was based on the Multinational Character Set used by Digital Equipment Corporation in the popular VT220 terminal.
In 1985 Commodore adopted ISO 8859-1 for its new AmigaOS operating system.

PHP - Elegant way to search for UTF-8 files with BOM.
.NET - XML - Data At Root Level is Invalid.
How to detect BOM in file?
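One possible way to approach the BOM-detection question raised by the last few titles is to read the first bytes of a file and compare them against the known BOM byte sequences. The sketch below is in Python rather than PHP or .NET. The function names `detect_bom` and `detect_bom_in_file` are hypothetical, while the `codecs.BOM_*` constants come from the Python standard library.

```python
import codecs
from typing import Optional

# Check the 4-byte UTF-32 marks first: the UTF-32 LE BOM (ff fe 00 00)
# begins with the UTF-16 LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def detect_bom_in_file(path: str) -> Optional[str]:
    """Read only the first four bytes of the file and check them for a BOM."""
    with open(path, "rb") as f:
        return detect_bom(f.read(4))


if __name__ == "__main__":
    print(detect_bom(b"\xef\xbb\xbf<?xml version='1.0'?>"))  # utf-8
    print(detect_bom(b"plain ASCII text"))                   # None
```

This is a heuristic. A file with no BOM returns `None` even if it is valid UTF-8, and UTF-16 little-endian text whose first character after the BOM is U+0000 would be reported as UTF-32 little-endian.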