
String Encoding (UTF-8 included)


Unicode and Character Sets. Unicode Table. Unicode character table. Unicode/UTF-8-character table. Java - JSF chars get double UTF-8 encoded.

A rough guide to character encoding

It can be tricky figuring out the difference between character handling code that works and code that just appears to work because testing did not encounter cases that exposed bugs. This is a post about some of the pitfalls of character handling in Java.

I wrote a little bit about Unicode before. This post might be exhausting, but it isn't exhaustive.

Unicode in source files

Java source files include support for Unicode. There are two common mechanisms for writing code that includes a range of Unicode characters. One choice is to encode the source files as Unicode, write the characters and inform the compiler at compile time. javac provides the -encoding <encoding> option for this. Code saved as UTF-8, as might be written on an Ubuntu machine:

    public class PrintCopyright {
        public static void main(String[] args) {
            System.out.println("© Acme, Inc. ");
        }
    }

    javac -encoding UTF-8 PrintCopyright.java
    javac -encoding Cp1252 PrintCopyright.java

Unicode and Java data types
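To illustrate why the -encoding value in the two javac invocations above matters, here is a minimal sketch that imitates the mismatch at runtime: the text is encoded as UTF-8 (as the source file was saved) and then decoded as windows-1252 (the canonical name for Cp1252). The class name EncodingMismatchDemo and the helper reinterpret are hypothetical names chosen for this illustration, and the sketch assumes the windows-1252 charset is available in the running JRE. It models only the decoding step, not the compiler itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // \u00A9 is the copyright sign; the escape keeps this file pure ASCII,
        // so the demo itself does not depend on the source file encoding.
        String original = "\u00A9 Acme, Inc.";
        // Assumption: this JRE ships the windows-1252 charset.
        Charset cp1252 = Charset.forName("windows-1252");

        String matched = reinterpret(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mismatched = reinterpret(original, StandardCharsets.UTF_8, cp1252);

        System.out.println(matched.equals(original));                       // true
        System.out.println(original.length());                              // 12
        System.out.println(mismatched.length());                            // 13
        // The two UTF-8 bytes of the copyright sign (C2 A9) decode as two characters.
        System.out.println(mismatched.equals("\u00C2\u00A9 Acme, Inc."));   // true
    }
}
```

The extra character appears because the copyright sign occupies two bytes in UTF-8, and a single-byte encoding such as windows-1252 treats each byte as a character of its own.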

Non-UTF-8 encoding in ZIP file (Xueming Shen's Oracle Blog)

Historically, the ZIP specification did not specify which character encoding to use for embedded file names and comments; the original IBM PC character set, commonly referred to as IBM Code Page 437, was supposed to be the only encoding supported. The Jar specification, meanwhile, explicitly specifies UTF-8 as the encoding for all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefore strictly followed the Jar specification and used UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/Zip files.

Consequence? A ZIP file created by a "traditional" ZIP tool is not accessible to java.util.jar/zip based tools, and vice versa, if a file name contains characters that are not compatible between Cp437 (as an alternative, tools might simply use the default platform encoding) and UTF-8. This is something you might want to keep in mind when you use these new APIs and the new JDK7 bundles. Enjoy the APIs!
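As a hedged illustration of the file name encoding issue described above: since Java 7, ZipOutputStream and ZipInputStream have constructors that take a Charset, which are presumably among the APIs the excerpt refers to. The sketch below writes an entry whose name contains a non-ASCII character using Cp437 and reads it back with the same charset. The class name ZipNameCharsetDemo and the helper roundTrip are hypothetical, and the sketch assumes the IBM437 charset is available in the running JRE.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameCharsetDemo {

    /** Writes one entry named `name` using `cs`, then reads the name back with the same charset. */
    static String roundTrip(String name, Charset cs) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // The Charset argument controls how entry names and comments are encoded.
        try (ZipOutputStream zos = new ZipOutputStream(buffer, cs)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.write(new byte[] {1, 2, 3});
            zos.closeEntry();
        }
        // The reader must be told the same charset to decode the names correctly.
        try (ZipInputStream zis = new ZipInputStream(
                new ByteArrayInputStream(buffer.toByteArray()), cs)) {
            ZipEntry entry = zis.getNextEntry();
            return entry.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: this JRE ships the IBM437 (Cp437) charset.
        Charset cp437 = Charset.forName("IBM437");
        String name = "caf\u00e9.txt"; // contains a non-ASCII character
        System.out.println(name.equals(roundTrip(name, cp437)));
    }
}
```

Passing the charset explicitly on both sides is the point of the sketch: a reader that assumes UTF-8 for an archive whose names were written in Cp437 (or the reverse) cannot be expected to recover such names intact.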
