Codepoints.
Encodings In Strings Are Evil Things (Part 1)
Encodings in Strings are Evil Things (Part 2)
At the end of the last post, we reduced the abstract concept of "string" down to an "ordered sequence of Unicode code points." (We did so by choosing to actively ignore glyph information, but we'll be coming back to it later.) Unicode code points are simply numbers; of course, numbers have to be reduced to binary to be stored in a computer. And someone who is reading a string from a file, or from memory, needs to use the exact same encoding scheme, or we're off in la-la land. And not all encodings are equal. First off, the simplest route. There are 2^31 possible Unicode code points, and an x86 register is 32 bits wide, so let's just add a leading zero bit and encode every code point as a 32-bit unsigned integer!
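To make the fixed-width idea concrete, here is a minimal sketch of encoding code points as 32-bit units and reading them back. The function names `encode_fixed32` and `decode_fixed32` are made up for illustration, and the big-endian byte order is an assumption of this sketch; the excerpt above does not say which byte order is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: write each code point as four bytes,
// most significant byte first (byte order is an assumption here).
std::vector<std::uint8_t> encode_fixed32(const std::vector<std::uint32_t>& code_points) {
    std::vector<std::uint8_t> out;
    out.reserve(code_points.size() * 4);
    for (std::uint32_t cp : code_points) {
        out.push_back(static_cast<std::uint8_t>((cp >> 24) & 0xFF));
        out.push_back(static_cast<std::uint8_t>((cp >> 16) & 0xFF));
        out.push_back(static_cast<std::uint8_t>((cp >> 8) & 0xFF));
        out.push_back(static_cast<std::uint8_t>(cp & 0xFF));
    }
    return out;
}

// Hypothetical helper: the reader must apply the exact same scheme,
// so it reassembles every group of four bytes in the same order.
std::vector<std::uint32_t> decode_fixed32(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % 4 == 0);  // a fixed-width stream has no partial units
    std::vector<std::uint32_t> out;
    out.reserve(bytes.size() / 4);
    for (std::size_t i = 0; i < bytes.size(); i += 4) {
        const std::uint32_t cp = (static_cast<std::uint32_t>(bytes[i]) << 24) |
                                 (static_cast<std::uint32_t>(bytes[i + 1]) << 16) |
                                 (static_cast<std::uint32_t>(bytes[i + 2]) << 8) |
                                 static_cast<std::uint32_t>(bytes[i + 3]);
        out.push_back(cp);
    }
    return out;
}
```

Note how an ASCII letter such as U+0041 ends up as three zero bytes followed by 0x41, which is exactly the waste that motivates looking at other encodings.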
Now, the ISO-10646 guys recognized that the majority of the written languages used on the Internet today can be expressed using a tiny subset of the 2^31 symbols, and it seems a waste to use four bytes for every character if the high bytes are 0 most of the time. Later on, UTF-32 was introduced.
Encodings in Strings are Evil Things (Part 3)
(Before I start: I've gotten a few suggestions about readability, since my two entries thus far have been quite long.
So, entries will now contain a summary at the end with major facts/conclusions, and I'll go back and add them for the first two posts. I'll also try to pace my paragraphs more regularly. Thanks for the advice!) Yesterday, we took the definition of string as an ordered sequence of Unicode code points, and explored various schemes for encoding and decoding code point indices on a binary computer. At the end, we had a new definition for string -- a stream of bits, and some type of information identifying the encoding scheme used to interpret the bits as a stream of Unicode code points. Today, since I'm a coder, we'll be starting a C++ implementation of a string library based on this definition. Before we do that, though, there's one more nasty digression into standards-land that I'd like to take. The loss of direct compatibility with stringbuf is a big pain.
Encodings in Strings are Evil Things (Part 4)
In our last episode, we established that we wouldn't be able to make a true std::string replacement and still handle variable-width encodings. So, we started with the beginning lines of an rmstring class. However, this doesn't mean we are going to dispense with std::string entirely! But first, a quick answer about my choice of names and an explanation about exceptions. A friend of mine asked me yesterday, "Don't you intend to make a basic_rmstring and then have a typedef'd rmstring that hardwires a specific specialization, like ASCII?" In a dream world, we would typedef a partial specialization.
template <class Enc> struct rmstring { typedef basic_rmstring<Enc, vector_of_bytes> type; };
rmstring<iso8859_1>::type str;
Really, both of them are pretty damned ugly; the preprocessor approach is prettier, IMHO, but is also considerably more dangerous.
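For readers who want to see the nested-typedef workaround compile, here is a minimal sketch. Only the `rmstring` wrapper and the names `basic_rmstring`, `vector_of_bytes`, and `iso8859_1` come from the snippet above; the body of `basic_rmstring`, the `append_byte` and `byte_size` members, and the definition of `vector_of_bytes` as a `std::vector<unsigned char>` are placeholders invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Placeholder encoding tag and backing store (assumptions for this sketch).
struct iso8859_1 {};
typedef std::vector<unsigned char> vector_of_bytes;

// Placeholder two-parameter string template: encoding plus backing store.
template <class Enc, class Store>
class basic_rmstring {
public:
    typedef Enc encoding_type;
    typedef Store store_type;

    void append_byte(unsigned char b) { store_.push_back(b); }
    std::size_t byte_size() const { return store_.size(); }

private:
    Store store_;
};

// The workaround: a struct template whose nested typedef fixes the
// backing store while leaving the encoding open.
template <class Enc>
struct rmstring {
    typedef basic_rmstring<Enc, vector_of_bytes> type;
};

// The nested typedef names exactly the full specialization.
static_assert(std::is_same<rmstring<iso8859_1>::type,
                           basic_rmstring<iso8859_1, vector_of_bytes> >::value,
              "rmstring<Enc>::type should name basic_rmstring<Enc, vector_of_bytes>");
```

The cost is visible at every use site: callers must write `rmstring<iso8859_1>::type` rather than a plain `rmstring<iso8859_1>`, which is the ugliness the text complains about.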
There's something to keep in mind, though. The other exception, malformed_data, comes up when we try to decode a buffer that has an error in it.
Encodings in Strings are Evil Things (Part 5)
In our last episode, we briefly discussed possible behaviors for encoding_cast, and we discussed how the STL's basic_string class was structured -- namely, we noted that it had several core functions that were overloaded many times for various types of input. We also noted that we could avoid many of the implementation headaches that result, because of our decision to generalize our backing store. One of my coworkers pointed out that Herb Sutter had already done an excellent dissection of basic_string in Exceptional C++ Style -- and, indeed, the last four chapters of the book are spent analyzing its structure, breaking it down to the core functions, and then implementing many of the functions and overloads as non-member template functions.
However, he's not looking to improve basic_string's foundation -- he's merely explaining how reducing the number of methods in basic_string makes the code much easier to maintain. Before, we listed the methods that seemed worthwhile to carry over.
Encodings in Strings are Evil Things (Part 6)
First, I apologize for not updating recently -- at work, my dev machine's power supply died, and took my hard drive with it.
Luckily, I had everything backed up; however, I had to copy everything over to, and work on, a single-monitor Longhorn dogfood box with no major apps installed. This went on for a week and a half while I waited for Dell to slog through the warranty process for new parts and have them installed by a Dell-authorized tech (in order to keep the warranty going) and this put me behind schedule for several deadlines. So, now that my dev machine has a new PSU and HDD I've been frantically working to get caught up on things, and this has left little time for the blog. In about two weeks these deadlines will be behind me, and I can start posting with regularity again. My solution is to return a proxy object, MultiByteChar.
When I initially decided on this, one of my coworkers pointed out that I would run into the same problem as vector<bool>.
Encodings in Strings are Evil Things (Part 7)
Eugh. Due to a three-part punch of piling-up work, time with family over the holidays, and being thoroughly sick, I haven't had much time to work on rmstring -- which means, of course, that this hasn't updated. I haven't given up on it though! (I'm not dead! I don't want to go on the cart...) So, on to business. One annoyance that I've found is pointer type conversions; imagine that you've allocated a byte array for recv()ing something in from a TCP socket. Thus, I've opted for the simplest solution: a huge comment in the code that says "These functions assume that the backing store's data() pointer is suitably aligned for Stride-sized accesses and that size() is a multiple of Stride's size.
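The two assumptions named in that comment can also be spelled out as a runtime check. The following is only a sketch of what such a check might look like; `is_stride_compatible` is a hypothetical helper that is not part of rmstring, and the posts themselves settle for the comment rather than a check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true when a byte buffer can safely be
// read in Stride-sized units, i.e. when both stated assumptions hold.
template <class Stride>
bool is_stride_compatible(const void* data, std::size_t size_in_bytes) {
    // Assumption 1: the pointer is suitably aligned for Stride accesses.
    const std::uintptr_t address = reinterpret_cast<std::uintptr_t>(data);
    const bool aligned = (address % alignof(Stride)) == 0;

    // Assumption 2: the byte count is a whole number of Stride units.
    const bool whole_units = (size_in_bytes % sizeof(Stride)) == 0;

    return aligned && whole_units;
}
```

A caller holding a byte array filled by recv() could test it with `is_stride_compatible<std::uint32_t>(buffer, length)` before reinterpreting it as 32-bit units, and fall back to copying into an aligned buffer when the check fails.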
I've also had occasion to rethink my plans for encoding_cast. With the above, the originally envisioned encoding_cast is now just syntactic sugar for a call to the source string's member transcode() function. (Since this was mostly a "what happened while I was gone" article, no point summary.)
Encodings in Strings are Evil Things (Part 8)
As more Unicode encodings are being finished, I find myself wanting to actually start using rmstring in real situations.
However, most of my "real situations" involve legacy encodings. So, I need to start cracking on transcoding. The first concern is allowing adapters for arbitrary transcodings. A tricky problem that's related to transcoding is collation (aka sorting) -- most people aren't aware that sorting strings is often a locale-dependent issue. This is a localization problem.
In the case of sorting, a binary sort is often not enough. Where do accented characters sort -- the same as their base characters, or after? For this reason, developers using rmstring on Win32 platforms will almost certainly want to use a sorting predicate based on Win32's CompareString or LCMapString APIs. Anyways, similar issues arise for transcoding.
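To illustrate why a binary sort is often not enough, here is a small portable sketch that contrasts a plain code point comparison with a predicate that sorts accented letters next to their base letters. It is a toy stand-in, not what CompareString or LCMapString do: the folding table in `base_letter` is hypothetical and deliberately tiny, and all function names are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, deliberately incomplete folding table: maps a few
// accented Latin letters to their base letters.
char32_t base_letter(char32_t c) {
    switch (c) {
        case U'\u00E0': case U'\u00E1': return U'a';  // à, á
        case U'\u00E8': case U'\u00E9': return U'e';  // è, é
        default: return c;
    }
}

// Primary sort key: the string with every letter folded to its base.
std::u32string primary_key(const std::u32string& s) {
    std::u32string key;
    key.reserve(s.size());
    for (char32_t c : s) key.push_back(base_letter(c));
    return key;
}

// Binary sort: compares raw code point values.
bool binary_less(const std::u32string& a, const std::u32string& b) {
    return a < b;
}

// One possible policy: order by base letters first, and only when the
// base letters are identical let the unaccented form come first.
bool accent_folding_less(const std::u32string& a, const std::u32string& b) {
    const std::u32string ka = primary_key(a);
    const std::u32string kb = primary_key(b);
    if (ka != kb) return ka < kb;
    return a < b;
}
```

With the words "zebra", "éclair", and "eagle", `binary_less` puts "éclair" last because U+00E9 is numerically greater than every ASCII letter, while `accent_folding_less` places it between "eagle" and "zebra". Which answer is right depends on the locale, which is the point of the passage above.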
These functions now put off transcoding to the Engine object, whatever that may be. I've been working on the DFA for the last few days.
Unicode Consortium.
The Unicode Consortium Discussion Forum.
Shaping Engines.
Scripts.
Richard Ishida.
PUA.
CDL Character Description Language.
On “CJK Unified Ideograph”: an apology
In Unicode/ISO parlance, certain blocks of 漢 Hàn characters are called “CJK Unified Ideographs”. CJK (a trademark of the RLG) stands for “Chinese, Japanese, and Korean”, and is sometimes extended to CJKV “Chinese, Japanese, Korean and Vietnamese” (and it could be extended further, to include all IRG contributors).
Scripts in all of these locales make use of CJKV (Chinese-derived) characters. These characters are “Chinese-derived” in that the principles for character creation originated in China (more than 3,000 years ago). These characters are sometimes also termed 漢 (“Hàn” as in the name of Unicode’s Hàn database [a.k.a. Like Hàn, the term ideograph (sometimes also [mis-]written “ideogram”) is today used in information-technology (info-tech) circles to signify ‘the uniquely CJKV script entity’, which is to say, “CJKV ideographs” constitute a certain subset of the “characters” to be found in Asian texts. In terms of a “character” vs.