background preloader

Scripts

Facebook Twitter

The Alphabets of Europe. Michael Everson Version 3.0 Contents 0 Introduction1 Scope1.1 The geographical area of Europe1.2 The languages of Europe1.3 The scripts of Europe2.0 Definitions3.0 Criteria for evaluation4.0 Writing systems5.0 Repertoires 5.1 Punctuation 5.2 Digits and numbers Annex A Administrative units of Europe Annex B European Sign Languages Annex C Changes from previous versions 0 Introduction The Alphabets of Europe provides a source of linguistic data for the indigenous languages of Europe.

The Alphabets of Europe

The use of the term indigenous (or autochthonous) indicates that this report covers the languages native to the European geographical area. The main function of these pages is to present a catalogue of European alphabets. The Alphabets of Europe could not have been compiled without the input of many, many people, and the difficult nature of the material presented here begs for explicit acknowledgement of the abundant expertise which has been contributed. BabelStone. Unicode Character Names Part 1.

The one thing about Unicode that really seems to bug people more than anything else is that the character names are not always perfect, are sometimes misleading, and in a few cases are just plain wrong.

Unicode Character Names Part 1

All Unicode characters have an official name which is used to uniquely identify them (but see Note 1 below the table). The 71,226 CJK ideographs have algorithmically derived names based on their code point (e.g. CJK UNIFIED IDEOGRAPH-4E00 for U+4E00), and the 11,172 Hangul syllables have algorithmically derived names based on their phonetic composition (e.g. HANGUL SYLLABLE GAH for U+AC1B, which is composed of the three jamo letters G, A and H). The remaining 15,257 characters have hand-crafted names, and it is perhaps not suprising that a few mistakes have crept in from time to time. Unicode Character Names Part 2. As discussed in my post on Good.

Unicode Character Names Part 2

Bad and Ugly Character Names, there are some Unicode characters with wrong or misleading names. Some people get very worked up about bad character names (or names that they perceive to be bad), and insist that Unicode must change the name. However, for reasons of stability with other standards, which may refer to Unicode characters by name rather than code point, character names once assigned cannot under any circumstances be changed. Nevertheless, the names for 1,944 characters introduced in Unicode 1.0 are different from their current names (in the vast majority of cases the changes are very minor), but these name changes were required by the merger between the developing Unicode and ISO/IEC 10646 standards in 1993.

Unicode Character Names Part 3. As discussed in Part 1 there are some unfortunately misnamed characters, and as discussed in Part 2 Unicode character names once assigned can never be changed, and so misnamed characters are stuck with their names whether they like it or not.

Unicode Character Names Part 3

Whilst the characters themselves may or may not mind what they are called, characters with wrong names cause untold anguish to some people. Until now there has not been very much that can be done about the problem, and if anyone complains about a particular character name all that Unicode can advise them is that character names are intended as unique mnemonic identifers and should not be relied on for identification of a character's function or meaning — which is unfortunate as most character names can be relied on for this purpose.

There are two important points to be made about these formal aliases. Secondly, formal aliases are completely different to the aliases already provided in the Unicode code charts. Precomposed Tibetan Part 1. This post really ought to have been Part 3 of a History of Tibetan Encoding in Unicode, but Michael Kaplan's recent posts on the proposed alternative syllabic encoding of Tamil here and here have encouraged me to take a look at the latest twist in the saga of Tibetan encoding before I visit its early history of false starts and lost opportunities.

Precomposed Tibetan Part 1

Tibetan is not a difficult script to read or write, but it is a very complex script to deal with in terms of computer processing (as far as complexity goes I would rate it second only to the Mongolian script). The problem is that written Tibetan comprises complex syllable units (known in Tibetan as a tsheg bar ཚེག་བར) which although written horizontally may include vertical clusters of consonants and vowel signs agglutinating around a base consonant (a vertical cluster is known as a "stack"). Thus most words have a horizontal and a vertical dimension, with the result that text is not laid out in a straight line as in most scripts.

Precomposed Tibetan Part 2. As discussed in Part 1, in 2002-2003 China tried and failed to get nearly a thousand precomposed Tibetan characters encoded in ISO/IEC 10646 (which is the international standard corresponding to Unicode).

Precomposed Tibetan Part 2

Following on from this humiliating defeat, in April of 2004 Joe Zhang (Zhang Zhoucai 张轴材), formerly a contributing editor of ISO/IEC 10646, presented to a conference in China a paper that outlined a new Chinese encoding standard for Tibetan, codenamed the "Everest Scheme". This scheme utilizes the Private Use Areas (PUA) of the UCS to encode several thousand precomposed Tibetan characters, and was characterised as a "national standard within the framework of an international standard". Phags-pa Fonts 1. Part of the 1345 Phags-pa Sanskrit inscription at Juyongguan To celebrate the recent encoding of the Phags-pa script in Unicode 5.0 I am releasing some free OpenType Unicode Phags-pa fonts that fully implement Phags-pa shaping behaviour (for which see Section 10.3 of The Unicode Standard version 5.0).

Phags-pa Fonts 1

These fonts work correctly under Windows XP and later, although I have had to set the script tag in the OT tables to <latn> rather than <phag> in order to trick Uniscribe into applying the font's OpenType features [insert long and bitter anti-Uniscribe rant here]. I'm afraid that I have no idea whether these fonts will work on platforms other than Windows.

Phags-pa Shaping Behaviour. Phags-pa Alternate Letters YA, SHA, HA and FA. After having discussed the basics of Phags-pa Shaping Behaviour, I thought it would be a good idea to discuss one final piece of essential Phags-pa knowledge before moving on to other things in the new year.

Phags-pa Alternate Letters YA, SHA, HA and FA

That is, the four "alternate" letters for YA, SHA, HA and FA used in one particular Chinese Phags-pa text : U+A86D ꡭ PHAGS-PA LETTER ALTERNATE YAU+A86E ꡮ PHAGS-PA LETTER VOICELESS SHAU+A86F ꡯ PHAGS-PA LETTER VOICED HAU+A870 ꡰ PHAGS-PA LETTER ASPIRATED FA The Yuan dynasty rhyming dictionary Menggu Ziyun 蒙古字韻 is one of the most important Phags-pa texts, as it gives the pronunciation in Phags-pa script for over 9,000 Chinese characters.

However, the author (or editor of the only extant 1308 edition) attempts the impossible task of reconciling the pronunciation of the proto-Mandarin Chinese spoken at the time with the traditional phonetic classification of Chinese into thirty-six initials that had been developed during the Tang and Song dynasties several hundered years earlier. Notes. Phags-pa Fonts 2. I have now released two Tibetan style Phags-pa fonts, unimaginatively named "BabelStone Phags-pa Tibetan A" and "BabelStone Phags-pa Tibetan B", which may be downloaded from here.

Phags-pa Fonts 2

BabelStone Phags-pa Tibetan A BabelStone Phags-pa Tibetan B These fonts are both modelled on Phags-pa script primers such as the example below (with Lantsa, Tibetan, Mongolian, Chinese and Cyrillic) obtained by a Buriat Cossack officer, Tsokto Garmeyevich Badmazhapov, in 1903. The only difference between the two fonts is that the letters THA [U+A849], MA [U+A84F], TSHA [U+A851], DZA [U+A852], WA [U+A853], ZHA [U+A854], SHA [U+A85A] and Subjoined RA [U+A871] have slightly different letterforms. Source : Nicholas Poppe, The Mongolian Monuments in ḤP'AGS-PA Script (Otto Harrassowitz, 1957) page 16. The Tibetan style of Phags-pa script is squarer and more compact than the original script used during the Yuan dynasty, but like its predecessor it is still written in vertical columns running from left to right.

重修崇慶院之記. Phags-pa Fonts 3. The final Phags-pa font that I am releasing for the time being is BabelStone Phags-pa Seal, which is a "seal script" style font that can be downloaded from here.

Phags-pa Fonts 3

This has been a very difficult font to implement, and I am not terribly satisfied with the results, but rather than labouring on it any further I will release it now and maybe work some more on it at some future date. As with other scripts of this period that were constructed for writing non-Chinese languages (e.g. Tangut and Khitan), Phags-pa has a special pseudo-archaic "seal script" style of lettering that imitates the often convoluted forms of Chinese characters that are used on Chinese seals.

This form of Phags-pa lettering is almost exclusively restricted to use on seals (mostly official seals from the Yuan dynasty), although it is also used for the title (碑額) on some Yuan dynasty monumental inscriptions. The next question was how to make the alternate glyph forms of each character available to the user. More about Phags-pa Seal Script. As I mentioned recently, the rhyming dictionary Menggu Ziyun 蒙古字韻 provides a list of seal script forms of Phags-pa letters. As this is the only source of information about Phags-pa seal script letters other than actual Yuan dynasty seals it is obviously a very important resource.

However, the table of letters is quite confusing, and so this is my attempt to make some sense of it all. Source : Exact facsimile transcription of Menggu Ziyun in Luo Changpei 羅常培 and Cai Meibiao 蔡美彪, Basibazi yu Yuandai Hanyu 八思巴字與元代漢語 (Beijing: Kexue Chubanshe, 1959) -- for those interested a revised edition of this important work was published in 2004. The first problem with this table is the order of the letters in it. The table below rearranges the original table of letters, with each group of letter forms on a separate row. A Brief History of CJK-C. 騰蛇游霧,飛龍乘雲,雲罷霧霽,與蚯蚓同,則失其所乘也。 My friend Asmus Freytag (who has just retired from active participation in Unicode after many years of dedication to Unicode and WG2) recently bemoaned the total lack of interest in CJK-C on the public Unicode mailing list.

Whilst it is true that there has been little overt interest in the latest addition to the already huge collection of CJKV ideographs in Unicode , behind the scenes a lot of people have been working very hard on reviewing the CJK-C repertoire and resolving issues, and it has generated (and is continuing to generate) a huge volume of email traffic. The Ideographic Rapporteur Group (IRG) One aspect of the encoding process that I deliberately avoided in my post on Unicode and ISO/IEC 10646 is how new CJK ideographs get added to the standards. CJK Unified Ideographs : To Infinity and Beyond. It has been remarked now and then that Unicode basically consists of an innumerable number of Han thingies to which assorted non-Han detritus has attached itself. And this does seem to be borne out from the figures : Looking 10 years or so into the future, after the encoding of CJK-C, CJK-D, CJK-E and CJK-F, as well as Old Hanzi, even after taking into account large non-Han scripts such as Egyptian Hieroglyphs (~1,000), Tangut (~6,000) and Jurchen (~1,000), it is likely that the Han percentage will still be around 75% of the entire Unicode repertoire (this is assuming that Old Hanzi are classified as belonging to the Han script, which is not entirely certain).

It could also be said that Han ideographs are the driving force behind Unicode. The Han Script Not included within the Han script are CJK Strokes and Ideographic Description Characters, which are both classified as "common" by Unicode. But, however many ideographs are encoded, it always seems possible to find yet more to encode. CJK-B Case Study #1 : U+272F0. The CJK Unified Ideographs Extension B [13MB] block that was added to Unicode/10646 in 2001 comprises 42,711 characters, and it is no secret that there are many problems with this huge collection of mostly quite rare characters, including hundreds of cases of unifiable characters that have been erroneously encoded separately and even a handful of completely duplicate characters. There is enough material to keep a dedicated CJK-B blogger busy for years to come, but I certainly don't want to go down that particular path.

However, a recent psot by Michael Kaplan, How bad does it need to be in order to be not good enough, anyway? , about discrepancies in Han character stroke counts provided by China on the one hand and Taiwan on the other set me investigating one particular character, and its history is convoluted enough to be worth writing up as a case history. The Secret Life of Variation Selectors. Prototyping Tangut IMEs, or Why Windows 7 Sucks. Diacritics Project. EKI. Greek Unicode. These pages discuss issues to do with Greek and Unicode that I have come across over the years, both in my capacity as research associate of the Thesaurus Linguae Graecae and independently. This is not of course the last word on Greek and Unicode, nor indeed a comprehensive guide. That honour belongs, in the first instance, to: P.T.O'Rourke's Unicode Polytonic Greek for the World Wide Web (UPGW3), hosted by the Stoa Consortium.

Polish Diacritics: how to? ScriptSource. Serbian Cyrillic Letters BE, GHE, DE, PE, TE. Undefined. Sorting it all Out. Can I get my characters into Unicode? Super CJK. Tibetan Machine Uni.