Multilingual text on Linux

1. Unicode

Traditionally, texts in various languages have been stored using language-specific encodings, for instance Latin-1 (1 byte per character) for West-European languages with accented letters, KOI-8 for Russian, or EUC-JP (2 bytes per character) for Japanese.

Only very limited ‘mixing’ of languages (using them together in the same document) is possible in these systems. You can combine English with Japanese, but only because English normally does not use any accented letters, so only ASCII values below 0x80 have to be used. Mixing European accented letters with Japanese is virtually impossible, let alone adding other languages. With KOI-8 it is the same: Russian can be mixed only with un-accented Latin letters.

The only solution is to have a character set with characters for all the world’s languages. This exists and is called Unicode. In Unicode, each character is assigned a number in the range 0 – 10FFFF hex (more than a million possible characters). This one-million-character space is divided into 17 ‘planes’ of about 65000 characters each. The first plane (the character range 0 – FFFF hex) is called the ‘Basic Multilingual Plane’ (BMP) and contains most characters which are used at all frequently.

Making your computer handle all languages is called ‘multilingualisation’ or M17N. Interest in M17N is still limited. Most people get by with mono-lingual ‘legacy’ systems, set for their own country’s language.

However, having your system enabled for M17N by means of Unicode is really very convenient. Once it is set up, messing about with code settings (ISO-8859-1 or Windows-1252? Shift-JIS or EUC?) is a thing of the past. You can (mostly) forget about ‘locales’. Unicode is definitely ‘the wave of the future’.

In Unicode, the numbers 0-FF hex (0-255 decimal) are used for the Latin-1 character set (of which the numbers 0-7F are the traditional ASCII characters). After that, there are codes for all sorts of accented Latin letters beyond Latin-1, for Russian, Greek, Arabic, Hiragana, Katakana, and much more. For instance:

41 hex = A
E9 hex = é (e with acute accent)
14D hex = ō (o with macron)
3A3 hex = Σ (Greek capital Sigma)
306F hex = は (hiragana HA)
5229 hex = 利 (kanji RI)
B300 hex = 대 (Korean syllable DAE)
92C+93F hex = बि (Devanāgarī BI; may display, or print, incorrectly)

The Unicode standard defines numbers only for the characters themselves, not for the way the characters are displayed (for instance bold, italic, monospaced; in which font; and at what size). It is a pure text standard. Also, the Unicode standard defines just the numbers (as mathematical entities); it does not specify how the numbers should be encoded, i.e. converted into bytes for transmission and storage. ‘Encoding’ can be done in a surprisingly large number of ways. The two most common (but by no means the only) encodings for Unicode characters are called UCS-2 and UTF-8.

If you save a text in Openoffice by means of ‘Save As, Text Encoded, Unicode’ it is in fact saved as UCS-2.

In UCS-2 every character has a fixed length (two bytes). Only characters in the range 0 – FFFF (i.e. the BMP) can be encoded, but this is enough for many practical purposes. Every character is sent as follows:

  • for characters 100 hex and higher: the most significant byte (of the hex character number) first, then the other byte.
  • for the Latin-1 character set: first a zero byte, then the character value.
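You can see this byte layout for yourself with iconv and a hex dump (a minimal sketch; it assumes the common iconv and xxd utilities and a UTF-8 terminal):

printf 'Aé' | iconv -f UTF-8 -t UCS-2BE | xxd

This prints 0041 00e9: the ‘A’ becomes the bytes 00 41, and the ‘é’ becomes 00 E9.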

A variant of UCS-2, called UTF-16, allows encoding the whole Unicode character set.

There are some disadvantages in using UCS-2 when texts contain a lot of ASCII, because each ASCII byte is accompanied by a zero byte. In the first place this doubles the size of ASCII files. Also, you cannot grep (i.e. search) for ASCII words in a UCS-2 file. Furthermore, in traditional C programming, a zero byte marks the end of a string. So UCS-2 strings tend to seem rather short to C library functions like strlen(). Of course there are ways to get around this. Microsoft uses UCS-2 for internal character representation in Windows, and several Linux programs do this (internally) as well.

For a more detailed explanation of UCS-2, UTF-8, and other encodings, see Markus Kuhn’s Unicode FAQ.

The other often-used system is UTF-8. Openoffice also allows you to save a text as UTF-8. This sends traditional ASCII (below 80 hex) just as it is. The higher Unicode characters are sent using more (max. 4) bytes, each with the high bit set. For instance, accented European letters take 2 bytes, Japanese characters take 3 bytes. By using 4 bytes, the whole Unicode character set (up to the theoretically maximum character value, 10FFFF hex) can be encoded. For an explanation of how this exactly works, see RFC 3629.
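You can verify the byte counts in any UTF-8 terminal (a sketch; assumes the xxd utility, and the kanji is an arbitrary example):

printf 'é' | xxd     # c3 a9      : 2 bytes
printf '東' | xxd    # e6 9d b1   : 3 bytes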

One of the advantages of UTF-8 is that pure ASCII files are automatically also valid UTF-8 files. Another is that you can use UTF-8 strings in C programs without special declarations of any kind.

It is generally accepted (by everybody, including Microsoft) that UTF-8 is the best way to encode Unicode texts for transmission between computers. In HTML pages, UTF-8 byte sequences can be used ‘as is’, provided the HEAD section contains a line like

<meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8">

It is also possible to put Unicode characters in an HTML file in the form of ‘decimal entities’ (so that 東 is written as &#26481;). The number is just the Unicode number for the character, written in decimal. This can be used in Web pages in which the ‘main character set’ is not Unicode. But this seems rather pointless; why not make the character set for all your pages just Unicode itself? All browsers nowadays understand it.

UTF-8 also appears to be the best for internal storage on Unix/Linux systems, if only because Unix/Linux uses a lot of scripts and config files written in ASCII.

 

2. UTF-8 support under Linux

In what follows I investigate the state of UTF-8 support under Linux (especially Debian, the newest version, called ‘unstable’ or ‘Sid’) from a user point of view. ‘Support’ has several aspects:

  1. Applications must be able to accept UTF-8 strings and files, and must be able to display them.
  2. You should be able to perform keyboard input, in a variety of languages, easily.
  3. Printing UTF-8 documents, web pages, e-mail messages, or whatever, should work.
  4. Copying (by selection from the screen) and pasting from one application to another should work.

In all these areas, MS Windows has performed pretty well for years. If you install the correct fonts, and Microsoft’s ‘Global IME’ components for input, your computer becomes, in effect, multilingual.

Modern Linux distributions are also completely multilingual, thanks to UTF-8. Strangely enough, UTF-8 is not the default on some systems (as I found out recently when I installed Ubuntu Hardy on a new computer), but when UTF-8 is made the default it works perfectly. Of course if you want to display ‘exotic’ scripts you need appropriate fonts.

Ubuntu already includes a lot of ‘exotic’ fonts by default; in Debian itself, you must, in general, install the non-Western font packages separately.

Problems which used to exist, e.g. with complicated (‘Complex Text Layout’) scripts like Devanāgarī, have disappeared. xterm, and almost all text-based utilities like less, now support UTF-8 (there are still a few exceptions). Apart from xterm a few other terminal emulators (mlterm, konsole, urxvt,…) also understand UTF-8 (some do not, like aterm, Eterm).

Copying & pasting works ‘from everywhere to everywhere’ with modern (2004+) Linux versions.

Printing UTF-8 is no problem in modern versions of Linux; however, if your distribution fails to print UTF-8 pure text files correctly, it may be useful to look at section 9.1.1 of this document (about paps). Printing from Mozilla products used to be problematic, but the problems have disappeared since the arrival of Firefox 3.0.

True multi-lingual keyboard input à la MS Global IME became possible in 2004. No less than three systems (‘input method frameworks’) can provide this now: uim, scim, and IIIMF. The ‘multilingual input problem’ on Linux can now be considered solved. In this document I only describe uim and scim in any detail.

There are some excellent mailing lists about UTF-8 support on Linux:

  • linux-utf8
  • opensuse-m17n

 

3. Making Unicode the default on your Linux system

Every Linux distribution, unfortunately, has its own distribution-specific way to make Unicode the default. Many distros do it already ‘out of the box’. It seems that the ‘Etch’ edition of Debian does this also.

In Debian versions where UTF-8 is not the default, until recently you had to put in /etc/environment something like:

LANG=en_GB.UTF-8

However, in April 2006 in Debian ‘testing’ and ‘unstable’, a new method was introduced: you must call update-locale as root, e.g.:

update-locale LANG=en_GB.UTF-8

The ‘locale’ value will now be stored in /etc/default/locale (instead of /etc/environment). Normally update-locale will be run automatically when you install or upgrade; a dialog box will ask what you want the default locale to be. Running update-locale (or editing /etc/default/locale) is only necessary when you want to change the locale. The following points are important:

  • The value that the LANG variable is set to is called the ‘locale’ of your system. You should only choose values which are enabled on your system; locales are enabled by editing /etc/locale.gen and running locale-gen afterwards, or (in Debian) by running dpkg-reconfigure locales (a short sketch follows this list).
  • You can find out which locales are enabled on your system by calling locale -a.
  • The part of the locale after the dot (the UTF-8 part) makes your system a ‘panlinguistic’ Unicode system.
  • Before the dot there is a language indicator, in this case en_GB. It is used by properly ‘internationalised’ programs to display messages in the correct language, and to display dates, numbers, man pages etc., in the correct language and format. Also, it seems Mozilla uses it to determine the default font for displaying Web pages. For ‘Western’ languages like en_GB, the ‘Western’ font will be used. It doesn’t seem to matter much which ‘Western’ language is specified; for Mozilla, e.g. de_DE.UTF-8 is, as far as I can see, equivalent to en_GB.UTF-8. I may be wrong, of course.
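For example, on a Debian-style system, enabling and checking a locale goes roughly like this (a sketch; run as root):

editor /etc/locale.gen      # uncomment the line 'en_GB.UTF-8 UTF-8'
locale-gen                  # (re)generate the enabled locales
locale -a                   # verify: en_GB.utf8 should now be listed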

Possible memory problem when generating UTF-8 locale

locale-gen and dpkg-reconfigure locales both call another program, localedef, which consumes lots of memory when working with a UTF-8 locale. No other program on a Linux system is as memory-hungry as localedef; Mozilla and Openoffice are nothing by comparison. You need at least 100 MB of free memory (that is, RAM and swap memory together) to run it.

If you experience problems (i.e., if you see a message that localedef has been ‘killed’ by the kernel), you should (temporarily) create extra memory by means of a swap file. Google provides many links to sites which explain how to do this.
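For the record, creating a temporary swap file goes roughly like this (a sketch, as root; the path and size are arbitrary examples):

dd if=/dev/zero of=/swapfile bs=1M count=256   # a 256 MB file of zeroes
mkswap /swapfile                               # format it as swap space
swapon /swapfile                               # activate it
# ... now run locale-gen or dpkg-reconfigure locales ...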

After the locale has been successfully generated, memory swapping to the swap file can be switched off (swapoff command) and the swap file deleted. You must then reboot, because RAM memory is not properly freed by localedef, and many things will go wrong until your computer is restarted. I can’t understand why nobody fixes this program, but the gurus concerned have the attitude ‘well, you should just get another computer with more memory’ (see, for instance, this thread).

Most programs nowadays run well inside a UTF-8 environment, but occasionally you may find one which does not (texmacs seems to be an example). Such ‘problematic’ programs are gradually becoming UTF-8 aware; for instance, the file manager mc (‘Midnight Commander’) did so in 2006. Also, for a while I had problems with man, but in the course of upgrading my system they have disappeared. All standard utilities seem to be alright now; they discover (through the locale) that they are in a UTF-8 environment, and act accordingly. For instance wc -m (which counts characters in a text) counts a Japanese UTF-8 character (3 bytes) as one character.
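You can check this in a UTF-8 locale (the three kanji below are an arbitrary example):

echo -n '日本語' | wc -m    # prints 3 (characters)
echo -n '日本語' | wc -c    # prints 9 (bytes)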

The non-English man pages themselves (in /usr/share/man/ja /usr/share/man/ru etc.) are still in ‘legacy’ encodings in Debian (EUC-JP, KOI-8, iso8859-1). However, the man system converts them ‘on the fly’ to UTF-8 when your locale is a UTF-8 one.

NOTE: if you have some non-English man pages installed, they can most of the time be viewed correctly even in an ‘English’ locale on your UTF-8 system using the -M (manpath) option of the man command. For instance, to see the Russian man page for the chage command:

man -M /usr/share/man/ru chage

 

4. Some font issues

To display Unicode characters you must have fonts which include them. Reasonably modern Truetype fonts like the Microsoft Core Fonts, as well as free fonts like the Freefonts and the Dejavu fonts (successors to the well-known Bitstream Vera Fonts), contain many Unicode characters apart from just Latin-1. Many have Russian and (at least modern) Greek; Freemono has Hebrew and International Phonetic Alphabet; Times New Roman and Arial have Hebrew and Arabic. Free Sans does many Indic scripts; Freemono does Amharic, etc.

Alan Wood’s Unicode Resources helps to find fonts for various ‘ranges’ of characters in the Unicode character set.

For aesthetic reasons, FreeSerif definitely needs some ‘leading’ (extra space between lines; see section 9.4).

The freefont fonts

Freefont is a project to provide fonts for all alphabetical characters in Unicode in three styles (‘Free Sans’, ‘Free Mono’, and ‘Free Serif’), with a typographically consistent look. In Debian you can get them by means of apt-get install ttf-freefont.

The project’s manifesto looks like John Baskerville let his hound chew on the movable type, but the fonts themselves are very nice really. FreeMono is very useful for printing monospaced Unicode text: it has a much wider character coverage than Monotype Courier New. FreeSans and FreeSerif also have very extensive coverage (including Classical Greek, and many Indic scripts), out-performing Arial and Times New Roman in this regard. I set them as default fonts in my browser.

If you are interested in international text, by all means install the Freefonts! In fact, this article more or less assumes you have done so. Make sure you have the most recent versions; before the middle of 2006, there were still several bugs in the Freefonts.

If you install a new .ttf font ‘by hand’ (not as a ‘package’, e.g. a Debian font package) you can put it in /usr/share/fonts/truetype/ or in a subdirectory thereof. Most of the time this is sufficient (at least for fontconfig-using applications, see section 4.2) to recognise the fonts. If some application does not recognise them, stronger measures may have to be taken; in the case of Debian, this might involve registering the font using dfontmgr, and also doing the following as root:

cd /usr/share/fonts/truetype
mkfontscale
mkfontdir
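For fontconfig-using applications, it may also help (or even suffice) to refresh the fontconfig cache after copying the font files; as far as I know this is done with:

fc-cache -fv    # -f forces a rebuild, -v shows which directories are scanned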

 

4.1 Character Map

Symbols and fonts can be investigated with the program gucharmap. Set the font with the drop-down box on the top left. Right-click on the symbols on the right to see if they are from the selected font, or from another font. This is a nice way to explore the whole Unicode character set (i.e. the parts that you have fonts installed for).

gucharmap can be used to copy-paste any Unicode character, or string, into applications. Characters can be inserted into the copy buffer by double-clicking them (in the table on the right), or by typing directly into the copy buffer. gucharmap supports the GTK immodule system, so by just right-clicking in the buffer you can select various input methods.

 

4.2 Fontconfig Magic

By some kind of magic called ‘the fontconfig library’ (Debian packages fontconfig and libfontconfig), if a symbol is not present in the selected font, applications can find another font which contains it. It works (approximately) by configuring the application to use fonts with certain characteristics, rather than fonts with specific names. For instance, newer versions of Mozilla are configured (at least in Debian) to use a font called monospace whenever a fixed-width font is needed. A font with that name does not exist. But libfontconfig tries to find it. The file /etc/fonts/fonts.conf specifies that the preferred font for ‘monospace’ is Bitstream Vera Sans Mono. But other fonts, like Courier New and Andale Mono, may be used if B.V.S. Mono is not available, or when a certain character is not available in the font.

In practice this means that when monospace is the font name, the ‘normal’ Latin letters will be taken from B.V.S. Mono, while ‘Latin Extended A’ characters like ā, ē, ī, ō, ū will be borrowed from Andale Mono.
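You can ask fontconfig directly how such a request is resolved (assuming the fc-match utility, which is part of the fontconfig package):

fc-match monospace            # the real font that 'monospace' resolves to
fc-match -s monospace | head  # the ordered list of fallback fonts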

The use of this clever mechanism is not yet universal in Linux systems. For instance, the Konqueror browser on my system originally did not use it; it was configured to use B.V.S. Mono itself. After I changed this setting to monospace, Konqueror’s understanding of UTF-8 improved.

NOTE: such effects no longer occur when the Dejavu fonts are installed instead of the Bitstream Vera fonts; the Dejavu fonts have the same design as the Vera fonts, but much wider character coverage.

This magic is useful, but sometimes it produces surprising results. For instance Bitstream Vera Serif does not have ‘Latin Extended A’ characters (numbers 0x100 to 0x17f), but Times New Roman does. So if your default font is B.V. Serif, the letters with macrons on them are taken from T.N. Roman. This looks a bit jarring because T.N. Roman and B.V. Serif are typographically so different.

 

4.3 Korean

For Korean, the free Baekmuk ttf fonts can be used. They are enormous in size, because there are pre-combined forms for, it seems, every theoretically possible Korean syllable (numbers 0xAC00 to 0xD7A3). Another good set of Korean fonts is provided by the UN fonts (in Debian: package ttf-unfonts), which are even bigger, but which may have typographical advantages. Just try them out.

 

4.4 Japanese and Chinese

There is no separate ‘Japanese kanji’ section in the Unicode system. There are ‘unified kanji’ for all kanji-using languages together. The basic set of unified kanji runs from 0x4E00 to 0x9FA5. In this huge number of kanji (more than 20,000) all the characters for daily use and ‘normal scholarly use’ are contained. Japanese fonts like Kochi Gothic provide (from these) the ones that are standard in Japan; the rest is displayed as a box with four hex digits in it.

Other sets of ‘unified kanji’ can be found e.g. in the ranges for ‘CJK Unified Ideographs, Extensions A and B’: 0x3400 – 0x4DB5 and 0x20000 – 0x2A6D6 (together about 50,000 characters). These appear to contain mostly ‘historical’ forms, found in classical books and dictionaries.

Remarkably, complaints can occasionally be heard that Unicode does not provide enough kanji. A famous ‘bone of contention’ is the character for ‘bone’ (Unicode number 0x9AA8, 骨), which the Japanese and the Chinese write differently.

The difference can be seen in your browser only if you installed both a Japanese font (e.g. Kochi) and a ‘simplified’ Chinese font (e.g. Arphic Sungtil GB), and told your browser about it. Here are .png pictures of the Japanese and Chinese ‘bone’ characters: (J) (C). (NOTE: for ‘traditional’ Chinese as used in Taiwan, the font AR PL ShanHeiSun Uni can be used. For Hong Kong, the font AR PL Mingti2L Big5 would be better.)

The little square inside the square on the top is on the right in the Japanese version, on the left in the Chinese version. In ‘proper’ hand-writing, the Chinese version can be written slightly faster (one stroke less), and some Japanese consider the two characters different because of this difference in ‘stroke count’. However, the Unicode working groups (which included a lot of Japanese scholars) considered them basically the same, with the difference being just (typo-)graphical. Just choose a Japanese font if you want to show the version which is used in Japan; no need to give the two variants different numbers.

Actually, only the Mainland Chinese write the character with the small square on the left. In Hong Kong and in Taiwan, the top of the ‘bone’ character looks the same as the Japanese version.

This account of the ‘bone’ controversy is highly simplified. For a more thorough discussion, see, e.g., this thread.

Part of the problem seems to be that the official Unicode standard (now version 5.0; printed as a paper book, and downloadable from www.unicode.org as a number of .pdf files) includes a character list in which the ‘bone’ character (0x9AA8) is shown only in its Chinese form. This apparently made some Japanese people think that Unicode was not suitable for writing Japanese. But Unicode says nothing about how the bone character should be written. That is left to the users (and the fonts they have installed). (At the time of writing, May 2007, not all chapters of version 5 of the standard have been made available as .pdf yet. Select version 4.1 if you want .pdfs.)

 

4.5 Greek

Greek is an example of a language which has too many characters defined in Unicode. For instance, the character ‘alpha with acute accent’ has been defined twice: once as Unicode number 0x3ac (ά), and once more as Unicode number 0x1f71. Depending on the Greek font you have installed, these two characters may or may not look different in your browser or in print. Look at the angle of the accent with respect to the vertical.

Officially, the character 0x1f71 should not be used. In fact, it should not have existed at all. 0x3ac should be used instead. But because of some typographical fashion in Greece itself during the 1980s, in many fonts 0x3ac does not look right when displaying classical Greek. That is why the 0x1f71 character was invented. This is now considered to have been a mistake. However, the genie is out of the bottle: classicists overwhelmingly tend to use 0x1f71 (as can be seen by Googling for Greek words containing 0x3ac and 0x1f71 respectively).

The ‘officially’ correct solution is to use 0x3ac, and to have a font in which the 0x3ac character looks OK for classical Greek (i.e. like the 0x1f71 character). Typographical fashions in Greece have now changed, so this solution is right for modern Greek also. Whether the classicists will really make this turn-about remains to be seen. They might, because most fonts now behave ‘correctly’.

Many fonts, including the ‘Microsoft Core Fonts’, do not include ‘Classical Greek’ accented characters at all.

Examples of fonts with ‘non-correct’ Classical Greek accented characters are efont (for xterm) and older versions of FreeSerif. An example of a ‘correct’ font is Gentium (also a Debian package: ttf-gentium); very nice for printing. It includes Latin and Greek; the Latin characters are (unfortunately) horrible for screen display, so you’d better tell your browser to use Gentium for Greek only! The latest FreeSerif versions (2006+), as well as the DejaVu fonts, also display the accents correctly.

For details, see, e.g. Nick Nicholas’ site, and the linux-utf8 mailing list.

 

4.6 Indic scripts

Many Asian scripts (especially of the languages of India) are complicated to display (and print) on a computer. This is not because they have an especially large number of characters (like Chinese and Japanese have); the difficulty is in the way the characters are arranged to form words and sentences. Such scripts are also called Complex Text Layout (CTL) scripts.

The Freefonts offer good coverage of many Indic scripts; if you want more typographical choice, you could use e.g. the Debian package ttf-indic-fonts.

An example is in the picture below. This illustrates one (still relatively simple) feature of the Devanāgarī script (a left-to-right ‘alphabetical’ script, used for e.g. Hindi and Sanskrit). The example uses the FreeSans font.

In Devanāgarī, there are (in principle) no separate consonants. The vowel ‘a’ is always implied. So there is no sign for ‘k’, but there is one for ‘ka’. If you want to write ki, ko, ku, etc., you must combine the ‘ka’ sign with a vowel sign. This functions as a kind of ‘accent’. The vowel sign must be attached somewhere to the ‘ka’ sign, at a location called an anchor. The picture shows how to form ‘ki’ and ‘ku’.

Now I suppose this could have been done by defining ‘ku’, ‘ki’, etc., to be separate characters in Unicode. This is what was done with Korean, where thousands of different letter combinations were each given a separate Unicode number – in effect making them different characters. As a result, Korean font files are huge (multi-megabyte), even though Korean is an ‘alphabetical’ script, with a small number of basic characters. The keyboard input system has to select the combination to be displayed (this is fairly easy in the case of Korean).

In Indic Unicode scripts a different approach was chosen. The number of characters is kept limited. The font files are therefore relatively small. Combining and arranging characters is done only at the display (or printing) stage, by means of the ‘anchor’ system and various other tricks. Technically, this is more difficult, but the technical difficulties have been overcome.
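You can see this ‘limited character set’ approach at the byte level: कि (ki) is stored as two characters, KA (U+0915) followed by the vowel sign I (U+093F), even though the vowel sign is displayed to the left of the KA. A sketch, assuming a UTF-8 terminal and the xxd utility:

echo -n 'कि' | xxd    # e0 a4 95 (U+0915 KA), then e0 a4 bf (U+093F vowel sign I)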

The information on how to arrange the characters must be in the fonts. Modern .ttf and .otf fonts have ‘anchor’ information in a special section of the font file, called the GPOS table. The computer (i.e. the part of the software that actually puts the characters on the screen) must be able to correctly interpret the instructions contained in this table. Openoffice has always been pretty good at this, but Mozilla and its derivatives (on Linux) always had problems. However, the new Firefox 3.0 (called Iceweasel by Debian) can display and print CTL correctly. Firefox 2, on Fedora-based distributions, using another approach, could do this already by the end of 2006.

Bitmap fonts don’t have a GPOS table (they only contain images of the characters), so in general, they don’t show CTL text correctly. In particular, you would see the ‘i’ vowel sign to the right of the base character, not to the left.

xterm cannot handle CTL, but mlterm can (when used with a .ttf font). konsole can do it also. The typesetting library pango, and therefore also the text print program paps which uses it (see section 9.1.1), handles CTL correctly.

For inputting Indic scripts, input method frameworks like uim, scim, and IIIMF can be used (see section 6.4), in combination with suitable Indic input methods (supplied by the m17n library in the case of scim and uim). With the proper method selected, you just type ki, and कि will appear on the screen. For more information about Indic input, see section 6.5.4.

4.7 Right-to-left (RTL) languages (Hebrew, Arabic, Persian …)

From the viewpoint of a computer, text is not written ‘right to left’, nor left-to-right, top-to-bottom, or whatever. To a computer, a text is just a file, which is written from the beginning to the end. Every person knows what the beginning and the end of a text is! And every computer (or more correctly, every operating system) knows what the beginning and end of a file is. This suggests what we could call the Cardinal Rule of text handling in computers: what comes first in the text should come first in the file. The difference between RTL languages and other languages is just in the way the file is displayed.

Imagine you happen to use a language in which the word for ‘HELLO’ is also HELLO, but which uses a writing system which shows it as OLLEH. Then you need a display system which displays the word as OLLEH. But you think of the word as HELLO, and your computer should store it as HELLO with H at the beginning and O at the end.
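You can verify that Linux follows the Cardinal Rule by dumping an RTL string (a sketch, assuming a UTF-8 terminal and the xxd utility; the Hebrew word shalom is an arbitrary example):

echo -n 'שלום' | xxd    # d7 a9 (ש, the FIRST letter) comes first in the file

The bytes appear in logical order, from beginning to end; only the display runs right to left.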

Unicode display systems on a computer can decide how to display text purely on the basis of the characters which are to be displayed. They use a decision mechanism (or algorithm) which basically goes like this: if the characters are:

  • Latin, Greek, Russian, Chinese, Japanese, … : display LTR
  • Arabic, Hebrew, Persian … : display RTL

Many display systems on Linux can use this method, known as the bidi (or bi-directional) algorithm.

In the (recent) past, many display systems could not use the bidi algorithm. To cope with this, people used to store RTL text backwards (in so-called ‘visual order’) for it to be displayed, in violation of the Cardinal Rule. This, however, led to all kinds of horrible complications. To be honest, bidi itself also leads to horrible complications (for instance: how exactly do you put fragments of RTL text between parentheses?) but these have generally been solved.

Even now there are some display systems which do not understand bidi. One example, alas, is (at the moment) xterm. When UTF-8 text editors (like mined and joe) run on xterm, Hebrew and Arabic are displayed the wrong way round. There used to be a project called xterm-bidi which provided a bidi patch for xterm, but it seems to have died. Alternatives to xterm are mlterm and konsole, which do understand bidi (it has to be activated as a special option in konsole). Otherwise, to edit RTL texts comfortably, you have to use GUI editors like bluefish.

Even inside a non-bidi-aware display system you can display (but not edit) RTL texts, if you have the package libfribidi0 installed. E.g. within an xterm you can call cat arab.txt|fribidi|less -r. The fribidi program (a utility included with libfribidi0) turns the Persian / Arabic / Hebrew parts of the text around, giving a readable display.

Do not do this in mlterm, however: mlterm has fribidi built-in, so the RTL text will be turned around twice, making the display unreadable again!

Most word processors, like Openoffice, and most print systems, including paps, understand bidi without problems.

 

4.8 Unicode in Web pages

Language codes for Chinese:
Mainland: zh-hans
Taiwan: zh-hant
Hong Kong: zh-hk
This works in Firefox-type browsers at least.

Although Unicode is one character set for all languages, it is still advisable to include language information in your HTML, especially to distinguish between Chinese and Japanese. If you omit the language information, the rendering by Unicode-aware browsers will be readable anyhow (that is the whole point of Unicode), but the result may not be optimal or (according to some) even ‘correct’, as explained above.

This is even more necessary when pages must be viewed in the Mozilla browser. Mozilla comes in many different variants (e.g. ‘xft’ and ‘non-xft’ versions; Linux and Windows versions), which each seem to have their own particular method of selecting fonts. To be certain that the user’s preferences for various languages will be honoured, mark them in the HTML.

E.g. to mark the whole ‘body’ as English: <body lang="en">. To mark some part as Japanese: this line contains <span lang="ja">日本語</span> and English. I have ‘language marking’ in several places (although not everywhere) in this document; to check, use ‘view’, ‘page source’. Later versions of this page will explain font selection mechanisms in more detail. That is, when I understand it myself; this may take a very long while, because font selection on Linux is such a mess. The ‘Debian font manager’ (defoma) provides the most shameful illustration of this mess: read the man page of defoma-user, and weep.

If you do not include any language information, Mozilla, when presented with a UTF-8 page, will just assume some default language based on the ‘locale’ of the user’s system. Sometimes quite messy display may result.

 

4.9 Bitmap Fonts

Unicode bitmap fonts for use with xterm exist, for instance in the Debian package xfonts-efont-unicode. You can get a multilingual xterm by means of the command

xterm -u8 -fn '-efont-fixed-medium-r-*-*-16-*-*-*-*-*-iso10646-*'

To check which characters are available in efont, you can use the following command:

xfd -fn '-efont-biwidth-medium-r-*-*-16-*-*-*-*-*-iso10646-*'

This shows that apart from all kinds of accented Latin letters, you also get Greek, Russian, Arabic, Hebrew, Thai, Japanese, Korean, etc.; a very full Unicode implementation indeed. With efont, xterm even gives pretty good results with ‘multiple combined diacriticals’ (see below). Unfortunately, efont does not support Indic scripts apart from Devanāgarī; but the Indic scripts (having ‘complex text layout’) do not work well with bitmap fonts anyway (see section 4.6).

The efont fonts in fact provide two versions of themselves, called fixed and biwidth respectively. To start an xterm you have to use the fixed version (or you will get display problems). However, xfd only shows the full contents of the font if it is called with biwidth. I don’t know why.

Markus Kuhn’s Unicode fonts and tools for X11 page also has Unicode bitmap fonts.

I give a description of how to configure xterm to use Unicode fonts on a permanent basis here.
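One common way is to set the font via X resources (a sketch; these are standard XTerm resource names, but your setup may load resources differently):

! in ~/.Xresources; load with: xrdb -merge ~/.Xresources
XTerm*utf8: 1
XTerm*font: -efont-fixed-medium-r-*-*-16-*-*-*-*-*-iso10646-*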

NOTE: xterm, at the moment, cannot display right-to-left (RTL) texts correctly. An alternative which can is mlterm. See section 4.7.

 

4.10 Unicode on the console

There is very little support for Unicode on the console. There is a command called unicode_start, the man page of which says

The unicode_start command puts both the screen and the keyboard in Unicode mode [..]

But, in fact, it seems that this command (which you can undo with unicode_stop) only puts the screen into UTF-8 mode. E.g., the UTF-8 character é (=hex C3 A9) is correctly displayed (if you have a console font which has the é). But typing the é will not work. And console fonts can have only a very limited number (256, or, at a pinch, 512) of characters, so full Unicode support seems difficult, to say the least.

The subject of ‘console support for Unicode’ comes up regularly on the linux-utf8 mailing list. It seems that to make it work, extensive changes in (in fact, large additions to) the kernel would be needed. This simply does not seem worth the effort. So don’t count on real Unicode support for the console very soon. Run your Unicode text-mode programs in X, using xterm, which has good Unicode support.

UTF-8 ‘console’ support requires framebuffer

The consensus now seems to be: the actual Linux console is a terminal device. You could, and I once actually did this as an experiment, control a Linux system through a ‘dumb terminal’ connected to a serial port. You could probably use an old ASR 33 Teletype if you really had to. But it would be unreasonable to expect the mechanical ASR 33, with its solid metal printing head, to be able to print the entire Unicode character repertoire. The same applies to the ordinary console support provided by the kernel. The kernel will provide you with ‘emergency’ services, in case the system is in trouble and you want to connect in single-user mode. You should not expect anything more.

People who do insist on ‘console’ support for Unicode should provide for it in user space, not kernel space. Various projects for doing just that, based on using a framebuffer console, are actually under way, see, for instance, here. But IMHO, if you can do framebuffer then you could just as well do X.

 

4.11 A note on combining diacriticals

Many letters with diacriticals (‘accents’) are defined in Unicode. For instance a with acute accent, á, has number 0xE1.

The Unicode specification also has ‘combining diacriticals’ (with numbers in the range 0x300 – 0x36F). These, when placed after an ordinary character, should be displayed in combination with the preceding character as ‘one glyph’. So the character a, 0x61, followed by ‘combining acute’, 0x301, should also display as á. In this browser, with this font, it displays as á.

The mined editor has facilities for entering combining diacriticals, and displaying them both as ‘combined’ characters and split into their components.
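You can generate both forms from the shell for comparison (a sketch; assumes a reasonably recent bash, whose printf builtin understands \u escapes, and a UTF-8 terminal):

printf '\u00e1\n'     # pre-combined 'a with acute': one character
printf 'a\u0301\n'    # a + combining acute (0x301): two characters, ideally one glyph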

There are rules (in the Unicode standard, chapter 2.10) on exactly how this should be done. E.g. if a letter a is followed by two accents above, the first accent should be displayed directly above the a, the second one above that. This way, any Unicode character can be adorned with any number of combining diacriticals above, below, or even to the side of it. This can be used to make ‘new’ characters for which no entry in the Unicode table exists.

For ‘combining diacriticals’ to work (i.e. for them to be displayed and printed correctly) both the fonts and the software that displays them have to fulfill certain requirements. In fact, ‘combining diacriticals’ are a type of ‘Complex Text Layout’. Fonts for ‘real’ CTL scripts always include mechanisms (anchors, GPOS tables…) for displaying, e.g., the languages of India; they have to, otherwise they would be unusable. However, support for ‘combining diacriticals’ in Western fonts is still very limited. Few people seem to care, because ‘pre-combined accents’ are widely available in Western fonts. For some discussion of this problem, see this thread.

Browsers generally do not handle combining diacriticals well. As a browser display test, here are some examples of combining diacriticals:

  • a + combining acute: á
  • a + combining tilde + combining acute: ã́
  • a + combining circumflex + combining caron: â̌
  • a + combining caron + combining circumflex: ǎ̂

This is rarely displayed well. It works reasonably well with the latest Mozilla variants using the Dejavu Serif font, but that is the exception rather than the rule. In practice this means that ‘combining diacriticals’ should be avoided; ‘pre-combined’ single characters should be used whenever possible. If such pre-combined characters don’t exist (yet) in Unicode (i.e. with really exotic languages, like some African ones), the solution with combining diacriticals is (and probably will remain) more theoretical than practical. This is a pity, because pre-combined characters are evil in principle. Combining diacriticals would be much better, if there were not so much political opposition to them, and especially if they actually worked.

 

5. Editing and word-processing text in UTF-8

 

5.1 Editing

In the first place you need an editor program which is UTF-8 aware. Running a non-Unicode-aware application will not work (even if run in a Unicode-capable xterm with the proper font). The application will still assume, for instance, that 1 character = 1 byte = 1 position wide on the screen. You will get all sorts of problems with cursor positioning, backspace, etc. Some possible editors are:

  • yudit. The first Unicode-capable editor under Linux. It has a built-in input method selector. The user interface is etwas gewöhnungsbedürftig (needs some getting used to) like the Germans say. I myself am not yet used to it.
  • kedit, when started with the UTF-8 option: kedit --encoding UTF-8. It can print UTF-8 using its own built-in method, which in my case did not give terribly nice results. I think it is better to use general text printing utilities (see section 9.1).
  • gedit. This works with the GTK-2 immodule system (see below). It is a UTF-8 editor (incl. RTL and CTL scripts), its user interface is very nice, but unfortunately it is very unstable; already a few years ago I found that it often crashed, with data loss. I tried it again recently (July 2006) and found that it has become even worse. Maybe you need full GNOME (which I do not have) to use it reliably.
  • bluefish. Especially for HTML editing. This also works with the immodule system. As far as I could find out, this is a stable program. Also does RTL and CTL.
  • mined. This is a text-mode editor (for use on the console, or more likely in a UTF-8 capable xterm; the other editors mentioned above all use a GUI). mined has a built-in keyboard input system for all kinds of ‘alphabetical’ languages (including Hebrew, Arabic, Greek, Russian …) and some support for Kanji-type languages. Mouse support (including scroll wheel). ‘Smart quotes’ support. Syntax highlighting. Capable of reading and writing many other encoding systems than UTF-8, e.g. JIS, EUC-JP. Very stable: it never crashed when I used it, unlike some of the editors mentioned above …
  • joe (my favourite editor). Also a text-mode editor. This fast and full-featured editor, with a user interface like Wordstar or the ‘Borland Turbo’ editors, has long been available for Linux. Versions 3.0 and higher have Unicode support.
    To actually see ‘exotic’ characters in joe, you should run it in an xterm with unicode fonts, and inside a UTF-8 locale. See section 3 and section 4.6. The new versions of joe now also have coloured syntax highlighting, which is very helpful when you are writing HTML (or C). No mouse support, however.
    For keyboard input, joe relies on what the (X) environment provides. This, I think, is good design philosophy; but good keyboard support from the environment, in Linux, is only a fairly recent achievement. That is why older multilingual editors like yudit and mined provide their own systems.
  • It goes without saying that the old stalwarts, vim and (X)emacs, are also capable of handling UTF-8. I don’t use them myself, but more information about using them with UTF-8 is here.

 

5.2 Word processing

Openoffice knows how to handle UTF-8, and prints it beautifully. For keyboard input, see the next section.

 

6. Inputting from the keyboard

There are various ways in which you can input multilingual text:

  • For alphabetical languages (like English or Russian) you can select various keyboard layouts by means of xkb utilities like the setxkbmap command, or by means of keyboard selection utilities in KDE and GNOME. Using the ‘general’ xkb facilities as described below seems to work both on GNOME and KDE, though (but you may have to change KDE’s defaults).
    Using the xkb resources you can also define a ‘hotkey’ to cycle through multiple pre-defined keyboard layouts.
  • Switching the keyboard layout changes the ‘meaning’ of the keys; e.g. the Q key on a QWERTY keyboard produces the letter Й if the keyboard is switched to standard Russian (‘Я’ if switched to ‘phonetic’ Russian, which is rarely used in Russia itself). You specify the default keyboard layout in the X configuration file, which is called /etc/X11/xorg.conf in the more recent Linux systems (in slightly older systems it is called /etc/X11/XF86Config-4).
  • For typing languages based on the Latin alphabet, keyboard switching is rarely necessary, because all required accents, etc., can be entered by means of ‘dead keys’ and the Compose key.
  • Rarely-used ‘exotic’ characters can be inserted in your text by pasting from a UTF-8 character map.
  • For Chinese/Japanese/Korean (CJK), there are input methods which convert series of keystrokes into characters or strings of characters. Examples of input methods for Japanese are the kinput2/canna combination (obsolete), anthy, and ATOK (commercial).
  • Input method frameworks offer an easy way of selecting both keyboard layouts and input methods. They can be extended with language modules, so you can enter almost any language in the world. Examples of input method frameworks are uim, scim, and IIIMF.

These various possibilities are described in more detail below. My understanding of the technology behind them is extremely hazy, so again I’ll take the user point of view: I just tell what you have to do to make it work. E.g. sometimes environment variables must be set; don’t ask me what these things actually do!

 

6.1 Setting up the default keyboard and the Compose key

The traditional way of customising your keyboard involves the xmodmap utility and the ~/.Xmodmap file. In modern versions of X, this is ‘deprecated’. If you have mysterious problems with your keyboard, try to find out if you have an .Xmodmap file in your home directory, and rename (or delete) it.

The keyboard layout is specified in the main X configuration file (/etc/X11/xorg.conf or /etc/X11/XF86Config-4), in one of the ‘InputDevice’ sections. The layout should match the actual layout of the physical keyboard that you have. So for instance if you have a German (QWERTZ) keyboard, there should be a line

Option "XkbLayout" "de"

The standard ‘us’ layout for ‘US QWERTY’ keyboards (which are used in the Netherlands and some English-speaking countries) has no special facilities for entering accented letters. You can produce accented letters with it only if you use the Compose key (see below). A more convenient method for entering accents is to specify a variant of the us layout, called alt-intl:

Option "XkbLayout" "us"
Option "XkbVariant" "alt-intl"

This makes all kinds of accents available on the keyboard by means of ‘dead keys’. For instance you can make ä by pressing first ", then a. The " by itself can now only be made by pressing " followed by a space, or by pressing " together with the right-Alt key.

IMPORTANT: For experiments with the keyboard settings, it is not necessary to edit the config file. The setxkbmap command can be used instead. It can be called from any xterm and immediately affects all running programs. For example you can call:

setxkbmap us -variant alt-intl

The variant can also be put between parentheses; if you use them you must enclose the command in double quotes:

setxkbmap "us(alt-intl)"

Apart from alt-intl, there is another international variant of the US keyboard layout:

Option "XkbVariant" "intl"

This has several letters with accents ready-made in various places on the keyboard; they are accessible with the right-Alt key. For instance, ä is made by right-alt + q.

The keyboard layout files are in /usr/share/X11/xkb/symbols (for the new Xorg 7.0 system. On older systems, they may be in /etc/X11/xkb/symbols/pc).

Do not use the us_intl layout

In previous versions of this document, I recommended using the us_intl keyboard layout. This advice must be considered outdated. The alt-intl variant of the us layout does exactly the same job, but better. It makes it much easier to use multiple keyboard layouts (see below). If you want to know why, start here.
NOTE: the us_intl layout no longer even exists in Xorg 7.0.

Apart from ‘variants’, the X keyboard system also knows ‘options’. One very useful option (which can be used with any layout and variant) is to enable the famous Unix ‘Compose key’. To make the right ‘Windows’ key function as Compose, you would call e.g.

setxkbmap us -variant alt-intl -option compose:rwin

In the config file this would be specified as follows:

Option "XkbLayout" "us"
Option "XkbVariant" "alt-intl"
Option "XkbOptions" "compose:rwin"

Using the Compose key you can enter combinations of characters, like ¥ (RightWindows, Y, =), and ß (RightWindows, s, s). The Compose key must be pressed and then released before you enter the other characters in the ‘Compose sequence’.

In a UTF-8 environment, lots of special characters can be made with the Compose key and the right-Alt key. With us(alt-intl) for instance, the right-Alt (often called AltGr) key, together with the ‘minus’ key, acts as a ‘dead macron’ key. So AltGr-minus, then a, produces ā. Good for transliteration of Japanese: Tōkyō. Hōrensō. Uchū Kenkyū. Also good for writing the Māori language of New Zealand, as well as for Latinists (‘the ablative of serva is servā’).

Both AltGr-e and AltGr-5 produce the Euro sign, € (Unicode number 0x20AC). AltGr-e is the EU standard, but where I live most keyboards are sold with the € sign engraved on the ‘5’ key. There are also several Compose sequences for the € sign (for instance Compose = c and Compose = E).

The full list of AltGr and Compose combinations is in the UTF-8 ‘Compose file’ (which is valid for virtually all languages and locales, not only en_US):

/usr/share/X11/locale/en_US.UTF-8/Compose

NOTE: The above location of the Compose file is valid for Xorg 7.0 and up. On older systems it may be in /usr/X11R6/lib/X11/locale/en_US.UTF-8/Compose.

There are (as far as I know) only two UTF-8 locales with their own Compose file: Brazilian Portuguese and Greek. I do not know about the Brazilian one, but the Greek file seems unnecessary. The Greek language (including Ancient Greek) is already covered in the international (so-called ‘US’) file.

Some examples (there are lots more; see the Compose file. You can see them in the browser only if you have a font that includes them):

Compose ' a  á (a with acute accent)
Compose o r  ® (‘registered’ sign)
Compose ? ?  ¿ (Spanish upside-down question mark)
Compose = L  ₤ (‘pound’ sign)
Compose / c  ¢ (‘cent’ sign)
AltGr . i  ı (Turkish dotless i). In general, AltGr . acts as a ‘dead dot above’ for letters which normally do not have a dot, but it removes the dot if it is already present.
AltGr Shift -  ‘dead dot below’, useful for Sanskrit scholars who want to input letters like ṭ or ḥ.
AltGr ( o  ŏ (o with breve; can also be made with Compose U o)
Compose # b  ♭ (musical ‘flat’ sign)
Compose - - .  – (short dash, ‘en dash’)
Compose - - -  — (long dash, ‘em dash’)
AltGr 9  ‘ (left single quotation mark)
AltGr 0  ’ (right single quotation mark)

Here also, AltGr means the right Alt key. Nobody seems to know why it is called AltGr; Google will turn up many different explanations.

The single quotation marks can be entered by means of AltGr 9 and AltGr 0 using the us keyboard (both intl and alt-intl variants) in modern versions of Debian. I am not sure about other Linux distributions. If, e.g., AltGr 9 does not work, try Compose AltGr ' < (a more complicated sequence, but more certain to work).

Word processors usually have a ‘smart quotes’ mechanism which changes apostrophes automatically into proper quotation marks (the mined editor also has this).

GTK2 and QT ignore the X Compose file; they have their own (much smaller) ‘built-in’ Compose tables (different for GTK2 and QT), because they have their eye on the ‘Windows market’. So by default, they do not rely on things which are present only in X. You have to tell them to use the X Compose file by means of environment variables (at least, that’s what I understand of it).

IMPORTANT NOTE: With so-called GTK2 (or GNOME) programs, including Mozilla, only a small number of Compose combinations are available by default; e.g. you cannot enter ŏ. In QT (or KDE) programs there are similar (but not quite the same) problems. To remedy this, you must put (in Debian) the following lines in /etc/environment:

export GTK_IM_MODULE=xim
export QT_IM_MODULE=xim

In Debian Sid at the moment, it is better to leave /etc/environment alone, and to install a package called im-switch. Run im-switch -c, and select default-xim from the list.

In other systems than Debian, there may be other methods for setting these ‘environment variables’.

GTK_IM_MODULE and QT_IM_MODULE must be set to xim if you do not have uim or scim installed. Otherwise, it is best to set them to uim or scim, as the case may be. The im-switch utility in Debian makes it easy to set these variables.


To check the values of the GTK_IM_MODULE and QT_IM_MODULE variables, you can run the command

env | grep -i module

If you want to change some Compose definitions, or add new ones of your own, this can be done by means of a file called .XCompose in your home directory. The .XCompose file should begin with the line

include "/usr/share/X11/locale/en_US.UTF-8/Compose"

(Again, this is for Xorg 7.0 and up. In older systems you must include /usr/X11R6/lib/X11/locale/en_US.UTF-8/Compose).

You then follow this by whatever you want to add. For the format of the entries, see the original Compose file. For instance, I added

<Multi_key> <period> <minus> : "…" U2026 # HORIZONTAL ELLIPSIS

This produces an ‘ellipsis’ (three dots: …) by means of Compose-dot-minus. I found the Unicode number of the ellipsis by means of gucharmap. Note that in the Compose file itself, the Compose key is called Multi_key.

 

6.2 Multiple keyboard layouts

The X keyboard subsystem (xkb) allows for multiple keyboard layouts, called ‘groups’. You can use this for entering non-Latin alphabetical languages (like Greek, Russian, or Hebrew). You must specify a key which cycles through the different ‘groups’. As an example, try

setxkbmap us,gr,ru -option grp:lwin_toggle

This enables inputting Latin, Greek, and Russian. I set the group changing key to lwin (the left Windows key), but you could choose many other keys for this purpose (see for instance /usr/share/X11/xkb/symbols.dir).

Test it: press left-Windows once and type abcdef on your keyboard. You get αβψδεφ. Press left-Windows again and type abcdef: you get фисвуа. Press left-Windows once more and you have the us keyboard back. Instant Greek and Russian!

In this example we have three groups that we can cycle through. The maximum is four groups (why there should be a maximum at all, I don’t know). Scripts which can be made accessible by ‘group switching’ are (besides Greek and Russian): Arabic (ara), Hebrew (il), Thai (th) and probably many more.

For every group we can set up its own variant. The variants should be entered in the same order as the groups, separated by commas. If a group only uses its default (or ‘basic’) variant, we leave the variant blank. For instance:

setxkbmap us,gr,ru -variant alt-intl,,phonetic -option grp:lwin_toggle

This sets the us variant to alt-intl, leaves the Greek variant blank (nothing between the two commas), and sets the Russian variant to ‘phonetic’. The latter makes the Russian output correspond more closely to what you would expect when typing on a keyboard with ‘Western’ key markings. Typing abcdef now produces абцдеф. Many non-western keyboard layouts have such ‘phonetic’ variants, e.g. bg (Bulgarian) and il (Hebrew); the variant of ara called buckwalter seems to be a ‘phonetic’ input system for Arabic. Options, as far as I know, apply to all available groups. This is how it would be entered in the config file, with a few extra twists:

Option "XkbLayout" "us,gr,ru"
Option "XkbVariant" "alt-intl,,phonetic"
Option "XkbOptions" "compose:rwin,grp:lwin_toggle,grp_led:scroll"

We have re-enabled the Compose key here. Also the Scroll Lock LED light on the keyboard will now be lit to warn you when any other group than the first (us in this case) is in use.

 

6.3 Using a character map

It is also possible to enter Unicode characters from a character table. With kedit, gedit, bluefish etc., the gucharmap program can be used. In Openoffice, Insert, Special Character will also work. Character table input is usable if non-ASCII Unicode characters have to be entered only occasionally.

 

6.4 Input method frameworks

Input method frameworks have greatly simplified the problem of entering truly multilingual text, including input of ‘complicated’ scripts like Chinese, Japanese, and Korean. In any UTF-8 environment you can now easily enter text in almost any language, without changing your ‘locale’. I know of three input method frameworks: uim, scim, and IIIMF. I describe only the first two in some detail, because I never managed to get IIIMF to work properly (it is rather short on user-level documentation). However, Redhat/Fedora uses it, so it must be possible to make it work.

The GTK-2 immodule system can also be called an input method framework; its use is, however, limited to some Gnome programs only.

 

6.4.1 uim

uim is an abbreviation of ‘Universal Input Method’. The version of uim which I describe here is 1.4, which is available (at the time of writing) in Debian Unstable. This allows input of all kinds of languages in all programs, including Mozilla, Openoffice, and ‘text-mode’ programs running in an xterm. It is even possible to use it with ancient programs like xfig (you must run xfig with the -international switch).

How to do this: first you install uim. In Debian unstable this is done by means of

apt-get install uim uim-qt

This makes uim available as an ‘empty framework’, i.e. with no language modules included. In Debian, earlier versions of uim included several language modules by default, such as Chinese Pinyin and Korean, but from 1.4 on, you must install all language packages separately. Some available language packages are (with their Debian package names):

Debian package     Language
uim-anthy          Japanese
uim-hangul         Korean (hangul-2, hangul-3, Romaja)
uim-byeoru         Korean (with hanja conversion)
uim-ipa-x-sampa    International Phonetic Alphabet
uim-pinyin         Chinese
uim-viqr           Vietnamese

A lot more languages become available if you install uim-m17nlib. Many of them are ‘alphabetical’ languages such as Greek and Russian, which can be enabled much more comfortably by means of the xkb facilities (see section 6.2). However, uim-m17nlib also contains phonetic conversion engines for lots of Indic languages.

uim itself is just a dummy package. It pulls in the real packages comprising the uim system: libuim5, uim-common, uim-fep, uim-gtk2.0, uim-utils, uim-xim, libuim-data, and im-switch. For some reason uim-qt is not installed automatically, so it must be specified separately. About im-switch, see below.

During the installation of uim-anthy, Debian asks if you want to add extra words to anthy’s dictionary:

Now, these following dictionaries are available.
[ ] base.t: Anthy specific words which are compatible with cannadic.
[ ] extra.t: Anthy specific words which are not compatible with cannadic.
[ ] 2ch.t: Dialects used in 2ch, the biggest Japanese web discussion group.

I don’t think this question is really important, but I selected all of them anyway.

It used to be quite difficult to set up the correct commands and environment variables for starting uim, but now it has become easy, thanks to im-switch. Just call as user

im-switch -s uim-toolbar

or

im-switch -s uim-systray

The choice depends on whether you want uim’s user interface (‘toolbar’) to float around somewhere on the screen, or to be incorporated in the ‘system tray’ of your window manager. uim-systray works well with most window managers in Linux, but not with all; especially problematic is icewm. If you find you cannot get a proper uim toolbar in the systray, use uim-toolbar (the ‘floating’ version).

After you start X it will take a few seconds for the toolbar to become visible. The ‘floating’ version of uim’s user interface looks like this:

[image: the ‘floating’ uim toolbar]

This toolbar becomes fully visible only when you start some program which allows text input, like xterm, Mozilla, etc. Note that in different versions of uim the symbols in the toolbar may look slightly different.

The two icons on the right (the two bars and the ‘tools’) both do the same thing when right-clicked; a configuration menu comes up:

[image: the uim configuration menu]

By selecting ‘preference’ you can choose which languages you want to have available, and what the default language should be. I normally set this to ‘direct’. Through ‘preference’ you can also customise the toolbar.

When left-clicked, the ‘two bars’ icon leads to a ‘language switcher’ menu, but first you have to ‘enable’ languages using a preferences menu, which you access by left-clicking the ‘tools’ icon (or through the menu shown above). There is a line ‘enabled input methods’ with a button marked edit next to it.

In general, in uim, you switch between ‘straight’ input and a ‘foreign’ language by means of shift-space, or by clicking some icon in the toolbar. Using the ‘direct’ input method, and also when you have selected a ‘foreign’ language but not activated it (e.g. when you use 直接入力 if the selected language is Japanese), it seems as if uim is simply not there; all xkb facilities (like keyboard switching, dead keys for accents, and Compose key combinations) just work.

uim effectively solves the multilingual input problem on Linux. Using the ‘language switcher’ (the ‘two bars’ button) you can even select a different input method for each running application. On your screen you can, for instance, have one xterm in which you input Russian, and another one in which you input Japanese.

anthy is an excellent Japanese input method (but other methods are also available, like prime). uim itself has many input methods, and the m17n library has many more. uim’s input method files, written in the Scheme (= LISP) language, are in /usr/share/uim. The m17n files are in /usr/share/m17n.

 

6.4.2 scim

scim (an abbreviation of ‘Smart Common Input Method’) is another input method framework, very similar to uim; the two systems seem to be in a state of friendly competition.

Installing scim involves apt-getting the scim package itself and some language modules. I installed three packages:

  • scim
  • scim-m17n
  • scim-anthy

Actually, the first one can be omitted, because the others pull it in anyway. If you want to input Chinese you should install other packages according to your needs, like scim-chinese, scim-pinyin, or scim-chewing. There is also a collection of input engines covering many Indic scripts: scim-tables-additional.

scim and uim can coexist on your system, but for each user only one can be used at any time. To make scim your default input method framework, im-switch can again be used. As user, call

im-switch -s scim

IMPORTANT: scim by default works only in an en_US.UTF-8 locale! If you use a different UTF-8 locale (or locales), edit the first line of /etc/scim/global, so it reads (for instance)

/SupportedUnicodeLocales = en_US.UTF-8,en_GB.UTF-8,ja_JP.UTF-8

Alternatively, this can be specified on a per-user basis in ~/.scim/config.
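
A sketch of such a per-user setting, assuming the same line syntax as in /etc/scim/global:

/SupportedUnicodeLocales = en_US.UTF-8,ja_JP.UTF-8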

After an X restart, SCIM shows its presence by means of a ‘keyboard’ logo in the taskbar (here shown in the xfce4 taskbar, next to a ‘Skype’ logo and an ‘Opera’ logo):

[image: the SCIM ‘keyboard’ logo in the xfce4 taskbar]

This works in most window managers. As with uim, you may have problems getting it to work in icewm. For KDE, there is a special scim version called skim; I haven’t tried it, and I don’t see why it is necessary, because the standard scim works ok.

Activate scim by means of control-SPACE. The SCIM panel comes up. This is not integrated in the taskbar, it floats around. The appearance of the SCIM panel, as well as of the logo in the taskbar, depends on the language you have selected:

[image: the SCIM panel and taskbar logo]

So while uim has either taskbar icons or a floating window, scim by default has both. Either, or even both, can be disabled in the scim preferences menu (right-click).

Like uim, scim works ‘everywhere’ (Mozilla, Openoffice, QT programs, programs in xterm …). You can get good UTF-8 input in many languages. However, scim has, IMHO, a few weak points compared to uim. In particular, it does not have uim’s strong point of ‘being invisible when not active’:

  • With the ‘English/European keyboard’ method you can input accented letters, and you have xkb-type keyboard switching available. However, the default is not ‘English/European’ but ‘English’ (which does not allow input of accented letters), and it seems you cannot change this default. You keep having to switch to ‘English/European’ manually if you want to input accents by means of dead keys.
  • scim uses its own internal version of the Compose file, ignoring the system-wide one. Some combinations in the system-wide Compose file may therefore not work, or have different effects; also, customisations made by means of a ~/.XCompose file will be ignored.
  • In ‘text-mode’ programs (running in an xterm), conversion of keystrokes to CJK characters takes place in a separate small window (no ‘on the spot’ conversion). I personally think that ‘on the spot’ is the most elegant method.

Because of this, at the moment I prefer uim over scim. In the past, uim used to have some instability problems, but now (version 1.2+) they seem to have disappeared.

It is supposed to be easy to create new input methods for use with scim.

 

6.4.3 IIIMF

When you get IIIMF properly installed (which is not too easy), it can be switched on and off with CONTROL-SPACE (just like scim), and the language to be entered can be selected by means of CONTROL-ALT-L. You then get a language menu.

Many language modules are available for IIIMF; it appears to be especially good for entering Indic scripts. For Japanese it seems there is only canna available (among the free systems).

Documentation for installing and using IIIMF can be found here; this is part of a much larger work, a very comprehensive Guide to Localisation, which is well worth reading.

 

6.4.4 The GTK-2 immodule system

If you right-click in an input field in a GTK-2 program, you get a menu which offers a choice of ‘input methods’. Some of them are very nice, sometimes nicer than similar input methods offered by uim, scim, and the m17n library.

You don’t have to do anything to install the immodule system; it comes automatically with the GTK-2 library which you most likely already have. The trouble with the immodule system is, however, that it only works with GTK-2 programs, and not even all of them; it does not work with Mozilla/Firefox, for instance. On my system, there are very few programs which can use it (basically only gucharmap and bluefish).

 

6.5 Notes on specific languages

 

6.5.1 Russian

The simplest way to enter Russian is to use the xkb facilities (see section 6.2). You can specify a ‘phonetic’ variant of Russian when needed.

6.5.2 Korean

Korean uses an alphabetical script; however, the letters are not arranged linearly, but in two-dimensional square patterns corresponding to syllables. That is perhaps why simple ‘keyboard switching’ does not suffice for Korean. There is no Korean keyboard layout file in the xkb system. You have to use an input method, either by itself or as part of an input method framework like uim or scim.

Using the ‘Hangul 2’ input methods available in the various input method frameworks, you can input, e.g., 서울역, seoulyeog (‘Seoul Railway Station’, seo-ul-yeog, 3 syllables), as tjdnfdur. The problem is that you have to get used to the Korean Hangul-2 keyboard layout first. It is not ‘Latin-based’; e.g. ㅁ, the Korean letter M, sits under the ‘A’ key. You can investigate the layout by experiment:

This table uses the standard Revised Romanization of Korean system.

Key  Letter  Romanization
Q    ㅂ      b
W    ㅈ      j
E    ㄷ      d
R    ㄱ      g
T    ㅅ      s
Y    ㅛ      yo
U    ㅕ      yeo
I    ㅑ      ya
O    ㅐ      ae
P    ㅔ      e
A    ㅁ      m
S    ㄴ      n
D    ㅇ      –/ng
F    ㄹ      r/l
G    ㅎ      h
H    ㅗ      o
J    ㅓ      eo
K    ㅏ      a
L    ㅣ      i
Z    ㅋ      k
X    ㅌ      t
C    ㅊ      ch
V    ㅍ      p
B    ㅠ      yu
N    ㅜ      u
M    ㅡ      eu

You can make ‘double’ letters by means of the shift key; e.g. Shift-Q delivers ㅃ. Shift-O gives ㅒ (yae), shift-P gives ㅖ (ye).

Syllables are composed automatically when you input the correct single letters. You may have to follow the last letter in your text by a SPACE to indicate the completion of the last syllable.

Because of the alphabetic nature of the Korean script, an input method in which ‘one key means one letter’ is the most convenient for the Koreans themselves. The Korean keyboard layout was already used on mechanical typewriters. But there are also input methods for Korean which do not require memorising the Korean keyboard layout: the ‘Romaja’ input methods. They are not used by the Koreans themselves, only by foreigners (in contrast to the Japanese ‘Romaji’ input methods, which are used by most Japanese). With a Romaja method, you type using the Revised Romanization of Korean system. To type ‘thank you’ in Korean (감사합니다) you would enter gamsahabnida (pronounced gamsahamnida).

Typing gamsahabnida requires 12 keystrokes both with hangul-2 and with a Romaja keyboard. For some letters, like yeo (ㅕ), Romaja requires more keystrokes than Hangul-2. But (on average) not a lot more, and you save learning a new keyboard layout.

Hangul-3 input

A Korean syllable can basically have two forms: (a) consonant + vowel, (b) consonant + vowel + consonant. For example, (a) 가 = ㄱ + ㅏ, (b) 각 = ㄱ + ㅏ + ㄱ. In case (b), the first consonant is graphically located at an upper position, and the last consonant at a lower position. One method of having a typewriter strike a consonant type at two different positions on the paper was to assign two different keys to a single consonant: one for the upper position, the other for the lower. This sort of keyboard layout, originally from mechanical typewriters, is Hangul-3. You can see one example here. In the displayed keyboard image, K is the upper ㄱ, and X is the lower ㄱ, for example. On a computer this manual distinction is no longer necessary, and the Hangul-2 layout is the current standard, but there are people who insist on Hangul-3 even on a computer. They claim that the Hangul-3 layout is more efficient than Hangul-2, like people who love the Dvorak keyboard. (explanation by Park Jae-hyeon)

 

6.5.3 Classical Greek

We want to input accented Greek characters: δεῦρο φίλη, λέκτρονδε, τραπείομεν εὐνηθέντες (‘Come to bed, darling, and let’s have fun…’ – Ares to Aphrodite in Odyssey, Book 8).

The easiest way to do this is by using the xkb facilities, using a special key to switch your keyboard to Greek. Just specify the gr layout with the polytonic variant (see section 6.2). When you have this set up, the Greek accents can be typed as follows:

Accent key result (on α)
acutus (oxía) ; ά
gravis (varía)
perispomenon (perispoméni) [
iota subscriptum (ypogegramméni) ]
spiritus asper (dasía)
spiritus lenis (psíli) :

By ‘key’ I mean the keys as marked on a US keyboard. On other keyboards, the keys which are in these locations will probably be marked differently. Please experiment. E.g. on a German keyboard, you would use shift-Ö instead of :, and shift-Ä instead of ".

Accents can be combined: if you type ]["a with the keyboard switched to Greek, you get ᾇ. Not every order of the accent combinations is supported; you have to experiment a bit. Generally, the iota subscriptum comes first, the breathing sign last.

A font which includes all accent combinations for Classical Greek is, for instance, FreeSerif. The efont bitmap fonts (for xterm) also have them.

Correcting a ‘polytonic’ bug in older systems

In older versions of Debian Sid (and maybe in other distributions) the gr layout contains a bug which prevents the ‘breathing’ signs from being entered in a non-Greek locale. To solve this, become root, and edit /usr/share/X11/xkb/symbols/gr (before Xorg 7.0: /etc/X11/xkb/symbols/pc/gr), changing

key <AC10> { [ dead_acute, dead_horn ] };
key <AC11> { [ dead_grave, dead_ogonek ] };

to

key <AC10> { [ dead_acute, U0313 ] };
key <AC11> { [ dead_grave, U0314 ] };

The problem with the Greek breathing signs is that they must have identical descriptions in the Compose file and in the keyboard (xkb) file. These files seem to be maintained by different persons, who make changes all the time without consulting each other.

AARRGH! Developers, please do not fix things which are not broken! (note 10 August, 2007)

The change suggested in the box above was implemented by several distributions, providing Linux with superior Classical Greek capabilities (compared to Windows).

But in the most recent versions (at least the most recent versions of Debian, and Ubuntu Feisty) things have gone wrong again. The " and : keys, which should act as ‘dead breathing signs’, have now become ‘combining diacriticals’, to be typed after the letter they should combine with. And because ‘combining diacriticals’ are not well supported by most fonts and most applications (see Section 4.11), the results have become pretty much unpredictable.

I do not know which change is responsible for this bug, but if you find you have it, you can cure it by becoming root and editing the Compose file. Replace U10000313 by U0313 and U10000314 by U0314 everywhere; then restart X. Thanks to André Alonso for spotting this problem.
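
If you prefer not to edit by hand, the replacement can also be done with sed. A sketch, assuming the Compose file for your locale lives at the usual place (here the en_US.UTF-8 one; adjust the path to your system):

cp /usr/share/X11/locale/en_US.UTF-8/Compose /usr/share/X11/locale/en_US.UTF-8/Compose.bak
sed -i 's/U10000313/U0313/g; s/U10000314/U0314/g' /usr/share/X11/locale/en_US.UTF-8/Compose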

NOTE (July 2008): After things went wrong again in January, it seems that the developers have finally agreed on a solution. In modern versions of Debian Sid and Ubuntu, the Greek accents work ‘out of the box’.

 

6.5.4 Indic scripts

One way to enter the various Indic scripts is to use the xkb system. There is one keyboard layout file (for many Indic languages) called in, with several variants (see section 6.2).

The Indic scripts are fascinating. How boring the Latin letter ‘I’ looks compared to its Tamil equivalent: இ!

script variant example
Devanāgarī (none; default) देवनागरी
Bengali ben বাংলা
Gujarātī guj ગુજરાતીલિપિ
Gurmukhi guru ਗੁਰਮੁਖੀ
Kannaḍa kan ಕನ್ನಡ
Malayālam mal മലയാളം
Oriya ori ଓଡ଼ିଆ
Tamil tam தமிழ்
Telugu tel తెలుగు
Urdu urd اردو

Using xkb, you ‘switch the keyboard’, so one key will produce one letter in the chosen alphabet. Because the Indic languages have more than 26 letters, in many cases the shift key has to be used (the Indic scripts have no upper case, so the shifted keys are freely available). So for instance in Devanāgarī, one key (L) will produce ‘ta’, another (shift-L) ‘tha’. The keyboard layout is shown here.
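
In terms of the config-file example of section 6.2, a Tamil setup could look like this (a sketch; the in layout and tam variant are taken from the table above):

Option "XkbLayout" "us,in"
Option "XkbVariant" ",tam"
Option "XkbOptions" "grp:lwin_toggle,grp_led:scroll"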

The other way, probably more convenient for people who don’t have an Indic keyboard, is to use an input method for the chosen language, for instance as provided by the m17nlib methods that are available for scim and uim. These methods are based on the Latin alphabet, so you make ‘ta’ by typing ta, and ‘tha’ by typing tha. Indic input methods provide ways to input letters with the special phonological characteristics of the Indic languages; e.g. the ‘retroflex na’ (transcribed by linguists as ‘ṇa’ with a dot under the n) can be input as Na. The details are beyond the scope of this document. See, for instance, m17nlib’s Devanāgarī input method source code: /usr/share/m17n/hi-itrans.mim.

 

7. Copying / pasting

Nowadays, copy/paste of Unicode texts between applications mostly works. Until recently pasting Unicode into Openoffice often produced strange results (e.g. a pasted hiragana HA was displayed as \x{306f} in Openoffice). It also depended on where you were pasting from. But from Openoffice 1.1, copying & pasting works fine.

 

8. Conversion of texts

There are several programs for text conversion to and from UTF-8. The most general seems to be iconv. It uses -f (‘from’) and -t (‘to’) options to indicate the coding systems from which and to which the conversion must be made. Usage (for example):

iconv -f iso-2022-jp -t utf-8 yuki78.html >yuki78u.html

This takes a Japanese JIS (i.e. iso-2022-jp) encoded file called yuki78.html, converts it to UTF-8, and writes the result to yuki78u.html.

The iconv code for the old PC-DOS character set (sometimes called ‘extended ASCII’) is 437.

iconv understands practically every text encoding ever invented. For a list, type iconv -l (the encoding names, it seems, are case insensitive; utf-8 is the same as UTF-8). If there is one type of conversion which you use often, it may be useful to construct an alias (or shorthand) for it, e.g. by entering in /etc/profile the line:

alias myconv="iconv -f iso-2022-jp -t utf-8"

Then you can call simply

myconv yuki78.html >yuki78u.html
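
If you have a whole directory of files to convert, a small shell loop will do it; a sketch (file names and directories are just examples):

mkdir -p utf8
for f in *.html; do iconv -f iso-2022-jp -t utf-8 "$f" > "utf8/$f"; done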

 

9. Printing

At the moment the state of UTF-8 printing seems to be as follows. This is on a system with ghostscript and lprng + magicfilter. I do not know anything about CUPS. I just hope that what follows also applies to CUPS.

 

9.1 Printing pure text files

Printing UTF-8 text normally requires going through PostScript/ghostscript, as no printer natively understands the entire UTF-8 character set. The trick is to find a UTF-8 capable text-to-PostScript converter that makes the result still look like ‘text’, i.e. monospaced, tabs interpreted correctly, and with boxes made by means of ‘box draw characters’ looking OK. A text print system should also not add any extra ‘fluff’ like page numbers and headers/footers; these should, if they are necessary, be added by the program generating or formatting the text, not by the text printer itself.

 

9.1.1 The best way: paps

paps, by Dov Grobgeld, is a text-to-PostScript converter for UTF-8 text. A Debian package (now version 0.6.7) has been in ‘unstable’ since mid-April 2006. If your distribution does not have it, it is easy to compile it from source by the now standard ./configure, make, and (as root) make install. It installs itself in /usr/local. The latest version, 0.6.7, has a few new options, like ‘printing on U.S. letter paper’ (default is A4), headers/footers (but you do not have to use them), and interpretation of tabs and form feeds.
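
For reference, the from-source build amounts to the usual steps (the tarball name, and thus the version, is just an example):

tar xzf paps-0.6.7.tar.gz
cd paps-0.6.7
./configure
make
make install        # as root; installs under /usr/local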

Called as follows: paps --family Freemono --font_scale 10 < myfile.txt > myfile.ps, it produces PostScript files which print beautifully. It looks just like ‘traditional’ text printing, with the vital difference that now the whole UTF-8 character set can be printed (at least, those parts for which you have fonts installed).

  • When you have the latest version of Pango, you also need the latest version of paps (0.6.6+), otherwise there are problems printing tab characters. (note 4 May 2006)
  • Take care: the very latest version of paps (0.6.7) breaks backward compatibility by combining the --family and --font_scale options into one new option, called --font. So now you call e.g.
    paps --font "Freemono 10" < myfile.txt > myfile.ps

paps can be used in ‘input filters’ for the lpr system. I set it up this way, so now I can print UTF-8-coded multilingual text simply by lpr mytest.txt, or cat mytest.txt | lpr. I won’t be surprised if paps eventually becomes the standard text-to-PostScript utility on Linux systems (replacing a2ps, for instance).
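
As an illustration: such an input filter can be little more than a wrapper around paps. A sketch only (using the 0.6.7 option syntax; how the filter is hooked into magicfilter or /etc/printcap depends on your setup):

#!/bin/sh
# convert incoming UTF-8 text on stdin to PostScript on stdout
exec paps --font "Freemono 10"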

NOTE: For compiling paps, a Pango development library (in Debian e.g. apt-get install libpango1.0-dev) is needed; from version 0.6.1 you must also install a document system called Doxygen, which produces some HTML documentation pages for programmers. But at user level, you do not need much documentation. paps --help tells all there is to know.

 

9.1.2 Other solutions for pure text printing

  • The traditional way is to use uniprint (which is part of the yudit distribution). It is difficult to make it print UTF-8 with a proper ‘monospaced’ look.
  • Another interesting program is cedilla, by Juliusz Chroboczek. By doing clever tricks with PostScript it prints UTF-8 encoded accented Latin, Russian and Greek beautifully. It handles multiple combined diacriticals very well. However, it cannot print UTF-8 apart from (extended) Latin, Russian, and Greek.
  • An easy way is to use Openoffice. Openoffice can be used as a command line printer using the ooffice -p command. So, to print a UTF-8 coded text file, called, say, mytest.txt, containing accented letters, kanji, etc., you can call from the command line
    ooffice -p mytest.txt

    The graphical user interface of Openoffice does not come up, so this is reasonably fast. The print results are very good on my system; I probably have good default fonts. At the time of writing there are, however, still a few bugs. One of them is fairly serious: you cannot print HTML source text from the command line (perhaps it does not like the > or the < signs). You get an Openoffice error window if you try.

 

9.2 Printing from Openoffice

For printing its own documents, Openoffice makes a PostScript file for each print job. The file seems to specify the shape of each letter to be printed. So, instead of saying ‘print a letter O in such-and-such a font’ it says something like ‘draw a circle of such-and-such a size and thickness’. In fact it considers all characters to be just geometric shapes; the geometrical information comes from the font files. This works for all fonts, including the most exotic ones; so Openoffice can print anything (including all UTF-8 characters), as long as you have the fonts. Ideally all printing on Linux would work like this.

Printing from Openoffice with lprng

The latest versions of Openoffice support the CUPS printing system. This may mean that you cannot print if you don’t have CUPS. If your Openoffice refuses to print after an upgrade, you should make sure you have a line in /etc/openoffice/openoffice.conf:

export SAL_DISABLE_CUPS=1

 

9.3 Printing from Mozilla products

Printing from Mozilla (and its derivatives: Firefox, Thunderbird, Galeon …) has always been worse than on Windows. However, since the arrival of Firefox 3.0 (called Iceweasel 3.0 by Debian), printing seems to work fine. In earlier versions of this document I mentioned several bugs, related to CTL languages, math printing (MATHML), and mixing CJK characters with Latin characters on the same line; but with version 3.0, they have disappeared.

Printing from Firefox 3 (Iceweasel 3) with lprng

A Dark Conspiracy is trying to force Linux users to use CUPS instead of lprng. Out-of-the-box, firefox 3 can only print through CUPS. To get good-old lpr back, you must put in ~/.gtkrc-2.0:

gtk-print-backends = "file,lpr,cups"

This can also be put in /etc/gtk-2.0/gtkrc (with system-wide effect).

 

9.4 Some tweaks for printing from Mozilla products

In Mozilla and derivatives, a user-side .css file can be used for some customisation of printing. This can be done as follows: Go to your chrome directory (this is somewhere deep inside your ~/.mozilla/default). There is a file called userContent-example.css. Rename this to userContent.css (or create userContent.css if it does not exist), and edit it, adding a so-called ‘media selector’ at the end:

@media print {
    body,table  {
        color: black !important; 
        font-size: 11pt;
        line-height: 13pt;
        }
    :link, :visited {
        color: black !important; 
        text-decoration: underline;
        }
    .moz-text-plain, .moz-text-flowed {
        font-size: 10pt !important;
        font-family: Freemono,"Kochi Mincho" !important;
        line-height: 11pt;
        }
    }

The @media print ‘selector’ command allows you to override some default print settings. It may be worth your while to experiment with this. The example above, for instance, sets the font size for printing to 11 points, for pages that do not themselves specify the print font size (this page does: it sets it at 11 points). The print font size is related to the screen font size (which you set in the Edit, Preferences, Appearance, Fonts dialog). I made the line height two points bigger than the font size, which introduces some extra space between printed lines. Typographers call this leading (pronounced ‘ledding’; in the days of old, this effect was achieved by inserting strips of lead between lines of type), and in many cases it greatly improves the appearance of the printed page (especially if you specified FreeSerif as your default font).

It also specifies that links should be printed in black (instead of the default blue, which often looks rather anaemic on black and white printers). I added the !important directive, so links will be printed in black even if a Web page actually wants to make them blue. Text is also forcibly printed in black.

The section with moz-text… allows you to control how plain text e-mail messages are printed; it seems this is the only way to do this. I specified 10 points Freemono with one point of ‘leading’, and for characters not provided by Freemono (in my case this means Japanese messages), a Mincho font.

I myself also put, outside the @media print selector:

p { color: black !important; }

Now sites with ‘cool’ design (light grey letters on white) are much more readable… apart from the even ‘cooler’ sites with black background, of course. But I never read those.

Each time you change the .css file, Mozilla must be restarted for the changes to take effect.

For optimal results, you also have to manage the print margins properly. There are two independent sets of ‘margin’ settings (which do slightly different things) in Mozilla/Firefox. The settings in the menu File, Print, Properties, Gap from Edge of Paper to Margin determine the positioning of the header and footer. I changed Left and Right from 0.04 inches to 0.3 inches, and Top and Bottom to 0.2 inches. Another margin setting is in the menu File, Page Setup, Margins & Header/Footer. This determines the margins of the text (e.g. Left 18 millimeters, Right 15 mm, Top 12 mm, Bottom 20 mm). Why are inches used in one menu, and metric units in another? Oh well. You need to experiment a bit to get it right, depending on your printer 🙁.

I tried to make this page as printer-friendly as possible, using ‘print CSS’, but several things are still likely to go wrong with the present generation of browsers.
