New multilingual UTF-8 input methods for Linux

1. UTF-8 + Japanese in Mozilla

The uim system (‘Universal Input Method’) can be used as your Japanese conversion interface program. It is an alternative to im-ja (and the now-obsolete kinput2). anthy can be used as the conversion engine (replacing canna, skk, etc.). With the uim/anthy combination you can get Japanese input in GTK2 programs, including Mozilla, and also European accents, macrons, etc. So in an e-mail you can explain that Uchisaiwaichō = 内幸町, and été = 夏.

How to do this:

  1. Install uim and anthy: in Debian unstable this can be done by means of
    apt-get install uim anthy
    
  2. Then make sure that the following commands are run when you start X:
    export XMODIFIERS=@im=uim-anthy
    export GTK_IM_MODULE=uim-anthy
    uim-helper-toolbar-gtk-systray&
    
    Exactly how you can execute these extra start-up commands depends on your Linux distribution. In Debian, I did this:

    1. Become root.
    2. Go to the directory /etc/X11/Xsession.d.
    3. Create a file called 93fep containing the following:
      #!/bin/sh
      export XMODIFIERS=@im=uim-anthy
      export GTK_IM_MODULE=uim-anthy
      uim-helper-toolbar-gtk-systray&
      
    4. Do chmod +x 93fep.
    5. Restart X.
    6. Check that the above commands actually have been executed, by means of env and ps aux.

    This is probably not The Right Way. It is also a system-wide setting (most people would prefer a per-user setting). But it works.

    On systems other than Debian, there may be other methods available, like editing ~/.xinitrc. It appears that in Mandrake, the extra commands are set in a file called ~/.i18n.

  3. You will get a clickable selector on the ‘taskbar’, looking very much like Microsoft’s ‘Japanese IME’. With this, you can select various input methods in GTK2 programs (including Mozilla). Select ひらがな to input Japanese; select 直接入力 (‘direct input’) to have direct keyboard access (with the possibility of entering accented European letters, using the AltGr and Compose keys, etc.). Pressing Shift-Space also switches between ‘direct input’ and ‘Japanese input’ mode. The selector widget also allows switching to a Japanese kana keyboard layout (in which, e.g., typing QWERTY results in たていすかん). But in fact even most Japanese don’t use this keyboard layout; they type, e.g., ta to get た.
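
The check in step 2.6 (env and ps aux) could be sketched as a small script. This is only a sketch: check_im_var is a made-up helper name, and the expected values are simply the ones from the start-up file above.

```shell
#!/bin/sh
# Sketch: report whether the start-up settings from step 2 are active.
# check_im_var is a hypothetical helper name, not part of uim.

check_im_var() {
    # $1 = variable name, $2 = expected value
    eval "actual=\${$1:-}"
    if [ "$actual" = "$2" ]; then
        echo "$1 ok"
    else
        echo "$1 is '$actual', expected '$2'"
    fi
}

check_im_var XMODIFIERS @im=uim-anthy
check_im_var GTK_IM_MODULE uim-anthy

# ps aux lists every process; writing [u] instead of u stops grep from
# matching its own command line.
if ps aux 2>/dev/null | grep -q '[u]im-helper-toolbar-gtk-systray'; then
    echo "toolbar running"
else
    echo "toolbar not running"
fi
```

Run it in a terminal inside the X session you just restarted; a fresh login on a text console will not have these variables set.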

You don’t have to worry much about ‘locales’ as long as your locale is a UTF-8 one (e.g. en_GB.UTF-8). OpenOffice is an exception; see section 3.

2. Other languages in GTK2 programs with uim/anthy

In GTK2 programs like gucharmap and bluefish, right-clicking produces an ‘input method’ selection menu (this menu is not available in Mozilla). If one of the methods in that menu (apart from uim) is chosen, the selector widget in your taskbar turns into something like a lowercase o. When you then actually switch the chosen method on (by pressing Shift-Space, or by clicking on the widget), it becomes an uppercase O.

Some of the input methods in the menu do not work if you only have uim and anthy installed; they require installation of other components. Among the ones that work straight away are:

  • uim-ipa: International Phonetic Alphabet. E.g. ng becomes ŋ, zh becomes ʒ.
  • Korean (select hangul2). This works the same as with im-ja, but you complete syllables by means of <space> or <enter>. So you can type ‘Seoul’ (서울) by entering tj<space>dnf<space>.
  • Chinese Pinyin (simplified). I do not know Chinese myself, but it seems to work: nimenhao becomes 你们好.
  • Chinese Pinyin-Big5 (traditional). It appears you have to input the tones: ni3men1hao3 becomes 你們好.

The input selection menu also offers some languages which come with GTK2 by default, like Cyrillic, Amharic, and Broken Thai (I always wonder why they include this if it is broken).

3. Japanese input in OpenOffice using uim/anthy

To get Japanese input in non-GTK2 programs (the most important program in this category is OpenOffice) with uim/anthy, you have to use the ‘bridge’ program uim-xim. This is automatically installed by Debian when you call apt-get install uim anthy as described in Section 1. uim-xim connects uim to the XIM input system expected by OpenOffice (this description is vague because I do not understand the technical issues behind this).

uim-xim&
LC_CTYPE=ja_JP.UTF-8 oowriter

Note the & sign, which must be put after the call to uim-xim so that it runs in the background. uim-xim& can also be called in your X startup. Unfortunately, this is a ‘monolingual’ solution. No more European accents. This situation will, it seems, only improve when OpenOffice starts allowing the use of input systems other than XIM. This may happen fairly soon.
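
If you start OpenOffice this way often, the two commands can be wrapped in a small launcher. The following is a sketch only: run_japanese and BRIDGE are names I made up, and it assumes XMODIFIERS is already exported as in Section 1.

```shell
#!/bin/sh
# Sketch of a launcher; run_japanese and BRIDGE are hypothetical names.
# Assumes XMODIFIERS is already exported as described in Section 1.

BRIDGE=${BRIDGE:-uim-xim}

run_japanese() {
    # Start the bridge in the background unless one is already running.
    if ! pgrep -x "$BRIDGE" >/dev/null 2>&1; then
        "$BRIDGE" &
    fi
    # Set LC_CTYPE for this one command only, not for the whole session.
    LC_CTYPE=ja_JP.UTF-8 "$@"
}

# Usage: run_japanese oowriter
```

Because LC_CTYPE is set only for the launched program, the rest of the session keeps its normal locale.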

4. UTF-8 + Japanese input in text-mode programs running in xterm

You can get Japanese input in text-mode programs by installing (in addition to uim and anthy) a program called uim-fep (FEP being the abbreviation of ‘Front-End Processor’). This is much easier than the old mess with kterm, kinput2, canna, etc. Just install the Debian package:

apt-get install uim-fep

You must also create (if you do not have it) a file called .uim in your home directory. It must have a line:

(define-key anthy-latin-key? '("<Control>j" generic-off-key?))

(thanks to Hiroyuki Tokunaga for this information). Then, in any UTF-8 capable xterm (within a UTF-8 locale), you just call

uim-fep 

There is now a 直接入力 (‘Direct Input’) label on the bottom left in the xterm. You can run UTF-8 capable text-mode programs (like joe), enter accented characters, etc. If you now press Control-j, the label changes to ひらがな and you can enter Japanese: the words or phrases you type become hiragana, and by pressing SPACE you can select how they are to be converted into kanji. To get back to direct input (直接入力) mode, press Control-j again.

To get out of uim-fep completely (the label disappears): press Control-d while in 直接入力 mode.

NOTE: The ‘on’ key (which puts you into ひらがな mode) is Control-j by default. In the example above, I made the ‘off’ key also Control-j. If you run programs in xterm which need the Control-j key themselves, it is better to use a different ‘off’ key (e.g. Control-l).
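
Adding the key binding to ~/.uim can be done by hand in an editor, but a small helper avoids duplicate lines if you run it more than once. This is a sketch; add_uim_line is a made-up name, and the line it adds in the usage example is the Control-j one from the text above.

```shell
#!/bin/sh
# Sketch: append a line to a uim config file unless that exact line is
# already present. add_uim_line is a hypothetical helper name.

add_uim_line() {
    # $1 = config file (normally ~/.uim), $2 = line to add
    touch "$1"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    if ! grep -qxF -e "$2" "$1"; then
        printf '%s\n' "$2" >> "$1"
    fi
}

# Usage, with the binding quoted earlier in this section:
# add_uim_line "$HOME/.uim" "(define-key anthy-latin-key? '(\"<Control>j\" generic-off-key?))"
```

For a different ‘off’ key you would presumably change only the key string (e.g. "<Control>l"); I am assuming here that the syntax stays the same. In that case, remove the Control-j line rather than keeping both.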

5. Dictionary maintenance in anthy

anthy comes with a program called anthy-dic-tool, which lets you add a user dictionary containing conversions for the specialized terms that you need. A wiki page exists which explains (in Japanese) how to use it. Each user has a personal ~/.anthy directory into which a skeleton for a user dictionary must be placed, using the commands:

anthy-dic-tool --dump > ~/.anthy/private-dic.src
chmod 600 ~/.anthy/private-dic.src

(Just copy the first line to the command line, press ENTER, then do the same with the second line).

Now you have a file called private-dic.src in your .anthy directory, which must be edited to add new words to the conversion system. The file itself contains some comments which explain how it must be edited (as does the wiki page mentioned above). Unfortunately, the file is (and must be) in EUC-JP encoding. In a UTF-8 environment this involves some use of iconv for converting it to and fro. It is annoying that a ‘modern’ system like anthy still uses a legacy encoding for its data, but anthy itself works fine in a UTF-8 environment.
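
The conversion with iconv could look like the sketch below. The two function names and the working-copy file name private-dic.utf8 are made up for illustration; only iconv and its -f/-t options are real.

```shell
#!/bin/sh
# Sketch: convert the dictionary source between EUC-JP and UTF-8.
# The function names are hypothetical; input and output must be
# different files, because the redirection truncates the output first.

euc_to_utf8() {
    # $1 = EUC-JP input file, $2 = UTF-8 output file
    iconv -f EUC-JP -t UTF-8 "$1" > "$2"
}

utf8_to_euc() {
    # $1 = UTF-8 input file, $2 = EUC-JP output file
    iconv -f UTF-8 -t EUC-JP "$1" > "$2"
}

# Typical round trip (private-dic.utf8 is a hypothetical working copy):
# euc_to_utf8 ~/.anthy/private-dic.src ~/.anthy/private-dic.utf8
# ... edit private-dic.utf8 in a UTF-8 capable editor ...
# utf8_to_euc ~/.anthy/private-dic.utf8 ~/.anthy/private-dic.src
```

iconv stops with an error if the edited file contains a character that does not exist in EUC-JP, so check its exit status before loading the result.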

For every new word you have to enter rather a lot of information (whether it can combine with な, with する, etc.). This is of course to enable conversion of whole phrases and not just single words.

For instance to enable conversion of えんがくきょう to 圓覺經 (1), you would put in private-dic.src:

えんがくきょう 1 圓覺經   (See note below)
品詞 = 名詞   (word type: noun)
な接続 = n   (cannot be followed by na)
さ接続 = n   (cannot be followed by sa)
する接続 = n   (cannot be followed by suru)
語幹のみで文節 = y   (can be a phrase by itself)
格助詞接続 = y   (can be followed by particles — like no, etc.)

This is an entry for a noun. Other entry types (like proper names) take fewer lines. Entries must be separated with blank lines. NOTE: the ‘1’ in the first line is a ‘frequency’ indicator; ‘1’ seems to work OK.
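
If you add many nouns, typing these seven lines each time gets tedious. The sketch below generates such an entry; make_noun_entry is a made-up name, and the flag values are simply copied from the example above, so they may not suit every noun.

```shell
#!/bin/sh
# Sketch: print a noun entry in the format shown above, followed by the
# blank line that separates entries. make_noun_entry is a hypothetical
# helper; the frequency is fixed at 1 and the flags are copied from the
# example entry.

make_noun_entry() {
    # $1 = reading in hiragana, $2 = written form
    cat <<EOF
$1 1 $2
品詞 = 名詞
な接続 = n
さ接続 = n
する接続 = n
語幹のみで文節 = y
格助詞接続 = y

EOF
}

# Usage in a UTF-8 terminal, converting to EUC-JP on the way in:
# make_noun_entry えんがくきょう 圓覺經 | iconv -f UTF-8 -t EUC-JP >> ~/.anthy/private-dic.src
```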

The last step then is

cat ~/.anthy/private-dic.src | anthy-dic-tool --load

This modifies a file called ~/.anthy/last-record2_default, which therefore is the user dictionary itself.
