localeSupport(2)

LOCALE SUPPORT

Locale support within MicroEmacs handles the hardware and software configuration with respect to location, including:

Character Sets
Word Characters
Keyboard Support
Language
Spell Support

There are many other locale problems which are not addressed in this help page. Supporting different locale configurations often requires specific hardware (a locale specific keyboard) and knowledge of the language and customs of the region. This makes it a very difficult area for one localized development team to support, so JASSPA relies heavily on the user base to report locale issues.

Note on Names and IDs

The language name is not sufficient to identify a locale (Mexican Spanish is different to Spanish Spanish) and neither is the country name (two languages are commonly used in Belgium). So, before we have really started, the first problem of what to call the locale has no standard answer. Call it what you like, but please try to call it something meaningful so that others may understand and benefit from your work.

In addition, the internal id and data file names have a length limit of just four characters due to the "8.3" naming convention of MS-DOS. The standard adopted by JASSPA MicroEmacs for the internal locale id is to combine the 2 letter ISO language name (ISO 639-1) with the 2 letter ISO country name (ISO 3166-1). Should the locale encompass more than one country, the most appropriate country id is selected.
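The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function name make_locale_id is hypothetical, and folding the id to lower case is an assumption rather than something stated on this page.

```python
def make_locale_id(language: str, country: str) -> str:
    """Combine a 2 letter ISO 639-1 language code with a 2 letter
    ISO 3166-1 country code into a four character locale id."""
    for code in (language, country):
        # Both parts must be exactly two ASCII letters.
        if len(code) != 2 or not (code.isascii() and code.isalpha()):
            raise ValueError("expected a two letter code, got %r" % code)
    # Assumption: the id is stored in lower case.
    return (language + country).lower()


# Mexican Spanish and Spanish Spanish get distinct ids.
print(make_locale_id("es", "MX"))
print(make_locale_id("es", "ES"))
```

The four character result leaves room for a three letter prefix within the eight character base name allowed by the "8.3" convention.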

Character Sets

A character set is the mapping of an integer number to a display symbol (i.e. character). The ASCII standard defines a mapping of numbers to the standard English characters. This standard is well defined and accepted; as a result, the character set rarely causes a problem for plain English.

Problems occur when a language uses characters located outside the ASCII standard, such as letters with accents, letters which are not Latin based (e.g. the Greek alphabet) and particularly languages that have a large number of 'letters' or logograms (e.g. Chinese), which require multiple bytes to encode. There are many different character sets to choose between, and if the wrong character set is selected then the incorrect character translation is performed, resulting in incorrect command behaviour and/or character display.
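The effect of selecting the wrong character set can be reproduced outside MicroEmacs. The following Python sketch (an illustration only, using Python's standard codecs rather than anything from MicroEmacs) interprets the same four bytes under three different single byte character sets; the helper name decode_with is hypothetical.

```python
# The bytes for "caf" followed by the single byte 0xE9.
RAW = bytes([0x63, 0x61, 0x66, 0xE9])


def decode_with(data: bytes, charset: str) -> str:
    """Interpret the bytes using the named single byte character set."""
    return data.decode(charset)


# 0xE9 is e acute in CP-1252 and ISO-8859-1, but a Greek theta in PC-437.
for charset in ("cp1252", "iso8859-1", "cp437"):
    print(charset, decode_with(RAW, charset))
```

The bytes never change; only the interpretation does, which is why text written under one character set shows incorrect symbols when displayed with another.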

For several reasons the core of MicroEmacs can only support a single byte character set, so it cannot support languages like Chinese.

Internal Character Set

Before the adoption of international standards such as Unicode, each operating system vendor would typically create their own character set, or code page, that best met the needs of their customers. As a result, many are in use, making it impossible to properly support them all directly. The use of a standard internal character set allows for the creation of a single set of spelling rules and dictionary that can be used by all compatible character sets with a simple mapping.

MicroEmacs does not intrinsically require, nor is it biased towards, any one character set or group of sets; however, the main Windows code pages (i.e. CP-1252, CP-1253 etc.) have been adopted, as these typically have the most defined characters and so provide the best language support. Therefore all supported languages use the most appropriate Windows character set for their spelling rules and dictionaries.

Displayed Character Set

The current display character set being used is not determined by the current operating system, but by the font being used in MicroEmacs (which may be limited by the OS). For example, UNIX X-Term systems typically use ISO-8859 based fonts, but fonts using other code pages, such as CP-1252 (which has numerous additional important characters such as the Euro symbol), can be found in pcf format, installed and used by MicroEmacs.

If the character display looks incorrect in MicroEmacs, such as text containing incorrect or weird symbols, use the insert-symbol(3) command to review the display character set actually being used and verify what it is. Then use the platform page of the user-setup(3) command to check and correct the font and character set MicroEmacs is configured to use.

If the problem persists (i.e. because the character set used to write the text is not supported on your current system) use the change-buffer-charset(3) command to convert the text to the current display character set.

If your character set is not supported then first make sure that MicroEmacs will draw all of the characters to be used. By default MicroEmacs does not draw some characters directly, as the symbol may not be defined. When a character is not defined there will typically be a gap or space in the text at the unknown character; in some cases there may be no space at all, which makes the text very hard to use. The insert-symbol(3) command (Edit->Insert Symbol) is a good way of looking at which characters can be used with the current character set.

Whether a character can be rendered (when in main text) or poked (drawn by screen-poke(2) or osd(2)) is defined by the d and p flags of the set-char-mask(2) command. The characters that are used when drawing MicroEmacs's window borders or osd dialogs are set via the $box-chars(5) and $window-chars(5) variables.

MicroEmacs attempts to improve the availability of useful graphics characters on Windows and UNIX X-Term interfaces. The characters between 0 and 31 are typically control characters with no graphical representation (e.g. new-line, backspace, tab etc.). If bit 0x10000 of the $system(5) variable is set then MicroEmacs renders its own set of characters in this range. These characters are typically used for drawing boxes and scroll-bars.
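Testing a single bit of a flag word such as $system(5) is a plain bitwise AND. The Python sketch below is an illustration only: the sample values passed in are made up, and the names OWN_CHARS_BIT and renders_own_chars are hypothetical.

```python
# Bit of $system(5) named in the text above.
OWN_CHARS_BIT = 0x10000


def renders_own_chars(system_value: int) -> bool:
    """Return True when bit 0x10000 is set in the given flag word."""
    return (system_value & OWN_CHARS_BIT) != 0


# Hypothetical $system values, for illustration only.
print(renders_own_chars(0x810000))
print(renders_own_chars(0x00ffff))
```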

With so many character sets, each with its own character mappings, the problem of spelling dictionary support is also tied to the locale. MicroEmacs uses the ISO standard character sets (ISO 8859) internally for word and spelling support, and therefore a mapping between the ISO standard and the user character set is required. This mapping is defined by using the 'M' flag of the set-char-mask(2) command.
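The idea of a byte-for-byte mapping between two single byte character sets can be sketched in Python as a 256 entry translation table. This illustrates the concept only and is not how set-char-mask(2) stores its data; the function name build_translation_table and the use of '?' for characters with no equivalent are assumptions made for the example.

```python
def build_translation_table(internal: str, display: str,
                            fallback: bytes = b"?") -> bytes:
    """Build a 256 byte table mapping each code of the internal character
    set to the code of the same character in the display character set."""
    table = bytearray(range(256))  # codes below 0x80 are left unchanged
    for code in range(0x80, 0x100):
        try:
            char = bytes([code]).decode(internal)
            table[code] = char.encode(display)[0]
        except UnicodeError:
            # Assumption: characters with no equivalent become '?'.
            table[code] = fallback[0]
    return bytes(table)


TABLE = build_translation_table("iso8859-1", "cp437")
# e-grave is 0xE8 in ISO-8859-1 and 0x8A in PC-437.
print(b"r\xe8gle".translate(TABLE))
```

E-grave (0xC8 in ISO-8859-1) has no PC-437 equivalent, so in this sketch it falls back to '?'; a real mapping has to choose a substitute for such characters.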

The user may declare the current character set in the platform page of user-setup(3). All the settings required for supporting each character set may be found in the charset.emf macro file, so if your character set is not supported, this is the file to edit.

Unicode support

Unicode support within MicroEmacs is very limited due to its single byte character set limitation. However, flag c of the set-char-mask(2) command can be used to map characters 0x80 to 0xff of the display character set to their Unicode equivalents. Correctly mapping these characters allows better handling of text transfer between MicroEmacs and the system clipboard (see bit 0x800000 of $system(5)) and allows conversion to and from UTF-8 using the change-buffer-charset(3) command.
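The kind of table needed, one Unicode code point for each character 0x80 to 0xff, can be illustrated with Python's standard codecs. This is a sketch of the mapping itself, not of the set-char-mask(2) syntax; the function name unicode_map is hypothetical and CP-1252 is used only as an example character set.

```python
def unicode_map(charset: str) -> dict:
    """Map each code 0x80..0xff of a single byte character set to its
    Unicode code point, skipping codes the character set leaves undefined."""
    mapping = {}
    for code in range(0x80, 0x100):
        try:
            mapping[code] = ord(bytes([code]).decode(charset))
        except UnicodeDecodeError:
            pass  # undefined in this character set
    return mapping


CP1252 = unicode_map("cp1252")
# The Euro symbol sits at 0x80 in CP-1252 and at U+20AC in Unicode.
print(hex(CP1252[0x80]))
```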

Character 0x07 (Bell) is used to denote an unsupported character, while change-buffer-charset has an option to preserve unsupported Unicode characters by encoding them as a 3 or 5 digit hexadecimal number prefixed with the character 0x01 or 0x02. Care must be taken to avoid splitting or corrupting the resultant 4 or 6 byte strings; however, once editing is complete, the change-buffer-charset command can be used to convert the buffer back to UTF-8 before saving.

Word characters

Word characters are those characters which are deemed to be part of a word; numbers are usually included. Many MicroEmacs commands use the 'Word' character set, such as forward-word(2) and upper-case-word(2). The characters that form the word class should be determined by the display character set being used rather than by the language, because which characters are letters within the display character set does not change, and flagging all letters improves general usability. For example, English does not typically use accented letters but it does inherit words that do, e.g. fiancé, so it makes more sense always to treat é (e acute) as a letter.

The 'a' flag of the set-char-mask(2) command is used to specify whether a character is part of a word. You must specify the uppercase letter and then the lowercase equivalent so that the case conversion functions work correctly.

This is unfortunately made a little more tricky by the requirement that this list must be specified in the most appropriate character set (see Internal Character Set section). When extending the word character set, the characters have to be mapped to the current character set, which may not support all the required characters. For example, the PC-437 DOS character set has an e-grave (è) but no E-grave (È), so the E-grave is mapped to the normal E. As a result, when writing French text the case changing commands behave oddly, for example:

    règle -> REGLE -> règlè

The conversion of all 'E's to 'è' is an undesirable side effect of 'È' being mapped to E. This can be avoided by redefining the base letter again at the end of the word character list, for example:

set-char-mask "a" "ÈèEe"
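The side effect and its fix can be modelled with a small Python sketch. It assumes, purely for illustration, that case tables are built from uppercase/lowercase pairs in order and that a later pair overrides an earlier one for the same letter; all names are hypothetical and this is not MicroEmacs's implementation.

```python
def build_case_tables(pairs):
    """Build upper and lower case tables from (upper, lower) pairs.
    Assumption: a later pair overrides an earlier one for the same key."""
    to_upper, to_lower = {}, {}
    for upper, lower in pairs:
        to_upper[lower] = upper
        to_lower[upper] = lower
    return to_upper, to_lower


def convert(text, table):
    """Translate each character through the table, leaving others as-is."""
    return "".join(table.get(ch, ch) for ch in text)


def round_trip(word, pairs):
    """Upper case the word, then lower case the result."""
    to_upper, to_lower = build_case_tables(pairs)
    return convert(convert(word, to_upper), to_lower)


BASE = [("R", "r"), ("G", "g"), ("L", "l"), ("E", "e")]
# With no E-grave available, the pair (E-grave, e-grave) becomes ("E", "è").
PROBLEM = BASE + [("E", "è")]
# Redefining the base letter last restores the plain E/e pairing.
FIXED = PROBLEM + [("E", "e")]

print(round_trip("règle", PROBLEM))
print(round_trip("règle", FIXED))
```

In this model the fix cannot bring back the accent lost when upper-casing, but it stops every 'E' from becoming 'è' on the way back down.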

Keyboard Support

The keyboard to character mapping is defined in the Start-Up page of user-setup(3), where the keyboard may be selected from a list of known keyboards. If your keyboard is not present, or is not working correctly, then this section should allow you to fix the problem (please send JASSPA the fix).

Most operating systems seem to handle keyboard mappings, with the exception of MS-Windows, which requires a helping hand. The root of the problems with MS-Windows is its own locale character mappings, which change the visibility status of the keyboard messages and so conflict with Emacs keystroke bindings. To support key-bindings like 'C-tab' or 'S-return' a low level keyboard interface is required, but this can lead to strange problems with the more obscure keys, particularly with the 'Alt Gr' accented letter keys. For example, on American keyboards pressing 'C-#' results in two 'C-#' key events being generated; this peculiarity only occurs with this one key. On a British keyboard the same key generates a 'C-#' followed by a 'C-\'.

This problem can be diagnosed using the $recent-keys(5) variable. Simply type an obvious character, e.g. 'A', then the offending key, followed by another obvious key ('B'), then look for this key sequence in the $recent-keys variable (use the list-variables(2) or describe-variable(2) command). So for the above British keyboard problem $recent-keys would be:

    B C-\\ C-# A

($recent-keys lists the keys backwards). Once you have found the key sequence generated by the key, the problem may be fixed using the translate-key(2) command to automatically convert the incorrect key sequence into the required key. For the problem above the following line is required:

translate-key "C-# C-\\" "C-#"
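The diagnosis above amounts to reversing the $recent-keys listing and taking whatever lies between the two marker keys. The Python sketch below illustrates this, assuming the keys are separated by spaces and that the doubled backslash is kept exactly as displayed; the function names are hypothetical.

```python
def generated_keys(recent_keys: str, before: str = "A", after: str = "B"):
    """Return the keys generated between the two marker keys, in the order
    they were typed ($recent-keys lists the most recent key first)."""
    typed = recent_keys.split()[::-1]
    start = typed.index(before) + 1
    end = typed.index(after, start)
    return typed[start:end]


def translate_key_line(sequence, wanted: str) -> str:
    """Format a translate-key line mapping the sequence to the wanted key."""
    return 'translate-key "{}" "{}"'.format(" ".join(sequence), wanted)


keys = generated_keys(r"B C-\\ C-# A")
print(translate_key_line(keys, "C-#"))
```

For the British keyboard example this reproduces the translate-key line given above.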

Note that once a key sequence has been translated, everything, including $recent-keys, receives only the translated key. So if you suspect a problem with the existing definition, change the keyboard type in user-setup to Default so that no translations are performed, then quit and restart MicroEmacs before attempting to re-diagnose the problem.

All the settings required for supporting each keyboard may be found in the keyboard.emf macro file, so if your keyboard is not supported, this is the file you need to edit.

Language

The current language can be set in the Start-up page of user-setup(3). Setting the language leads to the following two variables being set:

.spell.language

Note on Names and IDs

.change-font.ln-type

24

.change-font.cs-type

charset.emf

If your language is not supported you will need to add it to the list and define these two variables found in the language.emf macro file.

Spell Support

The current language is set using the Language setting on the General page of user-setup(3). If your required language is not listed you must first create the basic language support by following the guidelines in the Word characters section above. If your language is listed, select it and enable it by either pressing Current or saving and restarting MicroEmacs. In a suitable test buffer run the spelling checker; one of three things will happen:

The Spelling Checker dialog opens and spelling is checked

The spelling checker is supported by the current language and can be used (the rules and dictionaries have been downloaded and installed).

Dialog opens with the following error message:

Rules and dictionaries for language "XXXX" 
   are not available, please download.

The spelling checker is supported by the current language but the required rules and dictionaries have not been downloaded. You should be able to download them from the JASSPA website, see Contact Information. Once downloaded they must be placed in the MicroEmacs search path, i.e. where the other macro files (like me.emf) are located.

Dialog opens with the following error message:

Language "XXXX" not supported!

The spelling checker is not supported by the current language, see the following Adding Spell Support section.

Adding Spell Support

To support a language MicroEmacs's spelling checker requires a base word dictionary and a set of rules which define what words can be derived from each base word in the dictionary. The concept and format of the word list and rules are compatible with the Free Software Foundation GNU ispell(1), myspell(1) and hunspell(1) packages. Not all features of hunspell are supported by MicroEmacs, most notably composite rules and right to left languages.
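The idea of deriving words from a base word with a rule can be illustrated in Python. The sketch below models a simple suffix rule of the kind used by these packages (strip some letters, add others, subject to a condition on the base word); it is not the add-spell-rule(2) syntax, and the function name and sample rules are hypothetical.

```python
import re


def apply_suffix(word: str, strip: str, add: str, condition: str):
    """Apply a suffix rule to a base word: if the word matches the
    condition, remove the strip letters from its end and append the add
    letters. Returns None when the rule does not apply."""
    if not re.search(condition, word):
        return None
    if strip and not word.endswith(strip):
        return None
    base = word[:len(word) - len(strip)] if strip else word
    return base + add


# Hypothetical English rule: consonant + 'y' becomes 'ies'.
print(apply_suffix("pony", "y", "ies", r"[^aeiou]y$"))
# The same rule does not apply to a vowel + 'y' word.
print(apply_suffix("day", "y", "ies", r"[^aeiou]y$"))
```

A dictionary entry then only needs the base word plus the names of the rules that may be applied to it, which keeps the word list small.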

The best starting point is to obtain rules and dictionary files (<lang>.aff & <lang>.dic) for one of these packages; these can usually be found on the web. Once these have been obtained, the rules file (or affix file) must be converted to a MicroEmacs macro file which calls the add-spell-rule(2) command to define the rules. The rule file should be named "lsr<lang-id>.emf", where "<lang-id>" is the spelling language id, determined by the .spell.language variable set in the language.emf macro file.

The spellutl.emf macro file contains the command ispell-convert, which will attempt to convert the aff and dic files, and the macros spell-check-list, spell-check-guess and ispell-test, which help to verify the accuracy of the conversion. See existing spelling rule files (lsr*.emf) for examples and the help on command add-spell-rule(2).

Note: the character set used by the rules should be the most appropriate for the language (see Internal Character Set section). This can make the process more difficult, but adding support for the base character sets involved to charset.emf will enable the command change-buffer-charset(3) to function correctly and avoid some of the more subtle issues that can arise from incompatible character sets. If you are having difficulty with this please e-mail JASSPA Support.

Once the conversion process has completed it should have generated a "lsr<lang-id>.emf" rules file and a "lsm<lang-id>.edf" dictionary file. With the rapid increase in memory capacity there is little point these days in splitting dictionaries into two files, so the use of an extended dictionary file for obscure words has been dropped.
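The two file names follow directly from the language id, and a four character id keeps both within the "8.3" limit. A minimal Python sketch, in which the function names and the sample id "engb" are hypothetical:

```python
def spell_file_names(lang_id: str):
    """Return the (rules, dictionary) file names for a spelling language id."""
    return "lsr{}.emf".format(lang_id), "lsm{}.edf".format(lang_id)


def fits_8_3(file_name: str) -> bool:
    """Check a name against the MS-DOS 8.3 limit: at most 8 characters
    before the dot and at most 3 after it."""
    base, _, extension = file_name.partition(".")
    return 1 <= len(base) <= 8 and len(extension) <= 3


# "engb" is a sample id used for illustration only.
for name in spell_file_names("engb"):
    print(name, fits_8_3(name))
```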

Once the generated rule and dictionary files have been placed in the MicroEmacs search path, the spelling checker should find and use them. Please submit your generated support to MicroEmacs for others to benefit.