set-char-mask(2)


NAME

set-char-mask - Set character word mask

SYNOPSIS

n set-char-mask "flags" ["value"]

DESCRIPTION

set-char-mask returns or modifies the setting of MicroEmacs internal character tables. The argument n defines the action to be taken, as follows:-

-1

Removes characters from the given set.

0

Returns characters in the given set in $result(5).

1

Adds characters to the given set.
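The three actions can be sketched as follows. This is an illustrative fragment rather than a definitive recipe: it assumes that the numeric argument is given as a prefix as shown in the SYNOPSIS, that an argument of 0 is the action which returns the set in $result(5), and that omitting the argument adds characters, as the calls in the EXAMPLE section do. User set 2 and the '_' character are chosen only for illustration, and ml-write(2) is used only to display the result.

```emf
; Add '_' to user set 2 (no numeric argument, as in the EXAMPLE section)
set-char-mask "2" "_"
; Return the current members of user set 2 in $result and display them
0 set-char-mask "2"
ml-write &cat "User set 2: " $result
; Remove '_' from user set 2 again
-1 set-char-mask "2" "_"
```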

The first argument "flags" determines the required character set as follows:-

M

Character set Map. Internally MicroEmacs uses a standard character set, or code page, for each language so that only one set of spelling rules and dictionaries is required. The current display character set may differ, however, so a mapping is required to enable MicroEmacs to convert between the two. By convention MicroEmacs uses the most appropriate Windows code page for its internal character set, e.g. Windows CP-1252 is the internal character set used for American, British, French, German, Spanish, Portuguese and Italian while CP-1253 is used for Greek. Windows code pages are used in preference to ISO-8859 character sets because they tend to have more characters available and so have slightly better language coverage.

The "value" for the M flag must be a string containing pairs of characters, an internal character set character followed by its display character set equivalent. For example, if the current internal character set is CP-1252 and the display character set is CP-437 (a common DOS code page), then 'e' acute must be mapped from 0xe9 in CP-1252 to 0x82 in CP-437 so "value" should contain "\xe9\x82" as well as many other mappings. All characters in the display character set should be mapped if possible rather than just the letters most commonly used by the current language as this creates the best support for any text entered.

Some display character sets may not have all the characters available in the internal character set, for instance DOS code page CP-437 does not have an upper-case 'E' grave. In this case an ordinary 'E' should be used as a sensible replacement, i.e. "`EE" (where `E is an 'E' grave) as this is the best that can be done given the limitations of the current display character set.

This flag cannot be incrementally altered. Any call to alter this set resets all the character tables, so the character mapping must be performed first and in a single call. No other set may be altered in the same call.

0

value

The mapping is used by MicroEmacs on Windows and UNIX XTerm systems to better map the system clipboard text to the current display character set. For example XTerm typically uses an A-F code page which does not support a euro currency symbol, however CP-1252 based fonts can be installed and used correctly by MicroEmacs allowing support for the euro and many other characters. This mapping then allows MicroEmacs to correctly handle these characters when copied between different applications. The mapping table is also used by the &uni(4) functions, expand-iso-accents(3) and change-buffer-charset(3) commands.

^?

?

^A

\xhh

hh

.

$result

value

Note that the returned character list will pair all lower-case characters with their upper-case equivalent letters first.

a

z

A

Z

0

9

a

f

$buffer-mask(5)

.

-

'

-

.

1, 2, 3 & 4

forward-word(2)

_

$buffer-mask(5)

C-x k

[y/n]

As with flag M, this cannot be incrementally altered. Any call to set this mapping first resets the mapping table, so the mapping must be performed in a single call. No other set may be altered in the same call. When setting, the "value" must supply pairs of characters, the keyboard character followed by the character to map it to, typically an ASCII character.

Unless stated otherwise, multiple flags may be specified at the same time returning a combined character set or setting multiple properties for the given "value" characters.

EXAMPLE

For many UNIX XTerm fonts the best characters to use for $box-chars(5) (used in drawing osd(2) dialogs) lie in the range 0x0B to 0x19. For example the vertical bar is '\x19', the top left hand corner is '\x0D' etc. These characters are by default set to be not displayable or pokable which renders them useless. They can be made displayable and pokable as follows:-

set-char-mask "dp" "\x19\x0D\x0C\x0E\x0B\x18\x15\x0F\x16\x17\x12"

MicroEmacs variables have either a '$', '#', '%', ':' or '.' character prepended to their name, and they may also contain a '-' character in the body of their name. It is preferable for these characters to be part of the variable 'word' so commands like forward-kill-word(2) can work correctly. This may be achieved by adding these characters to user set 2 and setting the $buffer-mask(5) variable to include set 2, as follows:

set-char-mask "2" "$#%:.-" 

define-macro fhook-emf 
    set-variable $buffer-mask "luh2" 
    . 
    . 
!emacro

For the examples below only the following subset of characters will be used:-

Character               Win CP-1252   Cmd CP-850     DOS CP-437 

Capital A (A)           A             A              A 
Capital A grave (`A)    \xC0          \xB7           No equivalent 
Capital A acute ('A)    \xC1          \x90           No equivalent 
Small a (a)             a             a              a 
Small a grave (`a)      \xE0          \x85           \x85
Small a acute ('a)      \xE1          \xA0           \xA0

As the spell checker for French operates in Windows CP-1252, the character font mapping (flag M) must be correctly set up for spell checking to operate correctly. When CP-1252 is also used as the display character set the mapping is the empty string, as the internal and display character sets are fully in sync, but for both Windows Console CP-850 and DOS code page CP-437 the mappings should be set as follows:-

; CP-850 mapping setup 
set-char-mask "M" "\xC0\xB7\xC1\x90\xE0\x85\xE1\xA0" 
; CP-437 mapping setup 
set-char-mask "M" "\xC0A\xC1AAA\xE0\x85\xE1\xA0"

As all the CP-1252 characters in this subset have equivalents in CP-850, the mapping for Windows console is a simple 1-to-1 lossless character list. However, the missing capital A's in CP-437 cause problems. For the command change-buffer-charset(3) it is preferable for a mapping of `A to be given, otherwise the document being converted may become corrupted and unreadable. Therefore a mapping of `A to A is given to alleviate this problem; similarly 'A is also mapped to A, at the cost of some loss of information.

This leads to a further problem with the conversion of CP-437 back to CP-1252. If the mapping of the accented A's was left as just "\xC0A\xC1A", the last mapping ('A to A) would also be the back conversion for A, i.e. ALL A's would be converted back to 'A's. To solve this problem, a further seemingly pointless mapping of A to A is given to correct the back conversion.
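The CP-437 mapping string can be read pair by pair, as in the annotated fragment below. The comments only restate the reasoning given above for the example subset of characters; the call itself is the same one shown in the CP-437 mapping setup.

```emf
; Pair-by-pair reading of the CP-437 mapping (internal -> display):
;   \xC0 A       `A -> A    no `A in CP-437, lossy
;   \xC1 A       'A -> A    no 'A in CP-437, lossy
;   A    A       A  -> A    identity pair, given after the lossy pairs
;                           so that A converts back to A and not to 'A
;   \xE0 \x85    `a -> `a
;   \xE1 \xA0    'a -> 'a
set-char-mask "M" "\xC0A\xC1AAA\xE0\x85\xE1\xA0"
```

Without the identity pair, the last pair naming A as its display character would be 'A to A, which is the case described above where every A is converted back to 'A.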

While ISO-8859-1 (Latin 1) supports a very similar set of characters to CP-1252, it lacks some accented 'S', 'Y' & 'Z' characters, which must be mapped to their plain letter equivalents.

For languages which use accented characters, the alphabetic character set must be extended to include these characters for letter based commands like forward-word(2) and upper-case-word(2) to operate correctly. However, the letter set should be fully extended for each code page regardless of the language being used, as an 'a' acute should always be considered a letter even though it is unlikely to occur. The addition of extra letters must achieve two goals. The first is to define whether a character is a letter, enabling commands like forward-word to work correctly. The second is to provide an upper case to lower case character mapping, enabling commands like upper-case-word to work correctly. Both are achieved with a single call to set-char-mask using the a flag as follows:-

set-char-mask "a" "\xC0\xE0\xC1\xE1"

Note that this flag always expects an internal character set based string, this allows the same map character list to be used regardless of the display character set being used, i.e. the above line can be used for CP-1252, CP-850, CP-437 & ISO-8859-1 code pages. But it does mean that the internal to display character set mapping (flag M) must already have been provided.

Similar mapping problems are encountered with the a flag as with flag M above. The problem is not immediately obvious because the mapping is always given in the internal character set, which supports the widest set of characters, but when CP-437 is used the mapping string effectively becomes "A\x85A\xA0", since `A and 'A are both mapped to A. As can be seen, A is mapped last to 'a, so an upper to lower case operation will convert an A to 'a. A similar solution is used: a further mapping of A to a is given to correct the default case mapping for both A and a, i.e. the following line should always be used instead:-

set-char-mask "a" "\xC0\xE0\xC1\xE1Aa"
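Putting the pieces together, the ordering constraints described above suggest a setup sequence along the following lines. This is a sketch for a CP-437 display using only the example subset of characters, not a complete language setup; the final call simply reuses the user set 2 example from earlier in this section.

```emf
; 1. Character set map first and in a single call: setting flag M
;    resets all the other character tables
set-char-mask "M" "\xC0A\xC1AAA\xE0\x85\xE1\xA0"
; 2. Then the letters, always given in internal (CP-1252) codes,
;    ending with the A/a pair to correct the default case mapping
set-char-mask "a" "\xC0\xE0\xC1\xE1Aa"
; 3. Any other sets afterwards, e.g. the variable characters in user set 2
set-char-mask "2" "$#%:.-"
```

Keeping the M call first means the later calls are not undone by the table reset that setting the map triggers.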
