&uni(4)
&uni, &uci, &ufi - Miscellaneous functions
&uni int
&uci char
&ufi string
These three functions provide information about Unicode character support within the current MicroEmacs character-set.
All three functions return a MicroEmacs formatted list of information in the following form:
"|flags|Unicode|char|utf8-len|utf8|"
flags is a bitwise flag in which bit 1 indicates whether the given input is a valid Unicode character (see the notes below) and bit 2 indicates whether the input is supported in the current MicroEmacs display character-set.
Unicode is the hexadecimal value representing the input or zero if the input cannot be mapped to a valid Unicode character (bit 1 of flags is clear). The value is always 6 digits long with a 0x hexadecimal number prefix.
char is the MicroEmacs display character-set equivalent of the input or the character 0x07 if the input is not supported (bit 2 of flags is clear).
In the case of &ufi, utf8-len is set to the length of string parsed, which may be less than the total length of string; otherwise utf8-len is the string length of utf8. Note that if the input character is 0x00 (the Nul character) the utf8 string is "" (the empty string, 0 in length) but the utf8-len is
utf8 is the UTF-8 equivalent of the input or the character string 0xEF 0xBF 0xBD (U+FFFD - replacement char) if the input to &uni or &uci cannot be mapped to a valid Unicode character.
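As a minimal sketch of reading these fields, the fragment below queries one code point with &uni and reports the outcome on the message line. The input 0x20ac (the Euro sign) is chosen only for illustration, and the fragment assumes that integer arguments may be written in 0x hexadecimal form.

```emf
; Sketch only: 0x20ac (the Euro sign) is an illustrative input.
set-variable #l0 &uni 0x20ac
!if &not &band 1 &lget #l0 1
    ; bit 1 of flags is clear: the input is not a valid Unicode character
    ml-write "Not a valid Unicode character"
!elif &not &band 2 &lget #l0 1
    ; bit 2 of flags is clear: no equivalent in the display character-set,
    ; so report the Unicode value (field 2) instead
    ml-write &cat "No display character for " &lget #l0 2
!else
    ; field 3 holds the display character-set equivalent
    ml-write &cat "Display character: " &lget #l0 3
!endif
```

Testing the two flag bits before using field 3 avoids mistaking the 0x07 placeholder for a real character.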
The following macro code can be used to convert the UTF-8 encoded character string at the current location to the current display character-set equivalent:
    set-variable #l0 &ufi &mid @wl $window-col 4
    !if &not &band 1 &lget #l0 1
        ml-write "Error: String at current location is not a valid UTF-8 char"
    !elif &not &band 2 &lget #l0 1
        ml-write "Error: UTF-8 character not supported in current display charset"
    !else
        &lget #l0 4 forward-delete-char
        insert-string &lget #l0 3
    !endif
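The reverse direction can be sketched in the same style, replacing the display character at the cursor with its UTF-8 byte string. This is an untested illustration and not part of MicroEmacs: the macro name char-to-utf8 is hypothetical, and the sketch assumes that @wc yields the character at the cursor as a string suitable for &uci.

```emf
; Hypothetical macro name; a sketch under the assumptions stated above.
define-macro char-to-utf8
    ; look up the character under the cursor
    set-variable #l0 &uci @wc
    !if &not &band 1 &lget #l0 1
        ; bit 1 of flags is clear: no valid Unicode mapping exists
        ml-write "Error: Character has no Unicode equivalent"
    !else
        ; replace the single display character with field 5 (utf8)
        forward-delete-char
        insert-string &lget #l0 5
    !endif
!emacro
```

The sketch does not treat the Nul character specially, for which the utf8 field is the empty string.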
In 2003 the Unicode character range was officially restricted to 0x0000-0x10ffff (see RFC 3629), which means that a 6 digit hexadecimal number can represent every possible Unicode character and the longest UTF-8 string is 4 bytes (0xF4 0x8F 0xBF 0xBD, that is U+10FFFD, is the highest assigned character, which is for private use). Currently the highest assigned non-private character is U+E01EF, so 5 digits cover all publicly used characters.
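These limits can be probed from a macro. The sketch below assumes that 0x hexadecimal integer arguments are accepted and that values above 0x10ffff are reported as invalid; the second assumption follows from the range stated in this note but has not been verified against MicroEmacs itself.

```emf
; U+10FFFD is the highest assigned character and its UTF-8 form is
; 4 bytes long, so field 4 (utf8-len) is expected to be 4.
set-variable #l0 &uni 0x10fffd
ml-write &cat "utf8-len for 0x10fffd: " &lget #l0 4

; 0x110000 lies outside the Unicode range, so bit 1 of flags is expected
; to be clear and field 2 (Unicode) to be zero.
set-variable #l1 &uni 0x110000
!if &not &band 1 &lget #l1 1
    ml-write &cat "0x110000 rejected, Unicode field: " &lget #l1 2
!endif
```

Each ml-write replaces the previous message, so the two halves are best run separately when checking the output by eye.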
The mapping to and from the current MicroEmacs display character-set is configured by flag c of command set-char-mask(2).
In MicroEmacs character 0x07 is used to represent an unmappable character; this is better than using a character like '?', which is in common use. If extended character rendering is being used (see bit 0x10000 of $system(5)) the character is drawn as a crossed diamond, an attempt to represent the Unicode replacement character.
(c) Copyright JASSPA 2025
Last Modified: 2024/05/09
Generated On: 2025/09/29