&uni(4)

NAME

&uni, &uci, &ufi - Miscellaneous functions

SYNOPSIS

&uni int
&uci char
&ufi string

DESCRIPTION

These three functions provide information about Unicode character support within the current MicroEmacs character-set.

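&uni int

int is the Unicode value (code point) of the character to look up.

&uci char

char is a character in the current MicroEmacs display character-set.

&ufi string

string is a UTF-8 encoded string; the character at the start of the string is parsed.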

All three functions return a MicroEmacs formatted list of information in the following form:

"|flags|Unicode|char|utf8-len|utf8|"

flags is a bitwise flag: bit 1 (value 1) indicates whether the given input was a valid Unicode character (see NOTES below), and bit 2 (value 2) indicates whether the input is supported in the current MicroEmacs display character-set.

Unicode is the hexadecimal value representing the input or zero if the input cannot be mapped to a valid Unicode character (bit 1 of flags is clear). The value is always 6 digits long with a 0x hexadecimal number prefix.

char is the MicroEmacs display character-set equivalent of the input or the character 0x07 if the input is not supported (bit 2 of flags is clear).

In the case of &ufi, utf8-len is set to the length of string parsed, which may be less than the total length of string; otherwise utf8-len is the string length of utf8. Note that if the input character is 0x00 (the Nul character) the utf8 string is "" (the empty string, 0 in length) but the utf8-len is

utf8 is the UTF-8 equivalent of the input or the character string 0xEF 0xBF 0xBD (U+FFFD - replacement char) if the input to &uni or &uci cannot be mapped to a valid Unicode character.
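
The field layout described above can be illustrated outside MicroEmacs. The following is a minimal Python sketch, not MicroEmacs macro code, that splits a list of this form into named fields and decodes the two flag bits. It assumes the list divider is '|' as shown in the form above and that flags and utf8-len are decimal numbers; the names parse_uni_list and UniInfo are hypothetical, and the sample string is hand-written for the letter 'A' rather than captured MicroEmacs output.

```python
from typing import NamedTuple


class UniInfo(NamedTuple):
    valid: bool      # bit 1 of flags: input was a valid Unicode character
    supported: bool  # bit 2 of flags: input is in the display character-set
    codepoint: int   # numeric value of the Unicode field
    char: str        # display character-set equivalent
    utf8_len: int    # utf8-len field
    utf8: str        # utf8 field


def parse_uni_list(result: str) -> UniInfo:
    """Split a "|flags|Unicode|char|utf8-len|utf8|" list into named fields."""
    # Assumption: '|' is the divider, as in the documented form.
    parts = result.split("|")
    # A well-formed list gives an empty string before the first
    # divider and another after the last one.
    if len(parts) != 7 or parts[0] != "" or parts[-1] != "":
        raise ValueError(f"unexpected list shape: {result!r}")
    flags = int(parts[1])  # assumption: decimal
    return UniInfo(
        valid=bool(flags & 1),
        supported=bool(flags & 2),
        codepoint=int(parts[2], 16),  # accepts the 0x prefix
        char=parts[3],
        utf8_len=int(parts[4]),       # assumption: decimal
        utf8=parts[5],
    )


# Hand-written sample for 'A' (hypothetical, not real MicroEmacs output).
info = parse_uni_list("|3|0x000041|A|1|A|")
print(info.valid, info.supported, hex(info.codepoint))
```

The sketch does not handle the case where the char or utf8 field is itself the divider character.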

EXAMPLE

The following macro code can be used to convert the UTF-8 encoded character at the current location to its equivalent in the current display character set:

; Parse up to 4 bytes (the longest UTF-8 sequence) at the cursor
set-variable #l0 &ufi &mid @wl $window-col 4
!if &not &band 1 &lget #l0 1
  ml-write "Error: String at current location is not a valid UTF-8 char"
!elif &not &band 2 &lget #l0 1
  ml-write "Error: UTF-8 character not supported in current display charset"
!else
  ; Delete the utf8-len bytes parsed and insert the display charset equivalent
  &lget #l0 4 forward-delete-char
  insert-string &lget #l0 3
!endif
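
The macro passes 4 bytes to &ufi because no UTF-8 sequence is longer than that, and then deletes only the utf8-len bytes actually parsed. The Python sketch below models that idea outside MicroEmacs: it decodes the single UTF-8 character at the start of a byte string and reports how many bytes it consumed. The function name first_utf8_char is hypothetical, and the code relies on Python's own UTF-8 codec, so it is an illustration of the behaviour described here and not of how MicroEmacs implements &ufi.

```python
from typing import Tuple


def first_utf8_char(data: bytes) -> Tuple[str, int]:
    """Return (character, bytes consumed) for the UTF-8 char starting data."""
    if not data:
        raise ValueError("empty input")
    # Try the shortest prefix first; a valid character is 1 to 4 bytes long.
    for length in range(1, 5):
        try:
            return data[:length].decode("utf-8"), length
        except UnicodeDecodeError:
            continue
    raise ValueError("no valid UTF-8 character at start of data")


# A 4-byte window holding the 3-byte euro sign followed by 'x'.
window = b"\xe2\x82\xacxyz"[:4]
char, consumed = first_utf8_char(window)
print(repr(char), consumed)
```

Only the first character is consumed, so the count returned can be smaller than the length of the window, in the same way that utf8-len can be smaller than the length of the string given to &ufi.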

NOTES

In 2003 the Unicode character range was officially restricted to 0x0000-0x10ffff (see RFC 3629). This means that a 6 digit hexadecimal number can represent every possible Unicode character and that the longest UTF-8 string is 4 bytes (0xF4 0x8F 0xBF 0xBD, U+10FFFD, is the highest assigned character, which is for private use). Currently the highest assigned non-private character is U+E01EF, so 5 digits cover all publicly used characters.
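
The byte sequences and digit counts quoted in this section can be checked with a short Python sketch. It uses Python's built-in UTF-8 codec and is independent of MicroEmacs; the helper names utf8_bytes and show are hypothetical.

```python
def utf8_bytes(codepoint: int) -> bytes:
    """Encode one Unicode code point as UTF-8 with Python's built-in codec."""
    return chr(codepoint).encode("utf-8")


def show(data: bytes) -> str:
    """Format bytes in the 0xNN style used in this page."""
    return " ".join(f"0x{b:02X}" for b in data)


print(show(utf8_bytes(0x10FFFD)))    # highest assigned (private use) character
print(show(utf8_bytes(0xFFFD)))      # replacement character
print(len(utf8_bytes(0x10FFFF)))     # length at the top of the range
print(len(format(0x10FFFF, "x")))    # hex digits needed for the full range
print(len(format(0xE01EF, "x")))     # hex digits needed for U+E01EF
```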

The mapping to and from the current MicroEmacs display character-set is configured by flag c of command set-char-mask(2).