ctab(4) — File Formats

OSF

NAME

ctab − Locale character classification, case conversion, and collating input file

DESCRIPTION

A locale character classification, case conversion and collating input file consists of records separated by newline characters. Each record consists of one character or collation element in the locale, where a collation element is a sequence of two or more characters that collate as a single unit. These files are not directly accessed by user programs: the ctab command reads them to produce binary files loaded by the setlocale() function.

The ordering of the records determines the order of the locale’s characters. Records marked with the translate or ignore indicator (see KEYWORDS) do not reflect this ordering. The ordering of characters in a locale may also be referred to as their collation weights.

Several characters may have the same primary collation weight but different secondary weights. In French, for example, the plain and accented versions of the letter a all sort to the same primary location. If there is a tie between a plain and an accented character, however, a secondary sort is applied. A group of characters with the same primary collation value is said to belong to the same equivalence class. If a character is not part of an equivalence class, it has identical primary and secondary collation weights.
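The tie-breaking rule described above can be illustrated with a short Python sketch. It is not taken from the OSF implementation: the WEIGHTS table and the collation_key function are hypothetical, and the weights are assumed to follow record order (a, then two accented forms in the same equivalence class, then b and c).

```python
# Hypothetical (primary, secondary) weights. The accented forms share the
# primary weight of "a"; characters outside an equivalence class have
# identical primary and secondary weights.
WEIGHTS = {
    "a": (1, 1),
    "à": (1, 2),
    "â": (1, 3),
    "b": (4, 4),
    "c": (5, 5),
}


def collation_key(word):
    """Build a sort key that compares primary weights first and falls
    back to secondary weights only when the primaries tie."""
    primaries = tuple(WEIGHTS[ch][0] for ch in word)
    secondaries = tuple(WEIGHTS[ch][1] for ch in word)
    return (primaries, secondaries)


words = ["âb", "ac", "ab", "àb"]
ordered = sorted(words, key=collation_key)
```

Here "ab", "àb" and "âb" have equal primary keys, so their relative order is decided by the secondary weights, while "ac" sorts after all three on the primary key alone.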

Applications such as grep use this primary and secondary collation weight information to determine the order of strings.

The ctab input file describes the collating weights for an assumed code set and a particular language. If a character that does not appear in the ctab file for the current locale is encountered, its collation weight is based on its relative position in the current code set.

Records in the locale ctab input files have fields separated by a separator character. By default, this separator is a : (colon), but the user can change it; see KEYWORDS. The records have the following fields:

subject character
The subject character field is actually the collating element, which may consist of more than one character. If the subject character is a multicharacter collating element, the first character in the element must also be defined as a subject character elsewhere in the input file. If the character or collating element is followed by the equivalence class character, which is a ^ (circumflex) by default, it is given the same primary collating weight as the character represented by the preceding record. Its secondary collation weight is unique. Characters can be specified using octal escape sequences consisting of a \ (backslash) followed by one or more octal digits. Any backslash not followed by an octal digit is an escape character. The subject character field must be terminated by a separator character even if there are no other fields in the record.
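As an illustration of the escape rules for the subject character field, the following Python sketch decodes a field that has already been separated from the rest of the record. The function name decode_subject is hypothetical, and the limit of three octal digits per escape is an assumption (the text only says "one or more"). The translate and ignore indicators are not handled here.

```python
OCTAL_DIGITS = "01234567"


def decode_subject(field, equiv_char="^"):
    """Decode a subject character field (without its separator).

    Returns (element, joins_previous), where joins_previous is True when
    the field ends with an unescaped equivalence class character.
    """
    chars = []
    joins_previous = False
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            if field[i + 1] in OCTAL_DIGITS:
                # Assumption: an octal escape has at most three digits.
                j = i + 1
                while j < len(field) and j < i + 4 and field[j] in OCTAL_DIGITS:
                    j += 1
                chars.append(chr(int(field[i + 1:j], 8)))
                i = j
            else:
                # Backslash before a non-octal character: take it literally.
                chars.append(field[i + 1])
                i += 2
        elif ch == equiv_char and i == len(field) - 1:
            joins_previous = True
            i += 1
        else:
            chars.append(ch)
            i += 1
    return "".join(chars), joins_previous
```

For example, decode_subject(r"\340^") yields the character with octal code 340 and a flag saying that it shares the primary weight of the preceding record, while decode_subject(r"\^") yields a literal circumflex with the flag unset.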

case conversion
The case conversion field specifies the character that is the inverse case of the character in the first field. For example, if the first field is p, the second field is P. If the third field, the character classification field (see below), contains an l or L (for lowercase), the second field specifies the uppercase equivalent of the subject character. If the character classification field contains a u or U (for uppercase), the case conversion field specifies the lowercase equivalent of the subject character. Any character with a nonempty case conversion field can specify the corresponding uppercase or lowercase letter. Characters classified as alphabetic do not require a corresponding case; that is, the second field can be empty. The second field currently is not used for SJIS characters when Japanese Language Support is installed.

character classification
The character classification field values assume the following classes and values:

u or U
Uppercase letter

l or L
Lowercase letter

a or A
Alphabetic character

n or N
Digits

x or X
Hexadecimal digits

p or P
Punctuation characters

s or S
Whitespace characters

c or C
Control characters

g or G
Graphic

-
No type

Characters can belong to more than one character class, subject to certain rules. The difference between graphic and printable characters is that the set of graphic characters does not include the space character, but the set of printable characters does include the space character. The ASCII code set is predefined as follows:

A through Z
Uppercase letters

a through z
Lowercase letters

A through Z, and a through z
Alphabetic characters

0 through 9
Digits

Alphabetic characters and digits
Alphanumeric characters

0 through 9, A through F, and a through f
Hexadecimal digits

Any character below the Space character, and the Delete character
Control characters

Space, formfeed, newline, carriage-return, horizontal tab, and vertical tab
Whitespace characters

Any character except the above
Punctuation characters

Characters not defined as alphabetic are automatically defined as punctuation.
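The predefined ASCII classification listed above can be restated as a small Python function. This is only a sketch of the table, not the code used by ctab: the name classify_ascii is hypothetical, the class letters are the lowercase forms from the character classification field, and the graphic class is left out because the table does not define it.

```python
def classify_ascii(code):
    """Return the set of class letters for a 7-bit ASCII code,
    following the predefined ASCII table in this section."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    ch = chr(code)
    classes = set()
    if "A" <= ch <= "Z":
        classes |= {"u", "a"}   # uppercase, alphabetic
    if "a" <= ch <= "z":
        classes |= {"l", "a"}   # lowercase, alphabetic
    if "0" <= ch <= "9":
        classes.add("n")        # digit
    if ch in "0123456789ABCDEFabcdef":
        classes.add("x")        # hexadecimal digit
    if code < 0x20 or code == 0x7F:
        classes.add("c")        # below Space, or Delete
    if ch in " \f\n\r\t\v":
        classes.add("s")        # whitespace
    if not classes:
        classes.add("p")        # anything not covered above
    return classes
```

Note that a character such as horizontal tab falls into two classes (control and whitespace), which matches the statement that characters can belong to more than one class.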

KEYWORDS

A line beginning with the word "option" serves to change one or more of the default conditions or metacharacters built into the collating table. The word "option" is followed by one or more keyword/value pairs. Keywords and values are separated by tab or space characters. The following keywords are recognized:

comment
Uses the assigned value as the comment character. The default value is the # (number sign). Anything on a line that follows the comment character is ignored.

sep
Uses the assigned value as the field separator character. The default value is a : (colon). Tabs or spaces can surround fields or separators.

ignore
Uses the assigned value as the ignore character indicator. The default value is the @ (at sign). A character marked with the ignore indicator is ignored for collation purposes.

repeat
Uses the assigned value as the equivalence class indicator. The default value is the ^ (circumflex) character. A character marked with the equivalence class indicator has the same primary collation value as the preceding character.

trans
Uses the assigned character as the translate indicator. The default value is the | (vertical bar). A collation element marked with the translate indicator is translated to the collation element(s) following the indicator. For example, to treat the German eszett (ß) as the two characters ss, the first field of the line would be:

\337|ss:

The unique collation weight of a translated character is used in regular expressions (see grep). Characters being translated cannot be followed by an equivalence class indicator. The subject character cannot be contained in its own substitution collation element(s); for example, o|oe is not allowed. The translation mechanism completes in one pass: none of the characters in the substitution collation elements can in turn be the subject of further translation, so the following example is illegal:

q|r:
x|pq:

Characters being translated have no primary collating weight of their own, but each has a unique collation weight, which is based on the position of its line in the input file.
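The two restrictions on translation (a subject cannot appear in its own substitution, and no substitution can contain another translated subject) can be checked mechanically. The Python sketch below assumes the translate records have already been parsed into a mapping from subject to substitution string; the function name check_translations and the message wording are hypothetical, and the substring test is only an approximation when multicharacter collating elements are involved.

```python
def check_translations(rules):
    """Check a {subject: substitution} mapping against the one-pass
    translation rules and return a list of error messages."""
    errors = []
    for subject, substitution in rules.items():
        if subject in substitution:
            errors.append(
                f"{subject!r} appears in its own substitution {substitution!r}"
            )
        for other in rules:
            if other != subject and other in substitution:
                errors.append(
                    f"substitution {substitution!r} for {subject!r} "
                    f"contains translated subject {other!r}"
                )
    return errors


ok = check_translations({"\xdf": "ss"})            # eszett translated to ss
illegal = check_translations({"q": "r", "x": "pq"})
self_ref = check_translations({"o": "oe"})
```

The first mapping passes, the second is rejected because q is itself translated, and the third is rejected because o occurs in its own substitution.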

EXAMPLES

The following line is interpreted as a field containing a backslash and a colon followed by a field separator:

\\\::
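To make the escaping in this example concrete, the following Python sketch splits a record into fields while leaving backslash escape pairs intact, so that an escaped separator does not end a field. The function name split_fields is hypothetical; comment stripping and trimming of tabs or spaces around fields are deliberately left out.

```python
def split_fields(record, sep=":"):
    """Split one record into raw fields on unescaped separators.

    Escape pairs (a backslash plus the next character) are copied through
    unchanged so that a later step can decode them.
    """
    fields = []
    current = []
    i = 0
    while i < len(record):
        ch = record[i]
        if ch == "\\" and i + 1 < len(record):
            current.append(record[i:i + 2])  # keep the escape pair together
            i += 2
        elif ch == sep:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


example = split_fields(r"\\\::")
sample = split_fields(r"\177::c")
```

For the record above, the first raw field is the escaped backslash followed by the escaped colon, and the final colon acts as the separator, leaving an empty second field. A record that ends with a separator always produces a trailing empty field in this sketch.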

Here are the first and last three lines of a sample C.ctab file:

\000:
\001:
\002:
}:
~:
\177::c

FILES

/usr/lib/nls/loc/<locale>
Binary character classification, case conversion and collating output file for locale <locale>.

/etc/nls/loc/<locale>
Binary locale classification, case conversion and collating output file. This is only used as a default during single-user mode operation.

RELATED INFORMATION

Commands: ctab(1)

Functions: setlocale(3)

OSF/1 User’s Guide
