ICONV(1V) — UNIX Programmer’s Manual

NAME

iconv − codeset conversion

SYNOPSIS

iconv −f fromcode −t tocode [file]

iconv −l

DESCRIPTION

The iconv utility converts the encoding of the characters in file from one codeset to another and writes the results to standard output. The input and output codesets are identified by fromcode and tocode respectively. If no file argument is specified on the command line, iconv reads the standard input.

Character encodings in either codeset may include single-byte values (e.g., for ISO 8859−1:1987 characters) or multi-byte values (e.g., for certain characters in ISO 6937/2:1983). If an invalid character is found in the input stream, the invalid code is reported and the default character (question mark) is output in its place.

OPTIONS

−l lists the supported character encodings, any of which can be used as the parameters fromcode and tocode.

EXAMPLE

The following example converts the contents of file mail.x400 from codeset ISO 6937/2:1983 to ISO 8859-1:1987, and stores the results in file mail.local.

iconv -f 6937_2 -t 8859-1 mail.x400 > mail.local

DATA STRUCTURE FORMAT

For each codeset supported by iconv, there is a structure which provides mappings from the codeset to an internal meta-character set. This meta-character set is based on the ISO-LATIN series of codesets ISO LATIN/1 to ISO LATIN/7. Any character not in any of these codesets is assigned an arbitrary meta-character.

The structure defining a given codeset contains three fields:

cs_name codeset name, as output by using the −l option.

cs_primary list of primary mappings, one for each character. The list is indexed by the character (and is therefore always 256 bytes long). Each entry takes one of three forms:

a) value MS_UD − this character does not exist in the codeset.

b) value MS_ESC − this character is possibly the first of a two-character sequence; cs_secondary is used to determine the meta-character.

c) anything else − the value given is the meta-character associated with the input character.

cs_secondary an array of arrays, one for each of the possible two-character sequences identified in cs_primary. The rules governing the format of these arrays are as follows:

n = number of members (pairs) in an array; members are indexed from 0 to m (m = n - 1); i is an arbitrary index.

a) the length of each array is always 2 ∗ n; for 0 < i < m, array[(i ∗ 2)] contains the secondary character of the two-character sequence and array[(i ∗ 2) + 1] contains the meta-character associated with the sequence.

b) array[0] contains the primary character of the two-character sequence, and array[1] contains the meta-character for the primary character if it also occurs as a single character. If the character following the primary in the input stream is not found in the list of known secondary characters, the primary is assumed to be a single character. If the primary never exists as a single character, array[1] should be MS_UD.

c) the last element in the array must mark the end of the list, i.e. array[(m ∗ 2)] can be anything but array[(m ∗ 2) + 1] must contain MS_EOL.

If no two-character mappings exist then cs_secondary may take a value of NULL.
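The lookup implied by the rules above can be sketched in C as follows. This is an illustration only, not the actual iconv source. The element type meta_t, the numeric values of MS_UD, MS_ESC and MS_EOL, the NULL terminator on the outer cs_secondary list, and the names to_meta, toy_c2 and toy_codeset are all assumptions. The manual names the three fields and the three markers but gives none of these details. The toy table and its meta-character values are invented.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed element type and marker values; the manual names the markers only. */
typedef unsigned short meta_t;
#define MS_UD  0xFFFFu  /* character does not exist in the codeset */
#define MS_ESC 0xFFFEu  /* possible first character of a two-character sequence */
#define MS_EOL 0xFFFDu  /* end-of-list marker in a secondary array */

struct codeset {
    const char *cs_name;                /* name as listed by the -l option */
    const meta_t *cs_primary;           /* 256 entries, indexed by input character */
    const meta_t *const *cs_secondary;  /* assumed NULL-terminated; may be NULL */
};

/*
 * Map the character starting at in[0] to a meta-character and store the
 * number of input characters consumed (1 or 2) in *used.
 * Assumes each secondary array holds at least two pairs (n >= 2).
 */
static meta_t to_meta(const struct codeset *cs, const unsigned char *in,
                      size_t len, size_t *used)
{
    assert(len > 0);
    meta_t m = cs->cs_primary[in[0]];
    *used = 1;
    if (m != MS_ESC)
        return m;                       /* MS_UD or a direct meta-character */
    if (cs->cs_secondary == NULL)
        return MS_UD;
    for (const meta_t *const *pp = cs->cs_secondary; *pp != NULL; pp++) {
        const meta_t *a = *pp;
        if (a[0] != in[0])
            continue;                   /* array belongs to another primary */
        if (len > 1) {
            /* Pairs 1..m-1 hold (secondary, meta); pair m carries MS_EOL. */
            for (size_t i = 1; a[2 * i + 1] != MS_EOL; i++) {
                if (a[2 * i] == in[1]) {
                    *used = 2;
                    return a[2 * i + 1];
                }
            }
        }
        return a[1];                    /* primary alone; may be MS_UD */
    }
    return MS_UD;
}

/* Toy data: 0xC2 may be followed by 'e' or 'a'; meta values are made up. */
static const meta_t toy_c2[] = {
    0xC2, 0x00B4,   /* pair 0: primary character, meta-character when alone */
    'e',  0x00E9,   /* pair 1: secondary 'e' */
    'a',  0x00E1,   /* pair 2: secondary 'a' */
    0,    MS_EOL    /* pair m: end of list */
};
static const meta_t *const toy_secondary[] = { toy_c2, NULL };
static meta_t toy_primary[256];

/* Build the toy codeset: 0..127 map to themselves, 0xC2 is a prefix. */
static struct codeset toy_codeset(void)
{
    for (int c = 0; c < 256; c++)
        toy_primary[c] = (c < 128) ? (meta_t)c : MS_UD;
    toy_primary[0xC2] = MS_ESC;
    struct codeset cs = { "toy", toy_primary, toy_secondary };
    return cs;
}
```

With this toy table, calling to_meta on the sequence 0xC2 'e' consumes two characters and yields the meta-character stored in pair 1, while 0xC2 followed by an unlisted character consumes one character and yields the value in array[1].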

System V