@node Iconv
 | 
						|
@chapter Encoding conversions (@file{iconv.h})
 | 
						|
 | 
						|
This chapter describes the Newlib iconv library.
 | 
						|
The iconv function declarations are in
 | 
						|
@file{iconv.h}.
 | 
						|
 | 
						|
@menu
 | 
						|
* iconv::                           Encoding conversion routines
 | 
						|
* Introduction to iconv::           Introduction to iconv and encodings
 | 
						|
* Supported encodings::             The list of currently supported encodings
 | 
						|
* iconv design decisions::          General iconv library design issues
 | 
						|
* iconv configuration::             iconv-related configure script options
 | 
						|
* Encoding names::                  How encodings are named.
 | 
						|
* CCS tables::                      CCS tables format and 'mktbl.pl' Perl script
 | 
						|
* CES converters::                  CES converters description
 | 
						|
* The encodings description file::  The 'encoding.deps' file and 'mkdeps.pl'
 | 
						|
* How to add new encoding::         The steps to add new encoding support
 | 
						|
* The locale support interfaces::   Locale-related iconv interfaces
 | 
						|
* Contact::                         The author contact
 | 
						|
@end menu
 | 
						|
 | 
						|
@page
 | 
						|
@include iconv/iconv.def
 | 
						|
 | 
						|
@page
 | 
						|
@node Introduction to iconv
 | 
						|
@section Introduction to iconv
 | 
						|
@findex encoding
 | 
						|
@findex character set
 | 
						|
@findex charset
 | 
						|
@findex CES
 | 
						|
@findex CCS
 | 
						|
@*
 | 
						|
The iconv library is intended to convert characters from one encoding to
 | 
						|
another. It implements iconv(), iconv_open() and iconv_close()
 | 
						|
calls, which are defined by the Single Unix Specification.
 | 
						|
 | 
						|
@*
 | 
						|
In addition to these user-level interfaces, the iconv library also has
several useful interfaces which are needed to support coding
capabilities of the Newlib Locale infrastructure.  Since Locale
support also needs to
convert various character sets to and from the @emph{wide character
set}, the iconv library shares its capabilities with the Newlib Locale
subsystem. Moreover, the iconv library supports several features which are
only needed for the Locale infrastructure (for example, the MB_CUR_MAX value).
 | 
						|
 | 
						|
@*
 | 
						|
The Newlib iconv library was created using concepts from another iconv
 | 
						|
library implemented by Konstantin Chuguev (ver 2.0). The Newlib iconv library
 | 
						|
was rewritten from scratch and contains a lot of improvements with respect to
 | 
						|
the original iconv library. 
 | 
						|
 | 
						|
@*
 | 
						|
Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
 | 
						|
are often used with various meanings. The following are the definitions of terms
 | 
						|
which are used in this documentation as well as in the iconv library
 | 
						|
implementation:
 | 
						|
 | 
						|
@itemize @bullet
 | 
						|
@item
 | 
						|
@dfn{encoding} - a machine representation of characters by means of bits;
 | 
						|
 | 
						|
@item
 | 
						|
@dfn{Character Set} or @dfn{Charset} - just a collection of
 | 
						|
characters, i.e. the encoding is the machine representation of the character set; 
 | 
						|
 | 
						|
@item
 | 
						|
@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
set of integers (@dfn{character codes});
 | 
						|
 | 
						|
@item
 | 
						|
@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
 | 
						|
codes to a sequence of bytes;
 | 
						|
@end itemize
 | 
						|
 | 
						|
@*
 | 
						|
Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
 | 
						|
ASCII, etc. Encodings are formed by the following chain of steps:
 | 
						|
 | 
						|
@enumerate
 | 
						|
@item
 | 
						|
The user has a set of characters which are specific to his or her language (a character set).
 | 
						|
 | 
						|
@item
 | 
						|
Each character from this set is uniquely numbered, resulting in a CCS.
 | 
						|
 | 
						|
@item
 | 
						|
Each number from the CCS is converted to a sequence of bits or bytes by means
of a CES, forming an encoding. Thus, a CES may be considered as a
function of a CCS which produces an encoding. Note that a CES may be
applied to more than one CCS.
 | 
						|
@end enumerate
 | 
						|
 | 
						|
@*
 | 
						|
Thus, an encoding may be considered as one or more CCS + CES.
 | 
						|
 | 
						|
@*
 | 
						|
Sometimes there is no CES, and in such cases the encoding is equivalent
to the CCS, e.g. KOI8-R or ASCII.
 | 
						|
 | 
						|
@*
 | 
						|
An example of a more complicated encoding is UTF-8 which is the UCS
 | 
						|
(or Unicode) CCS plus the UTF-8 CES.
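@*
The difference between a CCS and a CES can be illustrated with a short
Python sketch. It uses Python's built-in codecs rather than the Newlib iconv
library, so it only illustrates the terminology; the helper names
@code{ccs_code} and @code{ces_bytes} are hypothetical.

```python
# Illustration only: uses Python's built-in codecs, not Newlib iconv.

def ccs_code(ch):
    """CCS step: map a character to its integer code in the UCS (Unicode)."""
    return ord(ch)

def ces_bytes(code, ces="utf-8"):
    """CES step: map a UCS character code to a sequence of bytes."""
    return chr(code).encode(ces)

ch = "\u0416"                          # CYRILLIC CAPITAL LETTER ZHE
code = ccs_code(ch)                    # its UCS character code
utf8 = ces_bytes(code, "utf-8")        # UCS CCS + UTF-8 CES
utf16 = ces_bytes(code, "utf-16-be")   # same CCS, different CES

# KOI8-R has no separate CES: the one-byte character code is the encoding.
koi8 = ch.encode("koi8_r")

print(hex(code), utf8.hex(), utf16.hex(), koi8.hex())
```

The same character code yields different byte sequences under different
CES-es, while in KOI8-R the single byte is the character code itself.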
 | 
						|
 | 
						|
@*
 | 
						|
The following is a brief list of iconv library features:
 | 
						|
@itemize
 | 
						|
@item
 | 
						|
Generic architecture;
 | 
						|
@item
 | 
						|
Locale infrastructure support;
 | 
						|
@item
 | 
						|
Automatic generation of the program code which handles
 | 
						|
CES/CCS/Encoding/Names/Aliases dependencies;
 | 
						|
@item
 | 
						|
The ability to choose size- or speed-optimized
 | 
						|
configuration;
 | 
						|
@item
 | 
						|
The ability to exclude a lot of unneeded code and data from the linking step.
 | 
						|
@end itemize
 | 
						|
 | 
						|
 | 
						|
 | 
						|
 | 
						|
@page
 | 
						|
@node Supported encodings
 | 
						|
@section Supported encodings
 | 
						|
@findex big5
 | 
						|
@findex cp775
 | 
						|
@findex cp850
 | 
						|
@findex cp852
 | 
						|
@findex cp855
 | 
						|
@findex cp866
 | 
						|
@findex euc_jp
 | 
						|
@findex euc_kr
 | 
						|
@findex euc_tw
 | 
						|
@findex iso_8859_1
 | 
						|
@findex iso_8859_10
 | 
						|
@findex iso_8859_11
 | 
						|
@findex iso_8859_13
 | 
						|
@findex iso_8859_14
 | 
						|
@findex iso_8859_15
 | 
						|
@findex iso_8859_2
 | 
						|
@findex iso_8859_3
 | 
						|
@findex iso_8859_4
 | 
						|
@findex iso_8859_5
 | 
						|
@findex iso_8859_6
 | 
						|
@findex iso_8859_7
 | 
						|
@findex iso_8859_8
 | 
						|
@findex iso_8859_9
 | 
						|
@findex iso_ir_111
 | 
						|
@findex koi8_r
 | 
						|
@findex koi8_ru
 | 
						|
@findex koi8_u
 | 
						|
@findex koi8_uni
 | 
						|
@findex ucs_2
 | 
						|
@findex ucs_2_internal
 | 
						|
@findex ucs_2be
 | 
						|
@findex ucs_2le
 | 
						|
@findex ucs_4
 | 
						|
@findex ucs_4_internal
 | 
						|
@findex ucs_4be
 | 
						|
@findex ucs_4le
 | 
						|
@findex us_ascii
 | 
						|
@findex utf_16
 | 
						|
@findex utf_16be
 | 
						|
@findex utf_16le
 | 
						|
@findex utf_8
 | 
						|
@findex win_1250
 | 
						|
@findex win_1251
 | 
						|
@findex win_1252
 | 
						|
@findex win_1253
 | 
						|
@findex win_1254
 | 
						|
@findex win_1255
 | 
						|
@findex win_1256
 | 
						|
@findex win_1257
 | 
						|
@findex win_1258
 | 
						|
@*
 | 
						|
The following is the list of currently supported encodings. The first column
 | 
						|
corresponds to the encoding name, the second column is the list of aliases,
 | 
						|
the third column gives the names of its CES and CCS components, and the fourth column
 | 
						|
is a short description.
 | 
						|
 | 
						|
@multitable @columnfractions .20 .26 .24 .30
 | 
						|
@item
 | 
						|
Name
 | 
						|
@tab
 | 
						|
Aliases
 | 
						|
@tab
 | 
						|
CES/CCS
 | 
						|
@tab
 | 
						|
Short description
 | 
						|
@item
 | 
						|
@tab
 | 
						|
@tab
 | 
						|
@tab
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
big5
 | 
						|
@tab
 | 
						|
csbig5, big_five, bigfive, cn_big5, cp950
 | 
						|
@tab
 | 
						|
table_pcs / big5, us_ascii 
 | 
						|
@tab
 | 
						|
The encoding for the Traditional Chinese.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
cp775
 | 
						|
@tab
 | 
						|
ibm775, cspc775baltic
 | 
						|
@tab
 | 
						|
table / cp775
 | 
						|
@tab
 | 
						|
The updated version of CP 437 that supports the Baltic languages.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
cp850
 | 
						|
@tab
 | 
						|
ibm850, 850, cspc850multilingual
 | 
						|
@tab
 | 
						|
table / cp850
 | 
						|
@tab
 | 
						|
IBM 850 - the updated version of CP 437 where several Latin 1 characters have been
added instead of some less often used characters like the line-drawing
and the Greek ones.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
cp852
 | 
						|
@tab
 | 
						|
ibm852, 852, cspcp852
 | 
						|
@tab
table / cp852
 | 
						|
@tab
 | 
						|
IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added
instead of some less often used characters like the line-drawing and the Greek ones.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
cp855
 | 
						|
@tab
 | 
						|
ibm855, 855, csibm855
 | 
						|
@tab
 | 
						|
table / cp855
 | 
						|
@tab
 | 
						|
IBM 855 - the updated version of CP 437 that supports Cyrillic.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
cp866
 | 
						|
@tab
 | 
						|
866, IBM866, CSIBM866
 | 
						|
@tab
 | 
						|
table / cp866
 | 
						|
@tab
 | 
						|
IBM 866 - the updated version of CP 855 that follows the logical Russian alphabet
ordering more closely; this alternative variant is preferred by many Russian users.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
euc_jp
 | 
						|
@tab
 | 
						|
eucjp
 | 
						|
@tab
 | 
						|
euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
 | 
						|
@tab
 | 
						|
EUC-JP - The EUC for Japanese.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
euc_kr
 | 
						|
@tab
 | 
						|
euckr
 | 
						|
@tab
 | 
						|
euc / ksx1001
 | 
						|
@tab
 | 
						|
EUC-KR - The EUC for Korean.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
euc_tw
 | 
						|
@tab
 | 
						|
euctw
 | 
						|
@tab
 | 
						|
euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
 | 
						|
@tab
 | 
						|
EUC-TW - The EUC for Traditional Chinese.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_1
 | 
						|
@tab
 | 
						|
iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
 | 
						|
@tab
 | 
						|
table / iso_8859_1
 | 
						|
@tab
 | 
						|
ISO 8859-1:1987 - Latin 1, West European.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_10
 | 
						|
@tab
 | 
						|
iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
 | 
						|
@tab
 | 
						|
table / iso_8859_10
 | 
						|
@tab
 | 
						|
ISO 8859-10:1992 - Latin 6, Nordic.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_11
 | 
						|
@tab
 | 
						|
iso8859_11, iso885911
 | 
						|
@tab
 | 
						|
table / iso_8859_11
 | 
						|
@tab
 | 
						|
ISO 8859-11 - Thai.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_13
 | 
						|
@tab
 | 
						|
iso_8859_13:1998, iso8859_13, iso885913
 | 
						|
@tab
 | 
						|
table / iso_8859_13
 | 
						|
@tab
 | 
						|
ISO 8859-13:1998 - Latin 7, Baltic Rim.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_14
 | 
						|
@tab
 | 
						|
iso_8859_14:1998, iso885914, iso8859_14
 | 
						|
@tab
 | 
						|
table / iso_8859_14
 | 
						|
@tab
 | 
						|
ISO 8859-14:1998 - Latin 8, Celtic.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_15
 | 
						|
@tab
 | 
						|
iso885915, iso_8859_15:1998, iso8859_15
 | 
						|
@tab
 | 
						|
table / iso_8859_15
 | 
						|
@tab
 | 
						|
ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_2
 | 
						|
@tab
 | 
						|
iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
 | 
						|
@tab
 | 
						|
table / iso_8859_2
 | 
						|
@tab
 | 
						|
ISO 8859-2:1987 - Latin 2, East European.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_3
 | 
						|
@tab
 | 
						|
iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
 | 
						|
@tab
 | 
						|
table / iso_8859_3
 | 
						|
@tab
 | 
						|
ISO 8859-3:1988 - Latin 3, South European.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_4
 | 
						|
@tab
 | 
						|
iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
 | 
						|
@tab
 | 
						|
table / iso_8859_4
 | 
						|
@tab
 | 
						|
ISO 8859-4:1988 - Latin 4, North European.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_5
 | 
						|
@tab
 | 
						|
iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
 | 
						|
@tab
 | 
						|
table / iso_8859_5
 | 
						|
@tab
 | 
						|
ISO 8859-5:1988 - Cyrillic.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_6
 | 
						|
@tab
 | 
						|
iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
 | 
						|
@tab
 | 
						|
table / iso_8859_6
 | 
						|
@tab
 | 
						|
ISO 8859-6:1987 - Arabic.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_7
 | 
						|
@tab
 | 
						|
iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
 | 
						|
@tab
 | 
						|
table / iso_8859_7
 | 
						|
@tab
 | 
						|
ISO 8859-7:1987 - Greek.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_8
 | 
						|
@tab
 | 
						|
iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
 | 
						|
@tab
 | 
						|
table / iso_8859_8
 | 
						|
@tab
 | 
						|
ISO 8859-8:1988 - Hebrew.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_8859_9
 | 
						|
@tab
 | 
						|
iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
 | 
						|
@tab
 | 
						|
table / iso_8859_9
 | 
						|
@tab
 | 
						|
ISO 8859-9:1989 - Latin 5, Turkish.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
iso_ir_111
 | 
						|
@tab
 | 
						|
ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
 | 
						|
@tab
 | 
						|
table / iso_ir_111
 | 
						|
@tab
 | 
						|
ISO IR 111/ECMA Cyrillic.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
koi8_r
 | 
						|
@tab
 | 
						|
cskoi8r, koi8r, koi8
 | 
						|
@tab
 | 
						|
table / koi8_r
 | 
						|
@tab
 | 
						|
RFC 1489 Cyrillic.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
koi8_ru
 | 
						|
@tab
 | 
						|
koi8ru
 | 
						|
@tab
 | 
						|
table / koi8_ru
 | 
						|
@tab
 | 
						|
The obsolete Ukrainian.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
koi8_u
 | 
						|
@tab
 | 
						|
koi8u
 | 
						|
@tab
 | 
						|
table / koi8_u
 | 
						|
@tab
 | 
						|
RFC 2319 Ukrainian.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
koi8_uni
 | 
						|
@tab
 | 
						|
koi8uni
 | 
						|
@tab
 | 
						|
table / koi8_uni
 | 
						|
@tab
 | 
						|
KOI8 Unified.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
ucs_2
 | 
						|
@tab
 | 
						|
ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
 | 
						|
@tab
 | 
						|
ucs_2 / (UCS)
 | 
						|
@tab
 | 
						|
ISO-10646-UCS-2. Big Endian, ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
ucs_2_internal
 | 
						|
@tab
 | 
						|
ucs2_internal, ucs_2internal, ucs2internal
 | 
						|
@tab
 | 
						|
ucs_2_internal / (UCS)
 | 
						|
@tab
 | 
						|
ISO-10646-UCS-2 in system byte order.
 | 
						|
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
ucs_2be
 | 
						|
@tab
 | 
						|
ucs2be
 | 
						|
@tab
 | 
						|
ucs_2 / (UCS)
 | 
						|
@tab
 | 
						|
Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
 | 
						|
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
ucs_2le
 | 
						|
@tab
 | 
						|
ucs2le
 | 
						|
@tab
 | 
						|
ucs_2 / (UCS)
 | 
						|
@tab
 | 
						|
Little Endian version of ISO-10646-UCS-2.
 | 
						|
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
ucs_4
 | 
						|
@tab
 | 
						|
ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
 | 
						|
@tab
 | 
						|
ucs_4 / (UCS)
 | 
						|
@tab
 | 
						|
ISO-10646-UCS-4. Big Endian, ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
ucs_4_internal
 | 
						|
@tab
 | 
						|
ucs4_internal, ucs_4internal, ucs4internal
 | 
						|
@tab
 | 
						|
ucs_4_internal / (UCS)
 | 
						|
@tab
 | 
						|
ISO-10646-UCS-4 in system byte order.
 | 
						|
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
ucs_4be
 | 
						|
@tab
 | 
						|
ucs4be
 | 
						|
@tab
 | 
						|
ucs_4 / (UCS)
 | 
						|
@tab
 | 
						|
Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
 | 
						|
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
ucs_4le
 | 
						|
@tab
 | 
						|
ucs4le
 | 
						|
@tab
 | 
						|
ucs_4 / (UCS)
 | 
						|
@tab
 | 
						|
Little Endian version of ISO-10646-UCS-4.
 | 
						|
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
us_ascii
 | 
						|
@tab
 | 
						|
ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
 | 
						|
@tab
 | 
						|
us_ascii / (ASCII)
 | 
						|
@tab
 | 
						|
7-bit ASCII.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
utf_16
 | 
						|
@tab
 | 
						|
utf16
 | 
						|
@tab
 | 
						|
utf_16 / (UCS)
 | 
						|
@tab
 | 
						|
RFC 2781 UTF-16. The very first ZWNBSP code (U+FEFF) in the stream is interpreted as a BOM.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
utf_16be
 | 
						|
@tab
 | 
						|
utf16be
 | 
						|
@tab
 | 
						|
utf_16 / (UCS)
 | 
						|
@tab
 | 
						|
Big Endian version of RFC 2781 UTF-16.
 | 
						|
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
utf_16le
 | 
						|
@tab
 | 
						|
utf16le
 | 
						|
@tab
 | 
						|
utf_16 / (UCS)
 | 
						|
@tab
 | 
						|
Little Endian version of RFC 2781 UTF-16.
 | 
						|
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
utf_8
 | 
						|
@tab
 | 
						|
utf8
 | 
						|
@tab
 | 
						|
utf_8 / (UCS)
 | 
						|
@tab
 | 
						|
RFC 3629 UTF-8.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
win_1250
 | 
						|
@tab
 | 
						|
cp1250
 | 
						|
@tab
table / win_1250
 | 
						|
@tab
 | 
						|
Win-1250 Croatian.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
win_1251
 | 
						|
@tab
 | 
						|
cp1251
 | 
						|
@tab
 | 
						|
table / win_1251
 | 
						|
@tab
 | 
						|
Win-1251 - Cyrillic.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
win_1252
 | 
						|
@tab
 | 
						|
cp1252
 | 
						|
@tab
 | 
						|
table / win_1252
 | 
						|
@tab
 | 
						|
Win-1252 - Latin 1.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
win_1253
 | 
						|
@tab
 | 
						|
cp1253
 | 
						|
@tab
 | 
						|
table / win_1253
 | 
						|
@tab
 | 
						|
Win-1253 - Greek.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
win_1254
 | 
						|
@tab
 | 
						|
cp1254
 | 
						|
@tab
 | 
						|
table / win_1254
 | 
						|
@tab
 | 
						|
Win-1254 - Turkish.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
win_1255
 | 
						|
@tab
 | 
						|
cp1255
 | 
						|
@tab
 | 
						|
table / win_1255
 | 
						|
@tab
 | 
						|
Win-1255 - Hebrew.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
win_1256
 | 
						|
@tab
 | 
						|
cp1256
 | 
						|
@tab
 | 
						|
table / win_1256
 | 
						|
@tab
 | 
						|
Win-1256 - Arabic.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
win_1257
 | 
						|
@tab
 | 
						|
cp1257
 | 
						|
@tab
 | 
						|
table / win_1257
 | 
						|
@tab
 | 
						|
Win-1257 - Baltic.
 | 
						|
 | 
						|
 | 
						|
@item
 | 
						|
win_1258
 | 
						|
@tab
 | 
						|
cp1258
 | 
						|
@tab
 | 
						|
table / win_1258
 | 
						|
@tab
 | 
						|
Win-1258 - Vietnamese.
 | 
						|
@end multitable
 | 
						|
 | 
						|
 | 
						|
 | 
						|
 | 
						|
 | 
						|
@page
 | 
						|
@node iconv design decisions
 | 
						|
@section iconv design decisions
 | 
						|
@findex CCS table
 | 
						|
@findex CES converter
 | 
						|
@findex Speed-optimized tables
 | 
						|
@findex Size-optimized tables
 | 
						|
@*
 | 
						|
The first iconv library design issue arises when considering the
 | 
						|
following two design approaches:
 | 
						|
 | 
						|
@enumerate
 | 
						|
@item
 | 
						|
Have modules which implement conversion from encoding A to encoding B
and vice versa, i.e., each conversion module serves one pair of encodings.
@item
Have modules which implement conversion from encoding A to a fixed
encoding C and vice versa, i.e., each conversion module serves
one encoding A and the fixed encoding C. In this case, converting from
encoding A to encoding B requires two modules (one to convert
from A to C and one to convert from C to B).
 | 
						|
@end enumerate
 | 
						|
 | 
						|
@*
 | 
						|
Clearly, there is a tradeoff between generality/flexibility and
efficiency: the first method is more efficient since it converts
directly; however, it is less flexible since a distinct module is
needed for each encoding pair.
 | 
						|
 | 
						|
@*
 | 
						|
The Newlib iconv model uses the second method and always converts through the 32-bit
UCS, but its design also allows one to write specialized conversion
modules if the conversion speed is critical.
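@*
The second approach can be sketched in a few lines of Python. This is not
Newlib code: Python's built-in codecs stand in for the per-encoding modules,
a list of integer code points stands in for the 32-bit UCS pivot, and all
function names are hypothetical.

```python
# Illustration only: Python codecs stand in for the per-encoding modules.

def to_ucs(data, encoding):
    """Module 'A -> C': decode bytes into a list of UCS character codes."""
    return [ord(ch) for ch in data.decode(encoding)]

def from_ucs(codes, encoding):
    """Module 'C -> B': encode a list of UCS character codes into bytes."""
    return "".join(chr(c) for c in codes).encode(encoding)

def convert(data, src, dst):
    """Convert src -> dst through the fixed UCS pivot (two modules)."""
    return from_ucs(to_ucs(data, src), dst)

def modules_needed(n):
    """Module counts for n encodings: (one per pair, one per encoding)."""
    return n * (n - 1) // 2, n

koi8_text = "\u043c\u0438\u0440".encode("koi8_r")
print(convert(koi8_text, "koi8_r", "utf-8"))
print(modules_needed(40))
```

With the pivot, adding an encoding means adding one module, whereas the
pairwise approach needs a new module for every existing encoding.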
 | 
						|
 | 
						|
@*
 | 
						|
The second design issue is how to break down (decompose) encodings.
The Newlib iconv library uses the fact that any encoding may be
considered as one or more CCS plus a CES. Accordingly, it decomposes its
conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
tables}. CCS tables map CCS to UCS and vice versa; the CES converters
map CCS to the encoding and vice versa.
 | 
						|
 | 
						|
@*
 | 
						|
As an example, let's consider the conversion between the big5 and
EUC-TW encodings. The big5 encoding may be decomposed into the ASCII and BIG5
CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
and CNS11643_PLANE14 CCS-es plus the EUC CES.
 | 
						|
 | 
						|
@*
 | 
						|
The euc_tw -> big5 conversion is performed as follows:
 | 
						|
 | 
						|
@enumerate
 | 
						|
@item
 | 
						|
The EUC converter transforms the EUC-TW encoding to the corresponding CCS-es
(the CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14
CCS-es);
 | 
						|
@item
 | 
						|
The obtained CCS codes are transformed to the UCS codes using the CNS11643_PLANE1,
 | 
						|
CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
 | 
						|
@item
 | 
						|
The resulting UCS codes are transformed to the ASCII and BIG5 codes using
 | 
						|
the corresponding CCS tables;
 | 
						|
@item
 | 
						|
The obtained CCS codes are transformed to the big5 encoding using the corresponding
 | 
						|
CES converter.
 | 
						|
@end enumerate
 | 
						|
 | 
						|
@*
 | 
						|
Analogously, the backward conversion is performed as follows:
 | 
						|
 | 
						|
@enumerate
 | 
						|
@item
 | 
						|
The BIG5 converter transforms the big5 encoding to the corresponding CCS-es
(the ASCII and BIG5 CCS-es);
 | 
						|
@item
 | 
						|
The obtained CCS codes are transformed to the UCS codes using the ASCII and BIG5 CCS tables;
 | 
						|
@item
 | 
						|
The resulting UCS codes are transformed to the CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14 codes using
 | 
						|
the corresponding CCS tables;
 | 
						|
@item
 | 
						|
The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
 | 
						|
CES converter.
 | 
						|
@end enumerate
 | 
						|
 | 
						|
@*
 | 
						|
Note that the above is just an example; the real names of the CES converters
and the CCS tables implemented in the Newlib iconv library are slightly different.
 | 
						|
 | 
						|
@*
 | 
						|
The third design issue also relates to flexibility. Obviously, it isn't
desirable to always link all the CES converters and the CCS tables to the library;
instead, we want to be able to load the needed converters and tables
dynamically on demand. This isn't a problem on "big" machines such as
a PC, but it may be very problematic within "small" embedded systems.
 | 
						|
 | 
						|
@*
 | 
						|
Since the CCS tables are just data, it is possible to load them
dynamically from external files.  The CES converters, on the other hand,
are algorithms with some code, so a dynamic library loading
capability is required.
 | 
						|
 | 
						|
@*
 | 
						|
Apart from possible restrictions imposed by embedded systems (small
RAM for example), Newlib itself has no dynamic library support and
therefore all the CES converters which will ever be used must be linked into
the library.  However, dynamic loading of the CCS tables is possible and is
implemented in the Newlib iconv library.  It may be enabled via the Newlib
configure script options.
 | 
						|
 | 
						|
@*
 | 
						|
The next design issue is fine-tuning the iconv library
 | 
						|
configuration.  One important ability is for iconv to not link all its
 | 
						|
converters and tables (if dynamic loading is not enabled) but instead,
 | 
						|
enable only those encodings which are specified at configuration
 | 
						|
time (see the section about the configure script options).
 | 
						|
 | 
						|
@*
 | 
						|
In addition, the Newlib iconv library configure options distinguish between
conversion directions. This means that not only the supported encodings are
selectable, but the conversion direction as well. For example, if the user wants
a configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use the "UTF-16 to UTF-8" conversions, he or she can
enable only
this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
be included), thus saving some memory (note that this technique makes it possible to
exclude one half of a CCS table from linking, which may be quite large).
 | 
						|
 | 
						|
@*
 | 
						|
One more design aspect is the choice between speed- and size-optimized tables. Users can
select between them using configure script options. The
speed-optimized CCS tables are the same as the size-optimized ones in
the case of 8-bit CCS-es (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized
CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the
other hand, conversion with the speed-optimized tables is several times faster.
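@*
The tradeoff can be illustrated with two hypothetical table layouts for a
16-bit CCS. This Python sketch does not reproduce the Newlib table formats;
the flat array, the run-based layout, the @code{INVALID} marker and the sample
mapping are all assumptions made for the illustration.

```python
# Illustration only: hypothetical table layouts, not the Newlib .cct format.
from bisect import bisect_right

INVALID = 0xFFFF  # hypothetical marker for "no mapping"

def build_speed_table(mapping):
    """Flat 65536-entry table: a lookup is a single index operation."""
    table = [INVALID] * 0x10000
    for code, ucs in mapping.items():
        table[code] = ucs
    return table

def build_size_table(mapping):
    """Compact table: runs of consecutive codes stored as (start, values)."""
    starts, runs = [], []
    for code in sorted(mapping):
        if starts and code == starts[-1] + len(runs[-1]):
            runs[-1].append(mapping[code])   # extends the current run
        else:
            starts.append(code)              # begins a new run
            runs.append([mapping[code]])
    return starts, runs

def lookup_speed(table, code):
    return table[code]

def lookup_size(size_table, code):
    """Binary search for the run that may hold code, then index into it."""
    starts, runs = size_table
    i = bisect_right(starts, code) - 1
    if i >= 0 and code - starts[i] < len(runs[i]):
        return runs[i][code - starts[i]]
    return INVALID

# A made-up sparse 16-bit CCS -> UCS mapping consisting of two runs.
mapping = {0xA140 + i: 0x4E00 + i for i in range(100)}
mapping.update({0xC940 + i: 0x5000 + i for i in range(50)})

speed = build_speed_table(mapping)
size = build_size_table(mapping)
```

Both layouts return the same codes; the compact one stores only the mapped
entries but pays for a search on every lookup.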
 | 
						|
 | 
						|
@*
 | 
						|
It's worth stressing that support for a new encoding can't be
added dynamically to an already compiled Newlib library, even if it
needs only an additional CCS table and iconv is configured to use
external files with CCS tables (this isn't a fundamental restriction;
the ability to add new table-based encoding support dynamically, just
by adding a new .cct file, could easily be added).
 | 
						|
 | 
						|
@*
 | 
						|
Theoretically, the compiled-in CCS tables should be more appropriate for
 | 
						|
embedded systems than dynamically loaded CCS tables.  This is because the compiled-in tables are read-only and can be placed in ROM
 | 
						|
whereas dynamic loading requires RAM.  Moreover, in the current iconv
 | 
						|
implementation, a distinct copy of the dynamic CCS file is loaded for each opened iconv descriptor even in case of the same encoding.
 | 
						|
This means, for example, that if two iconv descriptors for
 | 
						|
"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
 | 
						|
the koi8-r .cct file will be loaded (actually, iconv loads only the needed part
 | 
						|
of these files).  On the other hand, in the case of compiled-in CCS tables, there will always be only one copy.
 | 
						|
 | 
						|
@page
 | 
						|
@node iconv configuration
 | 
						|
@section iconv configuration
 | 
						|
@findex iconv configuration
 | 
						|
@findex --enable-newlib-iconv-encodings
 | 
						|
@findex --enable-newlib-iconv-from-encodings
 | 
						|
@findex --enable-newlib-iconv-to-encodings
 | 
						|
@findex --enable-newlib-iconv-external-ccs
 | 
						|
@findex NLSPATH
 | 
						|
@*
 | 
						|
To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
 | 
						|
script option should be used. This option accepts a comma-separated list
 | 
						|
of @emph{encodings} that should be enabled. The option enables each encoding in both
 | 
						|
("to" and "from") directions.
 | 
						|
 | 
						|
@*
 | 
						|
The @option{--enable-newlib-iconv-from-encodings} configure script option enables
 | 
						|
"from" support for each encoding that was passed to it.
 | 
						|
 | 
						|
@*
 | 
						|
The @option{--enable-newlib-iconv-to-encodings} configure script option enables
 | 
						|
"to" support for each encoding that was passed to it.
 | 
						|
 | 
						|
@*
 | 
						|
Example: if the user plans to use only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
"KOI8-R -> UCS-2" conversions, the optimal way (the minimum of iconv
code and data will be linked) is to configure Newlib with the following
options:
 | 
						|
@*
 | 
						|
@code{--enable-newlib-iconv-encodings=UTF-8
 | 
						|
--enable-newlib-iconv-from-encodings=KOI8-R
 | 
						|
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
 | 
						|
@*
 | 
						|
which is the same as
 | 
						|
@*
 | 
						|
@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
 | 
						|
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}
 | 
						|
@*
 | 
						|
The user may also just use the
 | 
						|
@*
 | 
						|
@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}
 | 
						|
@*
 | 
						|
configure script option, but it is less optimal since there will be
 | 
						|
some unneeded data and code.
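@*
The reasoning behind the example above can be written down as a small
Python helper. The function @code{iconv_configure_options} is hypothetical
and is not part of Newlib; it only derives the three option values from a
list of planned conversions.

```python
# Illustration only: a hypothetical helper, not part of Newlib.

def iconv_configure_options(conversions):
    """Derive minimal configure options from (from, to) encoding pairs."""
    sources = {src for src, _ in conversions}
    targets = {dst for _, dst in conversions}
    both = sources & targets            # needed in both directions
    options = []
    for flag, names in (
        ("--enable-newlib-iconv-encodings", both),
        ("--enable-newlib-iconv-from-encodings", sources - both),
        ("--enable-newlib-iconv-to-encodings", targets - both),
    ):
        if names:
            options.append(flag + "=" + ",".join(sorted(names)))
    return options

planned = [("KOI8-R", "UTF-8"), ("UTF-8", "ISO-8859-5"), ("KOI8-R", "UCS-2")]
for option in iconv_configure_options(planned):
    print(option)
```

An encoding that appears only as a source or only as a target is enabled in
that single direction; only encodings used both ways go to the general option.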
 | 
						|
 | 
						|
@*
 | 
						|
The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
 | 
						|
ability to work with external CCS files.
 | 
						|
 | 
						|
@*
 | 
						|
The @option{--enable-target-optspace} Newlib configure script option also affects
the iconv library. If this option is present, the library uses the
size-optimized CCS tables. This means that only the size-optimized CCS
tables will be linked or, if the
@option{--enable-newlib-iconv-external-ccs} configure script option was used,
the iconv library will load the size-optimized tables. If the
@option{--enable-target-optspace} configure script option is disabled,
the speed-optimized CCS tables are used.
 | 
						|
 | 
						|
@*
 | 
						|
Note: .cct files are searched by iconv_open in the $NLSPATH/iconv_data/ directory.
 | 
						|
Thus, the NLSPATH environment variable should be set.
 | 
						|
 | 
						|
 | 
						|
 | 
						|
 | 
						|
 | 
						|
@page
@node Encoding names
@section Encoding names
@findex encoding name
@findex encoding alias
@findex normalized name
@*
Each encoding has one @dfn{name} and a number of @dfn{aliases}. When
the user works with the iconv library (i.e., when the @code{iconv_open} call
is used), either the name or any of the aliases may be used. The same
applies when encoding names are used in configure script options.

@*
Names and aliases may be specified in any case (small or capital
letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.

@*
Internally the Newlib iconv library always converts aliases to names. It
also converts names and aliases to the @dfn{normalized} form, which means
that all capital letters are converted to small letters and the @kbd{-}
symbols are converted to @kbd{_} symbols.
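
@*
The normalization rule can be sketched in a few lines of C. This is only an
illustration of the rule as stated; the @code{normalize_encoding_name}
function is a hypothetical name and is not the routine Newlib uses
internally.

@verbatim
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not Newlib's actual routine): write the normalized
 * form of `name` into `out`, which can hold `outcap` bytes including the
 * terminating NUL. Returns 0 on success, -1 if `out` is too small.
 */
int normalize_encoding_name(const char *name, char *out, size_t outcap)
{
    size_t i;

    assert(name != NULL && out != NULL);

    for (i = 0; name[i] != '\0'; i++) {
        if (i + 1 >= outcap)
            return -1;              /* no room for this char plus NUL */
        if (name[i] == '-')
            out[i] = '_';           /* '-' is equivalent to '_' */
        else
            out[i] = (char)tolower((unsigned char)name[i]);
    }
    if (i >= outcap)
        return -1;                  /* only reachable when outcap is 0 */
    out[i] = '\0';
    return 0;
}
```
@end verbatim

With this sketch, both "KOI8-R" and "koi8_r" normalize to the same string.
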
@page
@node CCS tables
@section CCS tables
@findex Size-optimized CCS table
@findex Speed-optimized CCS table
@findex mktbl.pl Perl script
@findex .cct files
@findex The CCT tables source files
@findex CCS source files
@*
The iconv library stores files with CCS tables in the @emph{ccs/}
subdirectory. The tables for any CCS may be kept in two forms: in binary form
(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
of compilable .c source files. The .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
The .c files are linked into the Newlib library if the corresponding
encoding is enabled.

@*
As stated earlier, the Newlib iconv library performs all
conversions through the 32-bit UCS, but the codes used
in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set.
Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
used instead of the 32-bit UCS-4.

@*
CCS tables may be 8 or 16 bits wide. 8-bit CCS tables map 8-bit CCS codes to
16-bit UCS-2 codes and vice versa, while 16-bit CCS tables map
16-bit CCS codes to 16-bit UCS-2 codes and vice versa.
8-bit tables are small, while 16-bit tables may be quite large.
Because of this, 16-bit CCS tables may be
either speed- or size-optimized. Size-optimized CCS tables are
smaller than speed-optimized ones, but the conversion process is
slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
the speed-optimized variant.

Each CCS table (both speed- and size-optimized) consists of
@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
UCS-2 codes.

@*
Almost all 16-bit CCS tables contain fewer than 0xFFFF codes, so
a lot of gaps exist.

@subsection Speed-optimized tables format
@*
In the case of 8-bit speed-optimized CCS tables, the "to_ucs" subtable format is
trivial: it is just an array of 256 16-bit UCS codes. Therefore, the
UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated
as @emph{Y = to_ucs[X]}.

@*
Obviously, the simplest way to create the "from_ucs" table or the
16-bit "to_ucs" table is to use a huge 16-bit array, as in the case
of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain
fewer than 0xFFFF code maps, and this fact may be exploited to reduce
the size of the CCS tables.

@*
This section describes the "UCS-2 -> CCS" 8-bit CCS table format. The
16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
direction and the number of CCS bits.

@*
In the case of the 8-bit speed-optimized table, the "from_ucs" subtable
corresponds to the "from_ucs" array and has the following layout:

@*
from_ucs array:
@*
-------------------------------------
@*
0xFF mapping (2 bytes) (only for
8-bit table).
@*
-------------------------------------
@*
Heading block
@*
-------------------------------------
@*
Block 1
@*
-------------------------------------
@*
Block 2
@*
-------------------------------------
@*
  ...
@*
-------------------------------------
@*
Block N
@*
-------------------------------------

@*
The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each
subrange is represented by a 256-element @dfn{block} (256 1-byte
elements, or 256 2-byte elements in the case of a 16-bit CCS table) whose
elements are the CCS codes of this subrange.
If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be
absent and there will be fewer than 256 blocks.

@*
Element number @emph{m} of @dfn{the heading block} (which contains
256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
If the subrange contains some codes, the @emph{m}-th element of
the heading block contains the offset of the corresponding block in the
"from_ucs" array. If there are no codes in the subrange, the heading
block element contains 0xFFFF.

@*
If there are some gaps in a block, the corresponding block elements have
the 0xFF value. If the 0xFF code is present in the CCS, its mapping
is defined in the first 2-byte element of the "from_ucs" array.

@*
With this table format, the algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.

@*
@enumerate
@item If @emph{Y} is equal to the value of the first 2-byte element
of the "from_ucs" array, @emph{X} is 0xFF. Else, continue to search.

@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.

@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
is no corresponding CCS code (error, wrong input data). Else, fetch the
"from_ucs" array index of the @emph{BlkN}-th block (@emph{BlkIdx}).

@item Calculate the offset of the @emph{X} code in its block:
@emph{Xindex = Y & 0xFF}

@item If the value of the @emph{Xindex}-th element of the block (which is
@emph{from_ucs[BlkIdx+Xindex]}) is 0xFF, there is no corresponding
CCS code (error, wrong input data). Else, @emph{X = from_ucs[BlkIdx+Xindex]}.
@end enumerate
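
@*
The steps above can be sketched in C as follows. This is a minimal
illustration and not Newlib's actual code. It assumes that the table is a
byte array, that 2-byte elements are stored in host byte order, and that the
heading block holds byte offsets from the start of the array. The names
@code{from_ucs_lookup}, @code{read16}, @code{NO_BLOCK} and @code{NO_CODE}
are hypothetical.

@verbatim
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NO_BLOCK 0xFFFFu  /* heading block element: subrange has no codes */
#define NO_CODE  0xFFu    /* block element: gap, no CCS code */

/* Read a 2-byte element stored in host byte order at byte offset `off`. */
static uint16_t read16(const uint8_t *tbl, size_t off)
{
    uint16_t v;
    memcpy(&v, tbl + off, sizeof v);
    return v;
}

/*
 * Hypothetical lookup in an 8-bit speed-optimized "from_ucs" table.
 * Assumed layout: bytes 0-1 hold the UCS-2 code mapped to CCS code 0xFF,
 * bytes 2-513 hold the heading block (256 2-byte byte offsets into `tbl`),
 * and the 256-byte blocks follow.
 * Returns the CCS code (0-255), or -1 if `y` has no mapping.
 */
int from_ucs_lookup(const uint8_t *tbl, uint16_t y)
{
    assert(tbl != NULL);

    /* Step 1: the 0xFF code is stored out of band. */
    if (read16(tbl, 0) == y)
        return 0xFF;

    /* Step 2: block number. */
    unsigned blkn = (y & 0xFF00u) >> 8;

    /* Step 3: offset of the block, or NO_BLOCK. */
    uint16_t blk_off = read16(tbl, 2 + 2 * (size_t)blkn);
    if (blk_off == NO_BLOCK)
        return -1;

    /* Steps 4 and 5: index within the block, then the code itself. */
    unsigned xindex = y & 0xFFu;
    uint8_t x = tbl[blk_off + xindex];
    return x == NO_CODE ? -1 : (int)x;
}
```
@end verbatim

The lookup costs two array reads regardless of the table contents, which is
why this format is the speed-optimized one.
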

@subsection Size-optimized tables format
@*
As stated above, size-optimized tables exist only for 16-bit CCS-es.
This is because the difference between the speed-optimized
and the size-optimized table sizes is too small in the case of 8-bit CCS-es.

@*
The formats of the "to_ucs" and "from_ucs" subtables are equivalent in the case of
size-optimized tables.

This section describes the format of the "UCS-2 -> CCS" size-optimized
CCS table. The format of the "CCS -> UCS-2" table is the same.

The idea of the size-optimized tables is to split the UCS-2 codes
("from" codes) into @dfn{ranges} (a @dfn{range} is a number of consecutive UCS-2 codes).
Then CCS codes ("to" codes) are stored only for the codes from these
ranges. Distinct "from" codes which belong to no range (@dfn{unranged codes}) are stored
together with the corresponding "to" codes.

@*
The following is the layout of the size-optimized table array:

@*
size_arr array:
@*
-------------------------------------
@*
Ranges number (2 bytes)
@*
-------------------------------------
@*
Unranged codes number (2 bytes)
@*
-------------------------------------
@*
Unranged codes array index (2 bytes)
@*
-------------------------------------
@*
Ranges indexes (triads)
@*
-------------------------------------
@*
Ranges
@*
-------------------------------------
@*
Unranged codes array
@*
-------------------------------------

@*
The @dfn{Ranges indexes} section of @emph{size_arr} helps to find
the offset of the needed range in @emph{size_arr} and has
the following format (triads):
@*
the first code in range, the last code in range, range offset.

@*
The array of these triads is sorted by the first element, therefore it is
possible to quickly find the needed range index.

@*
Each range has a corresponding sub-array containing the "to" codes. These
sub-arrays are stored in the place marked as "Ranges" in the layout
diagram.

@*
The "Unranged codes array" contains pairs ("from" code, "to" code) for
each unranged code. The array of these pairs is sorted by "from" code
values, therefore it is possible to find the needed pair quickly.

@*
Note that each range requires 6 bytes to form its index. If, for
example, there are two ranges (1 - 5 and 9 - 10), and one unranged code
(7), 12 bytes are needed for two range indexes and 4 bytes for the unranged
code (total 16). But it is better to join both ranges as 1 - 10 and
mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the
range index and 4 bytes to mark codes 6 and 8 as absent are needed
(total 10 bytes). This optimization is done in the size-optimized tables.
Thus, ranges may contain small gaps. The absent codes in ranges are marked
as 0xFFFF.

@*
Note that a pair of consecutive "from" codes is stored as two unranged codes
rather than as a range, since forming the range takes more space than
storing two unranged codes (5 two-byte elements against 4).
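
@*
The trade-off can be checked with a small size model. The sketch below is an
assumption made for illustration and is not taken from 'mktbl.pl'. Unlike the
overhead figures quoted above (16 and 10 bytes), it counts every 2-byte
element, including the "to" codes stored in ranges, so its totals are larger,
but the comparison comes out the same way. Under this model the "5 against 4"
comparison corresponds to 10 against 8 bytes. The names @code{table_bytes}
and @code{check_examples} are hypothetical.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical size model (not taken from mktbl.pl). Every element is
 * 2 bytes: a range costs one triad plus one slot per code it spans
 * (present or marked absent); an unranged code costs one pair.
 */
size_t table_bytes(size_t ranges, size_t range_slots, size_t unranged)
{
    return 2 * (3 * ranges + range_slots + 2 * unranged);
}

/* The examples from the text, expressed with the model above. */
void check_examples(void)
{
    /* Ranges 1-5 and 9-10 (7 slots) plus unranged code 7. */
    assert(table_bytes(2, 7, 1) == 30);
    /* One joined range 1-10 with codes 6 and 8 marked absent. */
    assert(table_bytes(1, 10, 0) == 26);
    /* Two adjacent codes: one range (5 elements) vs two pairs (4). */
    assert(table_bytes(1, 2, 0) == 10);
    assert(table_bytes(0, 0, 2) == 8);
}
```
@end verbatim
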

@*
The algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
CCS" size-optimized table is as follows.

@*
@enumerate
@item Try to find the corresponding triad in the "Ranges indexes"
section. Since we are searching in a sorted array, we can do it quickly
(divide by 2, compare, etc).

@item If the triad is found, fetch the @emph{X} code from the corresponding
range array. If it is 0xFFFF, return an error.

@item If there is no corresponding triad, search for the @emph{X} code among the
sorted unranged codes. Return an error if nothing was found.
@end enumerate
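
@*
The search can be sketched in C as two binary searches. This is a minimal
illustration, not Newlib's actual code. It assumes that the table is an array
of 2-byte elements in host byte order, laid out as in the diagram above, and
that the range offsets and the unranged codes array index are element indexes
into that array. The names @code{size_tbl_lookup} and @code{ABSENT} are
hypothetical.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ABSENT 0xFFFFu  /* marks a gap inside a range */

/*
 * Hypothetical lookup in a size-optimized table:
 *   t[0] ranges number, t[1] unranged codes number,
 *   t[2] unranged codes array index, t[3...] triads (first, last, offset).
 * Returns the "to" code, or -1 if `y` has no mapping.
 */
long size_tbl_lookup(const uint16_t *t, uint16_t y)
{
    assert(t != NULL);

    /* Step 1: binary search of the triads, sorted by first code. */
    size_t lo = 0, hi = t[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *triad = t + 3 + 3 * mid;
        if (y < triad[0]) {
            hi = mid;
        } else if (y > triad[1]) {
            lo = mid + 1;
        } else {
            /* Step 2: fetch the code from the range sub-array. */
            uint16_t x = t[triad[2] + (y - triad[0])];
            return x == ABSENT ? -1 : (long)x;
        }
    }

    /* Step 3: binary search of the sorted ("from", "to") pairs. */
    const uint16_t *pairs = t + t[2];
    lo = 0;
    hi = t[1];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (y < pairs[2 * mid])
            hi = mid;
        else if (y > pairs[2 * mid])
            lo = mid + 1;
        else
            return pairs[2 * mid + 1];
    }
    return -1;
}
```
@end verbatim

Both searches are logarithmic in the number of ranges and unranged codes,
which is slower than the direct indexing used by the speed-optimized format.
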

@subsection .cct and .c CCS Table files
@*
The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
speed-optimized tables. The .c source files for 16-bit CCS tables have
"to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
tables.

@*
When .c files are compiled and used, all the 16-bit and 32-bit values
have the native endian format (Big Endian for the BE systems and Little
Endian for the LE systems) since they are compiled for the system before
they are used.

@*
In the case of .cct files, which are intended for dynamic loading of CCS
tables, the CCS tables are stored either in LE or BE format. Since the
.cct files are generated by the 'mktbl.pl' Perl script, it is possible
to choose the endianness of the tables. It is also possible to store two
copies (both LE and BE) of the CCS tables in one .cct file. The default
.cct files (which come with the Newlib sources) have both LE and BE CCS
tables. The Newlib iconv library automatically chooses the needed CCS tables
(with appropriate endianness).

@*
Note that the .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} option is used.

@subsection The 'mktbl.pl' Perl script
@*
The 'mktbl.pl' script is intended to generate .cct and .c CCS table
files from the @dfn{CCS source files}.

@*
The CCS source files are just text files which have one or more columns
with CCS <-> UCS-2 code mappings. For an example of a CCS
source file, see one of them at the URLs given below.

@*
The following table describes where the source files for CCS table files
provided by the Newlib distribution are located.

@multitable @columnfractions .25 .75
@item
Name
@tab
URL

@item
@tab

@item
big5
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

@item
cns11643_plane1
cns11643_plane14
cns11643_plane2
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT

@item
cp775
cp850
cp852
cp855
cp866
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/

@item
iso_8859_1
iso_8859_2
iso_8859_3
iso_8859_4
iso_8859_5
iso_8859_6
iso_8859_7
iso_8859_8
iso_8859_9
iso_8859_10
iso_8859_11
iso_8859_13
iso_8859_14
iso_8859_15
@tab
http://www.unicode.org/Public/MAPPINGS/ISO8859/

@item
iso_ir_111
@tab
http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT

@item
jis_x0201_1976
jis_x0208_1990
jis_x0212_1990
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT

@item
koi8_r
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT

@item
koi8_ru
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT

@item
koi8_u
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT

@item
koi8_uni
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT

@item
ksx1001
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT

@item
win_1250
win_1251
win_1252
win_1253
win_1254
win_1255
win_1256
win_1257
win_1258
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
@end multitable

The CCS source files aren't distributed with Newlib because of license
restrictions in most of Unicode.org's files.

The following are the 'mktbl.pl' options which were used to generate the .cct
files. Note that to generate CCS table source files, the @option{-s} option
should be added.

@enumerate
@item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct,
win_1256.cct, win_1258.cct, win_1251.cct,
win_1253.cct, win_1255.cct, win_1257.cct,
koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
files, only the @option{-i <SRC_FILE_NAME>} option was used.

@item To generate the jis_x0208_1990.cct file, the
@option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.

@item To generate the cns11643_plane1.cct file, the
@option{-i cns11643.txt -p1 -N cns11643_plane1  -o cns11643_plane1.cct}
options were used.

@item To generate the cns11643_plane2.cct file, the
@option{-i cns11643.txt -p2 -N cns11643_plane2  -o cns11643_plane2.cct}
options were used.

@item To generate the cns11643_plane14.cct file, the
@option{-i cns11643.txt -p0xE -N cns11643_plane14  -o cns11643_plane14.cct}
options were used.
@end enumerate

@*
For more info about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.

@*
It is assumed that CCS codes are 16 or fewer bits wide. If there are wider CCS codes
in the CCS source file, the bits above the 16th define the plane (see the
cns11643.txt CCS source file).

@*
Sometimes, it is impossible to map some CCS codes to the 16-bit UCS if, for example,
several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
a pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
codes}) aren't simply rejected; instead, they are mapped to the default
UCS-2 code (which is currently the @kbd{?} character's code).

@page
@node CES converters
@section CES converters
@findex PCS
@*
Similar to the CCS tables, CES converters are also split into "from UCS"
and "to UCS" parts. Depending on the iconv library configuration, these
parts are enabled or disabled.

@*
The following is the list of CES converters which are currently present
in the Newlib iconv library.

@itemize @bullet
@item
@emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
encodings. The @emph{euc} CES converter uses the @emph{table} and the
@emph{us_ascii} CES converters.

@item
@emph{table} - this CES converter corresponds to "null" and just performs
table-based conversion using 8- and 16-bit CCS tables. This converter
is also used by any other CES converter which needs CCS table-based
conversions. The @emph{table} converter is also responsible for loading
.cct files.

@item
@emph{table_pcs} - this is a wrapper over the @emph{table} converter
which is intended for 16-bit encodings which also use the @dfn{Portable
Character Set} (@dfn{PCS}), which is the same as @emph{US-ASCII}.
This means that if the first byte of the CCS code is in the range [0x00-0x7f],
it is a 7-bit PCS code. Else, it is a 16-bit CCS code. Of course,
the 16-bit codes must not contain bytes in the range [0x00-0x7f].
The @emph{big5} encoding uses the @emph{table_pcs} CES converter and the
@emph{table_pcs} CES converter depends on the @emph{table} CES converter.

@item
@emph{ucs_2} - intended to support the @emph{ucs_2}, @emph{ucs_2be} and
@emph{ucs_2le} encodings.

@item
@emph{ucs_4} - intended to support the @emph{ucs_4}, @emph{ucs_4be} and
@emph{ucs_4le} encodings.

@item
@emph{ucs_2_internal} - intended to support the @emph{ucs_2_internal} encoding.

@item
@emph{ucs_4_internal} - intended to support the @emph{ucs_4_internal} encoding.

@item
@emph{us_ascii} - intended to support the @emph{us_ascii} encoding. In
principle, the most natural way to support the @emph{us_ascii} encoding
is to define the @emph{us_ascii} CCS and use the @emph{table} CES
converter. But for optimization purposes, the specialized
@emph{us_ascii} CES converter was created.

@item
@emph{utf_16} - intended to support the @emph{utf_16}, @emph{utf_16be} and
@emph{utf_16le} encodings.

@item
@emph{utf_8} - intended to support the @emph{utf_8} encoding.
@end itemize
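
@*
The first-byte rule described for the @emph{table_pcs} converter can be
sketched in C as follows. This is an illustration only and not Newlib's
actual @emph{table_pcs} code. The @code{pcs_next_code} function name, its
return convention and the choice to return the 2-byte code with the first
byte in the high 8 bits are assumptions made for this example.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical decoder front end: read one code from `in` (`len` bytes
 * available) and store it in `*code`. A first byte in [0x00-0x7f] is a
 * 1-byte PCS code; any other first byte starts a 2-byte CCS code.
 * Returns the number of bytes consumed, or 0 if the input is empty or
 * ends in the middle of a 2-byte code.
 */
size_t pcs_next_code(const uint8_t *in, size_t len, uint16_t *code)
{
    assert(in != NULL && code != NULL);

    if (len == 0)
        return 0;
    if (in[0] <= 0x7F) {
        *code = in[0];          /* PCS (US-ASCII) code, used as is */
        return 1;
    }
    if (len < 2)
        return 0;               /* truncated 2-byte code */

    /* 16-bit CCS code, to be looked up in the CCS table afterwards. */
    *code = (uint16_t)((in[0] << 8) | in[1]);
    return 2;
}
```
@end verbatim
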

@page
@node The encodings description file
@section The encodings description file
@findex encoding.deps description file
@findex mkdeps.pl Perl script
@*
To simplify the process of adding support for new encodings, a lot of
"glue" files are generated automatically.

@*
There is the 'encoding.deps' file in the @emph{lib/} subdirectory which
is used to describe the encodings' properties. The 'mkdeps.pl' Perl script
uses 'encoding.deps' to generate the "glue" files.

@*
The 'encoding.deps' file is composed of sections. Each section consists
of entries, and each entry contains an encoding, CES or CCS description.

@*
The syntax of the 'encoding.deps' file is very simple. Currently only two
sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.

@*
Each entry of the @emph{ENCODINGS} section describes one encoding and
contains the following information.

@itemize @bullet
@item
Encoding name (the @emph{ENCODING} field). The name should
be unique and only one name is possible.

@item
The encoding's CES converter name (the @emph{CES} field). Only one CES
converter is allowed.

@item
The whitespace-separated list of CCS table names which are used by the
encoding (the @emph{CCS} field).

@item
The whitespace-separated list of alias names (the @emph{ALIASES}
field).
@end itemize

@*
Note that all names in the 'encoding.deps' file must be in the normalized
form.

@*
Each entry of the @emph{CES_DEPENDENCIES} section describes the dependencies of
one CES converter. For example, the @emph{euc} CES converter depends on
the @emph{table} and the @emph{us_ascii} CES converters since the
@emph{euc} CES converter uses them. This means that both the @emph{table}
and @emph{us_ascii} CES converters should be linked if the @emph{euc}
CES converter is enabled.

@*
The @emph{CES_DEPENDENCIES} section defines the following:

@itemize @bullet
@item
the CES converter name for which the dependencies are defined in this
entry (the @emph{CES} field);

@item
the whitespace-separated list of CES converters which are needed for
this CES converter (the @emph{USED_CES} field).
@end itemize

@*
The 'mkdeps.pl' Perl script automatically solves the following tasks.

@itemize @bullet
@item
The user works with the iconv library in terms of encodings and doesn't know
anything about CES converters and CCS tables. The script automatically
generates code which enables all needed CES converters and CCS tables
for all encodings which were enabled by the user.

@item
The CES converters may have dependencies and the script automatically
generates the code which handles these dependencies.

@item
The list of each encoding's aliases is also automatically generated.

@item
The script uses a lot of macros in order to enable only the minimum set
of code/data which is needed to support the requested encodings in the
requested directions.
@end itemize

@*
The 'mkdeps.pl' Perl script interprets the 'encoding.deps'
file and generates the following files.

@itemize @bullet
@item
@emph{lib/encnames.h} - this header file contains macro definitions for all
encoding names.

@item
@emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array
is used to find the name of the requested encoding by its alias.

@item
@emph{ces/cesbi.c} - this file defines two arrays
(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
descriptions of the enabled "from UCS" and "to UCS" CES converters and the
names of encodings which are supported by these CES converters.

@item
@emph{ces/cesbi.h} - this file contains the set of macros which define
the set of CES converters which should be enabled given only the set of
enabled encodings (through macros defined in the
@emph{newlib.h} file). Note that one CES converter may handle several
encodings.

@item
@emph{ces/cesdeps.h} - the CES converter dependencies are handled in
this file.

@item
@emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
here.

@item
@emph{ccs/ccsnames.h} - this header file contains macro definitions for all
CCS names.

@item
@emph{encoding.aliases} - the list of supported encodings and their
aliases, which is intended for the Newlib configure scripts in order to
handle the iconv-related configure script options.
@end itemize

@page
@node How to add new encoding
@section How to add new encoding
@*
First, the new encoding should be broken down into CCS and CES. Then,
the process of adding the new encoding is split into the following activities.

@enumerate
@item Generate the .cct CCS file and the .c source file for the new
encoding's CCS (if it isn't already present). To do this, obtain the CCS source
file and run the 'mktbl.pl' script on it.

@item Write the corresponding CES converter (if it isn't already
present). Use the existing CES converters as an example.

@item
Add the corresponding entries to the 'encoding.deps' file and regenerate
the autogenerated "glue" files using the 'mkdeps.pl' script.

@item
Don't forget to add entries to the newlib/newlib.hin file.

@item
Of course, the 'Makefile.am'-s should also be updated (if new files were
added) and the 'Makefile.in'-s should be regenerated using the correct
version of 'automake'.

@item
Don't forget to update the documentation (the list of
supported encodings and CES converters).
@end enumerate

In case a new encoding doesn't fit the CES/CCS decomposition model or
it is desired to add specialized (non UCS-based) conversion support,
the Newlib iconv library code should be upgraded.

@page
@node The locale support interfaces
@section The locale support interfaces
@*
The Newlib iconv library also has some interface functions (besides the
@code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which
are intended for the Locale subsystem. All the locale-related code is
placed in the @emph{lib/iconvnls.c} file.

@*
The following is the description of the locale-related interfaces:

@itemize @bullet
@item
@code{_iconv_nls_open} - opens two iconv descriptors for "CCS ->
wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is
passed in the function parameters. The @emph{wchar_t} character encoding is
either ucs_2_internal or ucs_4_internal depending on the size of
@emph{wchar_t}.

@item
@code{_iconv_nls_conv} - the function is similar to the @code{iconv}
function, but if there is no character in the output encoding which
corresponds to the character in the input encoding, the default
conversion isn't performed (the @code{iconv} function sets such output
characters to the @kbd{?} symbol and this is the behavior which is
specified in SUSv3).

@item
@code{_iconv_nls_get_state} - returns the current encoding's shift state
(the @code{mbstate_t} object).

@item
@code{_iconv_nls_set_state} - sets the current encoding's shift state (the
@code{mbstate_t} object).

@item
@code{_iconv_nls_is_stateful} - checks whether the encoding is stateful
or stateless.

@item
@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
maximum number of bytes) of the encoding's characters.
@end itemize

@page
@node Contact
@section Contact
@*
The author of the original BSD iconv library (Alexander Chuguev) no longer
supports that code.

@*
Any questions regarding the iconv library may be forwarded to
Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
well as to the public Newlib mailing list.