libcyrillic

NOTES

This library automatically detects Russian text encodings* and performs service tasks on the text, such as conversion between the known charsets.

All of the most frequently used Russian charsets are supported (windows-1251, koi8-r, koi8-u, iso-8859-5, x-mac-cyrillic and ibm866). Some nice features are still to be done. Unicode support is also planned.

This library is not "string-aware". You _MUST_ specify the block size whenever you ask the library to process text. The library also provides wrappers around its own functions to simplify some tasks.

Like any other piece of software the library comes with NO WARRANTY.

* - in this document and in the source comments, "encoding" usually means "charset".

DOWNLOAD

INSTALL

  tar xvzf cyrillic.tar.gz
  cd cyrillic
  make install

or

  tar xvzf cyrillic.tar.gz
  cd cyrillic
  make install PFX=/usr/local

DEINSTALL

  tar xvzf cyrillic.tar.gz
  cd cyrillic
  make deinstall

or

  tar xvzf cyrillic.tar.gz
  cd cyrillic
  make deinstall PFX=/usr/local


USAGE

To use this library, include the following headers:

  #include <cyrillic.h>
  #include <cyrillic_export.h>

Compile your software as follows:

  cc ... -lcyrillic

FUNCTIONALITY

  char *_cyr_convert(char *buffer,unsigned long size,const char *table)

Converts size bytes of the buffer in place, using table as the mapping.

Always returns a pointer to the buffer.
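
The description above suggests a table-driven, in-place byte mapping. The following is a minimal sketch of that idea, not the library's code. It assumes a table is a 256-entry array indexed by the input byte (the source does not document the table layout), and "sketch_convert" and "sketch_fill_toy_table" are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a table-driven converter: maps each of the
 * `size` bytes in `buffer` through a 256-entry table, in place.
 * The 256-entry layout is an assumption, not the documented format. */
char *sketch_convert(char *buffer, unsigned long size,
                     const unsigned char table[256])
{
    unsigned long i;

    assert(buffer != NULL && table != NULL);
    for (i = 0; i < size; i++) {
        /* Go through unsigned char so bytes >= 0x80 index correctly. */
        buffer[i] = (char)table[(unsigned char)buffer[i]];
    }
    return buffer; /* always the pointer that was passed in */
}

/* Toy table: identity, except for two koi8-r letters that are mapped
 * to their windows-1251 codes (0xC1 -> 0xE0, 0xC5 -> 0xE5). */
void sketch_fill_toy_table(unsigned char table[256])
{
    int i;

    assert(table != NULL);
    for (i = 0; i < 256; i++)
        table[i] = (unsigned char)i;
    table[0xC1] = 0xE0;
    table[0xC5] = 0xE5;
}
```

Because the length is passed explicitly, a block such as { 'a', 0x00, 0xC1, 0xC5 } with size 4 is processed in full; the embedded zero byte does not end the conversion.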

  unsigned int _cyr_convert_char(unsigned int c,const char *table)

Converts the single character c using table as the mapping.

Returns converted value.

  int cyr_translate_src_encoding(const char *table)

Returns the numeric value of the source charset for the symbolic name table, or "CYR_TABLE_UNKNOWN" when the charset is unknown. This value is later used to select the proper conversion mapping.

  int cyr_translate_dst_encoding(const char *table)

Returns the numeric value of the destination charset for the symbolic name table, or "CYR_TABLE_UNKNOWN" when the charset is unknown. This value is later used to select the proper conversion mapping.
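
A name-to-number lookup of this kind can be sketched as follows. This is only an illustration: the enum values, the "sketch_" names and the case-insensitive comparison are assumptions, and the library's real numeric values are not given in this document. Only the charset names themselves come from the NOTES section.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical numeric ids; not the library's real values. */
enum sketch_table {
    SKETCH_TABLE_UNKNOWN = -1,
    SKETCH_TABLE_WIN1251,
    SKETCH_TABLE_KOI8R,
    SKETCH_TABLE_KOI8U,
    SKETCH_TABLE_ISO88595,
    SKETCH_TABLE_MAC,
    SKETCH_TABLE_IBM866,
    SKETCH_TABLE_COUNT
};

/* Symbolic names, in the same order as the enum above. */
static const char *const sketch_names[SKETCH_TABLE_COUNT] = {
    "windows-1251", "koi8-r", "koi8-u",
    "iso-8859-5", "x-mac-cyrillic", "ibm866"
};

/* Case-insensitive comparison (an assumption about name matching). */
int sketch_same_name(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* Returns the numeric id for `name`, or SKETCH_TABLE_UNKNOWN. */
int sketch_translate_encoding(const char *name)
{
    int i;

    if (name == NULL)
        return SKETCH_TABLE_UNKNOWN;
    for (i = 0; i < SKETCH_TABLE_COUNT; i++) {
        if (sketch_same_name(name, sketch_names[i]))
            return i;
    }
    return SKETCH_TABLE_UNKNOWN;
}
```

Callers would check for the unknown value before selecting a conversion mapping.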

  char *cyr_convert(char *buffer,unsigned long size,int table)

This is a wrapper for "_cyr_convert" that uses numeric table values.

  unsigned int cyr_convert_char(unsigned int c,int table)

This is a wrapper for "_cyr_convert_char" that uses numeric table values.

  char *cyr_convert_dual(char *buffer,unsigned long size,const char *table_src,const char *table_dst)

Converts size bytes of the buffer from charset "table_src" to charset "table_dst". It uses global flags as options that define the conversion behavior for unknown charsets*.

  char *cyr_convert_dualSE(char *buffer,unsigned long size,const char *table_src)

This is a wrapper for "cyr_convert_dual" that uses the saved destination charset ("cyr_dst_encoding")*.

  const char *cyr_convert_dualA(const char *buffer,unsigned long size,const char *table_src,const char *table_dst)

This is a wrapper for "cyr_convert_dual" that allocates a new memory chunk, because some buffers cannot be modified. This is useful when adding cyrillic support to already existing code.

If you don't like "malloc", use the primary functions.

You _MUST_ free the returned pointer after use.
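
The allocate-then-convert pattern behind such a wrapper can be sketched as below. This is not the library's implementation: "sketch_convert_inplace" is a hypothetical stand-in (it merely upper-cases ASCII letters) and "sketch_convert_alloc" is a hypothetical name for the allocating variant.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for an in-place converter: upper-cases ASCII letters.
 * Purely illustrative; a real converter would map through a table. */
char *sketch_convert_inplace(char *buffer, unsigned long size)
{
    unsigned long i;

    assert(buffer != NULL);
    for (i = 0; i < size; i++) {
        if (buffer[i] >= 'a' && buffer[i] <= 'z')
            buffer[i] = (char)(buffer[i] - 'a' + 'A');
    }
    return buffer;
}

/* Allocating variant: leaves the input untouched and converts a copy.
 * Returns NULL if allocation fails. The caller must free() the result. */
char *sketch_convert_alloc(const char *buffer, unsigned long size)
{
    char *copy;

    assert(buffer != NULL);
    copy = malloc(size > 0 ? size : 1); /* avoid a zero-size request */
    if (copy == NULL)
        return NULL;
    memcpy(copy, buffer, size);
    return sketch_convert_inplace(copy, size);
}
```

The sketch returns "char *" for simplicity. The documented function returns "const char *", so passing its result to "free" would need a cast to drop the qualifier.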

  const char *cyr_convert_dualASE(const char *buffer,unsigned long size,const char *table_src)

This is the same as "cyr_convert_dualSE", but it wraps "cyr_convert_dualA".

  const char *cyr_getrfc2047charset(const char *buffer)

Service function that returns a pointer to the symbolic charset name in an rfc2047 encoded string (e.g. =?koi8-r... or "=?koi8-r...) so it can be fed into some of the conversion functions.
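
Locating the charset name in such a string can be sketched as follows. This is an illustration under assumptions, not the library's code: "sketch_rfc2047_charset" is a hypothetical name, and reporting the name length through an extra parameter is a choice made for this sketch only. It relies on the rfc2047 layout "=?charset?encoding?text?=".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns a pointer to the charset name inside an rfc2047 encoded
 * word and stores the name length in *len, or returns NULL if the
 * string does not start with "=?" (optionally after a double quote). */
const char *sketch_rfc2047_charset(const char *s, size_t *len)
{
    const char *end;

    assert(len != NULL);
    *len = 0;
    if (s == NULL)
        return NULL;
    if (*s == '"')               /* tolerate a leading quote */
        s++;
    if (s[0] != '=' || s[1] != '?')
        return NULL;
    s += 2;                      /* skip "=?" */
    end = strchr(s, '?');        /* the charset name ends at the next '?' */
    if (end == NULL || end == s)
        return NULL;
    *len = (size_t)(end - s);
    return s;
}
```

The returned pointer refers into the original string, so nothing has to be freed.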

  unsigned long _cyr_score_stats(const char *table)

Calculates the score for the collected statistics for a specified table.

  int _cyr_detect_encoding()

Detects the charset according to the collected statistics.

Returns the numeric value.

  int _cyr_detect_buffer_encoding(const char *buffer,unsigned long size)

Does the same for size bytes of the buffer. All previously collected statistics are flushed before the data is analyzed. This is basically a wrapper for "_cyr_detect_encoding".

  const char *cyr_detect_encoding()

The same as "_cyr_detect_encoding" except that it returns the symbolic value.

  const char *cyr_detect_buffer_encoding(const char *buffer,unsigned long size)

The same as "_cyr_detect_buffer_encoding" except that it returns the symbolic value.

  void cyr_flush_encoding_stats()

Resets collected statistics.

  void cyr_collect_encoding_stats(const char *buffer,unsigned long size)

Collects statistics data for size bytes of the buffer.
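
The flush / collect / detect cycle can be illustrated with a deliberately simple sketch. The scoring rule here is an assumption and almost certainly differs from the library's: it counts how many collected bytes match six letters that are commonly cited as the most frequent in Russian text, and it only distinguishes koi8-r from windows-1251. All "sketch_" names are hypothetical, and the statistics live in a caller-owned struct rather than in a global buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Per-caller statistics: one counter per byte value. */
struct sketch_stats {
    unsigned long count[256];
};

/* Codes of six frequent lowercase Russian letters in each charset. */
static const unsigned char sketch_koi8r[6] =
    { 0xCF, 0xC5, 0xC1, 0xC9, 0xCE, 0xD4 };
static const unsigned char sketch_win1251[6] =
    { 0xEE, 0xE5, 0xE0, 0xE8, 0xED, 0xF2 };

/* Resets the collected statistics. */
void sketch_flush(struct sketch_stats *st)
{
    assert(st != NULL);
    memset(st, 0, sizeof *st);
}

/* Adds `size` bytes of `buffer` to the statistics. */
void sketch_collect(struct sketch_stats *st, const char *buffer,
                    unsigned long size)
{
    unsigned long i;

    assert(st != NULL && buffer != NULL);
    for (i = 0; i < size; i++)
        st->count[(unsigned char)buffer[i]]++;
}

/* Score: how many collected bytes are "frequent" for this charset. */
unsigned long sketch_score(const struct sketch_stats *st,
                           const unsigned char frequent[6])
{
    unsigned long sum = 0;
    int i;

    assert(st != NULL && frequent != NULL);
    for (i = 0; i < 6; i++)
        sum += st->count[frequent[i]];
    return sum;
}

/* Returns the better-scoring charset name, or NULL on a tie. */
const char *sketch_detect(const struct sketch_stats *st)
{
    unsigned long k = sketch_score(st, sketch_koi8r);
    unsigned long w = sketch_score(st, sketch_win1251);

    if (k == w)
        return NULL;
    return k > w ? "koi8-r" : "windows-1251";
}
```

Statistics accumulate across several collect calls, so a text can be fed in chunks and detected once at the end.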

* - see "BEHAVIOR" section.

More TBD.

BEHAVIOR

This library contains a number of functions for working with Russian text in different charsets. Converting data from one encoding to another is supported, along with detecting the encoding of any block of Russian text.

When converting data from one encoding to another, the current behavior is to convert the data into the "dos" table first and then into the desired table. Later we should reconsider this and speed it up with somewhat more complicated static conversion tables that avoid the double conversion, although the number of tables would then grow dramatically: 6^2 = 36 conversion tables instead of 6*2 = 12, plus a somewhat more complex analysis mode. If you prefer, you can use the "low level" functions instead of the "high level" wrappers to avoid this behavior.
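
The trade-off described above can be sketched as follows. This is an illustration, not the library's code: it assumes 256-entry byte tables (the real table format is not documented here), and every "sketch_" name is hypothetical. It shows the two-pass conversion through a pivot charset, the table counts for both designs, and how two pivot tables could be composed into one direct table.

```c
#include <assert.h>
#include <stddef.h>

/* Pivot design: one table into the pivot and one out of it per charset. */
unsigned int sketch_tables_with_pivot(unsigned int n)
{
    return n * 2;
}

/* Direct design: one table per (source, destination) pair,
 * counted the same way as in the text above. */
unsigned int sketch_tables_direct(unsigned int n)
{
    return n * n;
}

/* Maps `size` bytes of `buffer` through `table`, in place. */
void sketch_map(char *buffer, unsigned long size,
                const unsigned char table[256])
{
    unsigned long i;

    assert(buffer != NULL && table != NULL);
    for (i = 0; i < size; i++)
        buffer[i] = (char)table[(unsigned char)buffer[i]];
}

/* Two passes: source -> pivot, then pivot -> destination. */
char *sketch_convert_dual(char *buffer, unsigned long size,
                          const unsigned char to_pivot[256],
                          const unsigned char from_pivot[256])
{
    sketch_map(buffer, size, to_pivot);
    sketch_map(buffer, size, from_pivot);
    return buffer;
}

/* Builds a direct table so that a single pass gives the same result. */
void sketch_compose(unsigned char direct[256],
                    const unsigned char to_pivot[256],
                    const unsigned char from_pivot[256])
{
    int i;

    assert(direct != NULL && to_pivot != NULL && from_pivot != NULL);
    for (i = 0; i < 256; i++)
        direct[i] = from_pivot[to_pivot[i]];
}
```

Composed tables halve the work per byte, at the cost of storing one table for every pair of charsets.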

The library can collect statistics from a series of Russian text chunks (using "cyr_collect_encoding_stats"). "cyr_collect_encoding_stats" uses the global statistics buffer "_CYR_ENCODING_STATS", so collecting statistics and then detecting the encoding is not thread-safe, even when you use the "high level" functions. However, all "low level" functions and those "high level" functions that don't use globals are thread-safe. Using the unsafe functions with threads may lead to improper data conversion because the global charset flags and options can be overwritten.

For block level processing some global options are used as follows:

  const char *cyr_tmpl_encoding;

This option is used in some software that is not released into the public domain. You may use it if your software (e.g. a web interface) supports templates and you would like to convert them from their original encoding into a user-specified one. This option can be set according to the configuration of your software.

It is not used in the library code.

Undefined by default.

  const char *cyr_mime_encoding;

This option is used in some software that is not released into the public domain. You may use it in e-mail processing when the current mime block has a charset field in its headers. You can set it at the beginning of the mime block body processing and unset it when the processing is done.

When the block charset is unknown this option is used as a last line of defense before defaulting to "cyr_src_encoding". When "cyr_mime_encoding" is not set we use "cyr_src_encoding" instead.

This option is used by "cyr_convert_dual" and dependants.

Undefined by default.

  const char *cyr_src_encoding;

This is the very last line of defense, used for blocks whose charset is either unknown or undetectable. This option can be set according to the configuration of your software.

This option is used by "cyr_convert_dual" and dependants.

Undefined by default.
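
The fallback order described for "cyr_mime_encoding" and "cyr_src_encoding" can be sketched as a simple chain. This is only an illustration of the order stated above: "sketch_fallback_charset" is a hypothetical helper that takes the values as parameters instead of reading the library's globals, and NULL stands for "undefined".

```c
#include <assert.h>
#include <stddef.h>

/* Picks the source charset for a block:
 *   1. the charset already known for the block, if any;
 *   2. otherwise the mime charset, if set;
 *   3. otherwise the configured default source charset (may be NULL). */
const char *sketch_fallback_charset(const char *known,
                                    const char *mime,
                                    const char *src)
{
    if (known != NULL)
        return known;
    if (mime != NULL)
        return mime;
    return src;
}

/* Small usage demonstration of the three cases plus "nothing set". */
void sketch_fallback_demo(void)
{
    const char *known = "koi8-r";
    const char *mime = "windows-1251";
    const char *src = "ibm866";

    assert(sketch_fallback_charset(known, mime, src) == known);
    assert(sketch_fallback_charset(NULL, mime, src) == mime);
    assert(sketch_fallback_charset(NULL, NULL, src) == src);
    assert(sketch_fallback_charset(NULL, NULL, NULL) == NULL);
}
```

When every input is undefined the result is NULL, and a caller would have to treat the block as unconvertible.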

  const char *cyr_dst_encoding;

This option is used in some software that is not released into the public domain. You may use it to remember the "user defined" output charset. This option can be set according to the configuration of your software.

This option is used by "*SE" functions and dependants.

Undefined by default.

  const char *cyr_det_encoding;

This is an option that can be set according to the configuration of your software.

When this option is set to "_CYR_DET_ENCODING_AUTO", "cyr_convert_dual" first tries to guess the source encoding of the block. Next, if the option is set to "_CYR_DET_ENCODING_SOFT" and the source charset is unknown, another guess is attempted. Finally, if the source charset is still unknown and either "cyr_mime_encoding" or "cyr_src_encoding" is set, the source charset is taken from "cyr_mime_encoding" or "cyr_src_encoding" (see above).

Default is "_CYR_DET_ENCODING_AUTO".

TODO

Look into the comments at the top of cyrillic.c :-]


(C) 2001-2003 Pavel Novikov (pavel at ext.by)