
Unicode Handling

How to handle Unicode - or what happens when somebody uses a non-US-ASCII character

Problem and specification


Introduction

The current character handling is broken. The API allows setting characters that cannot be read back from a saved XML file without causing an error, which makes the project unopenable.

There are several fix variants. Common to all of them is that all String attributes should be unicode strings, exported as such, and coerced to unicode on the way in where necessary. XML implementations should write files with a unicode-capable encoding. We want UTF-8, which reproduces 7-bit US-ASCII characters as themselves, can handle any unicode value, and is reasonably compact for European scripts. This will remove all incompatibilities and string conversion errors inside the data model.
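As an illustration, here is a minimal sketch of the intended behaviour in Python 2. The helper names coerceToUnicode and writeXmlValue are hypothetical, not existing CCPN API:

    # Sketch only: hypothetical helpers, assuming Python 2 and UTF-8 files.
    import codecs

    def coerceToUnicode(value):
        """Coerce an incoming String attribute value to unicode."""
        if isinstance(value, unicode):
            return value
        # Assume incoming byte strings are UTF-8; invalid bytes then raise
        # UnicodeDecodeError at the point of entry, not at save time.
        return value.decode('utf-8')

    def writeXmlValue(path, text):
        """Write an attribute value to an XML file encoded as UTF-8."""
        stream = codecs.open(path, 'w', encoding='utf-8')
        try:
            stream.write(u'<?xml version="1.0" encoding="UTF-8"?>\n')
            stream.write(u'<value>%s</value>\n' % text)
        finally:
            stream.close()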

NB I do not know the issues for other languages, or for database implementations. Please add your information if you have it.


We would still need to decide which characters to allow in which attributes. We would also need to control conversion to and from unicode for applications that need classic strings; this includes keyboard input, printing, and file I/O other than to the CCPN storage.

Strings limited to 7-bit US-ASCII are safe in all possible contexts. Unicode is safe once inside the data model, but can cause problems if taken out and used to interact with non-unicode strings or applications. Note that any data that contain non-ASCII characters can break some applications on some systems; no matter what we do, there is no way we can print or convert Chinese characters in an environment that does not support them.


Approach

Which characters to allow can be handled by appropriate checks in the data type. This could be controlled by set-up switches if desirable (though I would not recommend that).

We must bear in mind that the data model is supposed to be an exchange standard. Scientifically important strings should be in a character set that is guaranteed to work anywhere. That means 7-bit US-ASCII. There is no point in allowing chain codes in ancient Japanese if neither the PDB nor any application can handle them.

Non-ASCII characters make sense in two contexts: URLs, file names, etc. that refer directly to an outside non-ASCII value, and values that serve as user-readable annotation, such as experiment names, details, annotation attributes, etc. So we need two kinds of String type: ASCII-only and full unicode. Both would be stored as unicode; the issue is which unicode values are allowed.

Conversion between unicode and string is handled either explicitly or through the Python default encoding. The latter can be set at Python startup (only). There are three sensible settings:

  • Encoding same as system encoding. That means that any character that can be typed, copy/pasted or printed can be entered without special escape sequences and behaves the way a user would expect.
  • ASCII only. That guarantees that any additions you make will be readable by everybody, but may break on projects created with other settings. It can still work on projects containing non-ASCII characters, provided they are not used in a way that causes a conversion error. We should use this encoding when making reference data, to prevent compatibility problems, perhaps in addition to making sure that reference data packages contain only ASCII string attributes.
  • Some kind of 'escaped ASCII'. This must represent any unicode character losslessly in ASCII-only characters, and can be reasonably readable for Latin-like scripts. It allows you to deal with projects that contain characters your system encoding cannot handle, even if rather clumsily. Since the escaped strings are longer than the originals, you cannot rely on the string type length limits here. This encoding would be used specifically to handle projects that break the other encodings. I have not quite figured out the right way to make that work, but str(repr(<unicode-string>)) will do the job if nothing else; see the sketch just after this list.
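For the third option, the standard unicode_escape codec gives one possible lossless ASCII representation. A sketch, not a decided mechanism (Python 2):

    # Lossless ASCII escaping of arbitrary unicode values.
    s = u'caf\xe9'                          # u'café'

    escaped = s.encode('unicode_escape')    # 'caf\\xe9' -- plain ASCII bytes
    restored = escaped.decode('unicode_escape')
    assert restored == s

    # The cruder fallback mentioned above also works, at the cost of the
    # surrounding quotes and u prefix:
    assert str(repr(s)) == "u'caf\\xe9'"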


Technical details

System encoding

stdin, stdout, and stderr carry encodings that are used for keyboard entry, printing, and other things. They are set from the default filesystem encoding at Python startup time and can be controlled through the PYTHONIOENCODING environment variable (from Python 2.6 onwards). This is the encoding referred to as 'system encoding' above; it is the behaviour the user sees.
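For illustration, the stream encodings can be inspected at run time (Python 2.6+):

    # Show the encodings attached to the standard streams.
    import sys

    print sys.stdin.encoding     # e.g. 'UTF-8' on a UTF-8 terminal,
    print sys.stdout.encoding    # None when the stream is not a terminal
    print sys.stderr.encoding

    # Set before startup to override, e.g. in a Bourne-style shell:
    #   export PYTHONIOENCODING=utf-8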

String/unicode conversion

You need an encoding to convert between unicode and byte strings, whether for comparing or adding values, or for writing or reading. The Python default encoding is used when no explicit encoding is given; it is set to 'ascii' by default. It is possible to change it using the sitecustomize.py file. This file is read only when Python starts up, and must be on the Python path at that point. We can guarantee this (except for possible crazy cases) by putting it in the ccpn/python directory. We should be able to set it from an environment variable. This is the encoding that must be changed to get the three alternative cases given above.
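A minimal sitecustomize.py along these lines might look as follows; the environment variable name CCPN_DEFAULT_ENCODING is a placeholder, not an agreed name:

    # sitecustomize.py -- must be on the Python path at interpreter startup.
    # Python 2 only: site.py deletes sys.setdefaultencoding after startup,
    # so this file is the one legitimate place to call it.
    import os
    import sys

    sys.setdefaultencoding(os.environ.get('CCPN_DEFAULT_ENCODING', 'ascii'))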

Specific encoding

We can and should use a specific encoding whenever the situation does not depend on the system setting. Internally we should simply use unicode whenever possible. Python source code files should be ASCII-only (the default) to prevent unportable characters from creeping in; files can be given their own encoding declaration (in the first two lines). By declaring all CCPN-exported XML files as UTF-8 we can avoid conversion problems there. Applications that require a specific encoding could be dealt with case by case; possibly we could use an ASCII-with-escapes encoding in some cases, as sketched below.
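As a sketch of that last point, the standard 'backslashreplace' error handler gives an ASCII-with-escapes output encoding without any custom codec (Python 2):

    # ASCII output with escapes, for an application that only accepts ASCII.
    s = u'M\xfcller'                               # u'Müller'
    print s.encode('ascii', 'backslashreplace')    # prints: M\xfcller
    # Note: one-way only -- existing backslashes are not escaped, so this
    # is for output, not for lossless round-tripping.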

Data Types

Most String attributes should be of data types that allow only ASCII; this can be done by allowing only unicode characters in the range 0-127. Data that need unicode values should have alternative data types without this constraint.
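A minimal sketch of the two validation rules, assuming attribute values are already unicode (Python 2; the function names are illustrative only):

    def isValidAsciiString(value):
        """ASCII-only data types: every character in the range 0-127."""
        return isinstance(value, unicode) and all(ord(ch) < 128 for ch in value)

    def isValidUnicodeString(value):
        """Unconstrained data types: any unicode value is allowed."""
        return isinstance(value, unicode)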