
Unicode Fix

How to solve the unicode / strange character problem.

Problem

  1. In the current code you can enter non-ASCII values (e.g. £) in string attributes. These are stored normally, but cause an exception when you try to reload the files, making the project unopenable.
  2. It is impossible to use non-ASCII characters in cases where they are necessary, like file paths, or desirable, like comments and annotations. Chinese customers might appreciate a change here.
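
The reload failure in item 1 can be reproduced in miniature. The sketch below is an illustration only, written in Python 3 syntax, and is not the actual loader code. It assumes that the value reached the file as Latin-1 bytes and that reloading decodes it with the ASCII codec, which is what an ASCII default encoding amounts to. The function name reload_as_ascii is hypothetical.

```python
# Illustrative only: the real load path is not shown in this note.
# Assumption: the attribute value was written to disk as Latin-1 bytes.
stored = "\u00a3100".encode("latin-1")   # the text '£100' as raw bytes

def reload_as_ascii(raw):
    """Decode raw bytes the way an ASCII-only default encoding would."""
    return raw.decode("ascii")

try:
    reload_as_ascii(stored)
    failed = False
except UnicodeDecodeError as exc:
    # The byte 0xa3 is outside the 7-bit range, so decoding aborts here.
    failed = True
    reason = str(exc)
```

Pure ASCII values pass through reload_as_ascii unchanged, which is why the problem only shows up once somebody types a character such as £.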

 

Actions

  • All strings are stored inside the data model as unicode strings, and exported as such.
  • Standard string-type attributes are coerced to unicode on setting, using the default encoding.
  • GUI, applications etc. are encouraged to use unicode.
  • XML files are saved and loaded as UTF-8. This involves no conversion problems, since UTF-8 can represent every unicode character.
  • Conversion to classic string is done using the default encoding.
  • The default encoding is set to be the same as the file system default encoding, using the sitecustomize.py customization hook. The default encoding will vary from system to system, but will always allow your 'current, normal' character set to be used without problems. All system default encodings support 7-bit US ASCII.
    • This will change default string behavior for any python application that runs when the top CCPN directory is on the Python path.
  • We provide a special 'compatibility' IO mode, where the characters in a unicode string that are incompatible with the current encoding are converted to unicode escape sequences on load, and are converted back again on save. This mode can alter data:
    1. If the attribute has a maximum length and the length of the escaped string exceeds it.
    2. If a string contains a sequence that happens to match a unicode escape sequence.
  • All strings, except those given below, are constrained to accept only characters representable in 7-bit US ASCII.
    Exceptions are:
    • All fields named 'details' are changed to infinite length and accept any character.
    • All path expressions, urls, and file names (type PathString) are infinite length and accept any character
      • ccpnmr.Analysis.Macro.path is changed to type PathString
    • String values in keyword:value systems are infinite length and allow any character. This comprises
      • Template.MultitypeValue.textValue (used in NmrCalc, Validation, ...)
      • Implementation.AppDataString.value
      • Note that Implementation.StringMatrix.data is not changed and remains limited to ASCII.
    • Implementation.Url.host, changed to infinite length and accept any character
    • Implementation.Url.password, changed to infinite length and accept any character
    • The following would be useful for e.g. Chinese users. As they cannot be changed to infinite length, you would risk losing information when running in compatibility mode:
      PLEASE COMMENT:
      • Peak.annotation
      • PeakDim.annotation
      • Resonance.name
    • Can you see any other attributes that should allow full unicode?
      PLEASE COMMENT
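
The 'compatibility' IO mode described above can be sketched roughly as follows. This is a minimal illustration in Python 3 syntax, not the CCPN implementation. The function names escape_on_load and unescape_on_save are hypothetical, and the escape format (\xNN, \uNNNN, \UNNNNNNNN) is simply what Python's 'backslashreplace' error handler produces; the note does not say which format the real code uses.

```python
import re

# Matches the escapes produced by the 'backslashreplace' error handler.
_ESCAPE = re.compile(
    r"\\U([0-9a-fA-F]{8})|\\u([0-9a-fA-F]{4})|\\x([0-9a-fA-F]{2})"
)

def escape_on_load(text, encoding):
    """Replace only the characters that `encoding` cannot represent."""
    return text.encode(encoding, "backslashreplace").decode(encoding)

def unescape_on_save(text):
    """Turn escape sequences back into the characters they stand for."""
    def restore(match):
        digits = match.group(1) or match.group(2) or match.group(3)
        return chr(int(digits, 16))
    return _ESCAPE.sub(restore, text)

# Normal case: the string survives a load/save round trip.
original = "Peak near \u4e2d costs \u00a35"
escaped = escape_on_load(original, "ascii")
round_trip = unescape_on_save(escaped)

# Alteration 1: the escaped form is longer, so it may exceed a maximum length.
grew_by = len(escaped) - len(original)

# Alteration 2: text that merely looks like an escape is changed on save.
literal = "see \\u00e9 in the manual"
altered = unescape_on_save(literal)
```

In this sketch each escaped character grows to 4, 6 or 10 characters, which is why finite-length attributes such as Peak.annotation are at risk, and any backslash sequence already present in the data is indistinguishable from an escape written by the loader.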

 

Comments

  • This should work faultlessly in the following cases
    • All data in a project are in 7-bit ASCII.
    • The system default encoding allows all unicode. This should be the case for sufficiently new versions of:
      • OSX
      • Windows XP and upwards
      • SuSE 9.1 and up, Red Hat 8.0 and up, including the Fedora Core series.
    • The encoding is the same as the computer where the project was written.
  • Using compatibility mode should work faultlessly except for
    1. Finite length strings that get truncated where the escape string is too long
    2. Unicode escapes that appear in text by coincidence
  • Even unicode strings that are not representable in the local encoding should mostly pass through without problems. Problems arise only if such a string is both 1) accessed and 2) passed to a command that forces conversion from unicode to string.
  • We can increase the number of unicode-enabled strings later, but we must be sure that they are never needed by unicode-incapable applications (NB BMRB and PDB are probably unicode-incapable)
  • Changing our string type to Unicode should in theory be unproblematic as long as we stick to 7-bit US ASCII. We may still run into problems when passing unicode to integrated applications.
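
Whether a given machine falls into one of the 'faultless' cases depends on what its default encoding can represent. The sketch below shows one way this could be checked. It assumes the customization hook takes the encoding from sys.getfilesystemencoding() and, on Python 2, installs it with sys.setdefaultencoding, which is only available while site initialisation is running. The function names install_default_encoding and can_represent are hypothetical, and the code is written so that it also runs on Python 3, where the installation step is a no-op.

```python
import sys

def install_default_encoding():
    """What a sitecustomize.py hook might do (assumed, not the CCPN code)."""
    encoding = sys.getfilesystemencoding() or "ascii"
    # sys.setdefaultencoding exists only in Python 2 during site start-up.
    setter = getattr(sys, "setdefaultencoding", None)
    if setter is not None:
        setter(encoding)
    return encoding

def can_represent(text, encoding):
    """True if every character of `text` survives conversion to `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

sample = "\u00a3 \u4e2d"   # a pound sign and a Chinese character
report = {name: can_represent(sample, name)
          for name in ("ascii", "latin-1", "utf-8")}

# Result for the machine this runs on; it varies from system to system.
local_ok = can_represent(sample, install_default_encoding())
```

Strings for which can_represent returns False in the local encoding are the ones that would need escaping in compatibility mode, or that would fail when an application forces conversion from unicode to a classic string.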