
Organisation of chemComp data

Describes the organisation of the chemComp reference data in CCPN, and issues related to this. Wim Vranken 2006

The chemComp data in CCPN is organised on two levels: the ChemComp level, where all possible data related to the ChemComp is stored. This data does not necessarily make chemical sense, and describes the superset of information for the chemComp. The actual physical representation of the ChemComp appears on the ChemCompVar level, where variants of the ChemComp are described in full chemical detail.

This reference data fits into the further organisation in Molecules and Chains in MolSystems. Click here for an overview.

For a full model description of the ChemComp package, click here. Note that this description may lag behind the current data model - for the most recent version, download and install the latest CCPN release (or work off the CVS repository).

The core information in the CCPN chemComps was taken from the MSD database; NMR-specific information, variant information and atom naming system information were added later. The CCPN ChemComp data is only curated up to a certain point - it is usable as reference information, but definitely not perfect.

In the description below, bear in mind that the atoms are objects in the CCPN description: they derive their real meaning not from their name (which is just a key), but from all the other objects they are linked to (e.g. bonds, chemical element, stereochemistry, ...).


ChemComp level

The ChemComps are organised by molecular type (molType) and the internal CCPN code (ccpCode). Examples of this dual key are:

 ('protein','ASP'), ('carbohydrate','Glc'), ('RNA','C'), ('DNA','C'), ('other','ATP')

The ChemComp objects contain information like the name, code3Letter, msdCode, cifCode, code1Letter, commonNames, ... .
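
As a rough illustration of the dual key (plain Python, not the CCPN API; the `CHEM_COMPS` registry and the `find_chem_comp` helper are hypothetical names, and the stored values are placeholders), a lookup on (molType, ccpCode) keeps the RNA and DNA 'C' entries apart:

```python
# Hypothetical registry keyed by (molType, ccpCode); not the real CCPN API.
# The keys are the examples from the text, the stored dicts are placeholders.
KEYS = [
    ("protein", "ASP"),
    ("carbohydrate", "Glc"),
    ("RNA", "C"),
    ("DNA", "C"),
    ("other", "ATP"),
]
CHEM_COMPS = {key: {"molType": key[0], "ccpCode": key[1]} for key in KEYS}


def find_chem_comp(mol_type, ccp_code):
    """Return the entry for the dual key, or None if there is none."""
    return CHEM_COMPS.get((mol_type, ccp_code))


rna_c = find_chem_comp("RNA", "C")
dna_c = find_chem_comp("DNA", "C")
print(rna_c["molType"], dna_c["molType"])  # same ccpCode, different molType
print(find_chem_comp("protein", "C"))      # no such entry in this registry
```

The ccpCode alone is therefore not enough to identify a ChemComp; the molType is always part of the key.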

The ChemComp level also describes the superset of chemical information for a particular chemical compound. For example, an aspartic acid amino acid contains the following atoms:

H1, H2, H3, H, N, CA, HA, C, O, O', O'', H'', CB, HB2, HB3, CG, OD1, OD2, HD2

Each of these 'ChemAtom' objects also contains information on, for example, its chirality, its elementSymbol and whether it is waterExchangeable. The stereochemistry of any arbitrary set of chemAtoms can also be defined (e.g. trigonal-bipyramidal, ...), although this is currently not populated.

All the bonds between these atoms are then also fully described, with their bond type and bond stereochemistry (if any):

H1-N, H2-N, H3-N, H-N, N-CA, CA-HA, CA-C, C-O, C-O', C-O'', O''-H'', CA-CB, CB-HB2, CB-HB3, CB-CG, CG-OD1, CG-OD2, OD2-HD2

Bond angles and torsion angles are also described on this level.

In addition to normal atoms there are 'linkAtoms' that describe the atoms in the preceding/following residue for linear polymers, so that bonds/angles/torsion angles related to these atoms can be described as well. For an amino acid, these are:

prev_1, prev_2, next_1

So the bonds:

prev_1-prev_2, N-prev_1, C-next_1

are also described.

As such, this superset of information is meaningless in a real physical sense (e.g. the N atom has 5 bonds to ordinary atoms, plus one to the prev_1 linkAtom). Also, no real physical information is stored about, for example, atom coordinates, bond lengths, torsion angles, ... . This is dealt with in other packages that refer back to this one.

ChemCompVar level

This level describes the actual physical forms in which a chemical compound can exist. The only requirement is to select the correct subset of atoms from the superset at the ChemComp level, and all bonds/angles/torsion angles relevant for this ChemCompVar can instantly be derived from the atom subset.


Chemical description

To continue with aspartic acid: if it is at the start of the protein chain, with the N-terminus protonated and the sidechain deprotonated, the following subset of atoms and linkAtoms is picked:

H1, H2, H3, N, CA, HA, C, O, next_1, CB, HB2, HB3, CG, OD1, OD2

This then means that only the following bonds are relevant for the chemCompVar:

H1-N, H2-N, H3-N, N-CA, CA-HA, CA-C, C-O, C-next_1, CA-CB, CB-HB2, CB-HB3, CB-CG, CG-OD1, CG-OD2

The molecular weight of a chemCompVar can be instantly derived from the subset of atoms it contains.
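
The derivation described above can be sketched in plain Python. This is only an illustration and not the CCPN API: the names `ELEMENT`, `BONDS`, `bonds_for_variant` and `molecular_weight` are hypothetical, the atom and bond lists are copied from the aspartic acid example, and the atomic weights are rounded standard values.

```python
# Aspartic acid superset from the text; plain Python, not the CCPN API.
ELEMENT = {  # ChemAtom name -> elementSymbol (linkAtoms have no entry here)
    "H1": "H", "H2": "H", "H3": "H", "H": "H", "N": "N", "CA": "C",
    "HA": "H", "C": "C", "O": "O", "O'": "O", "O''": "O", "H''": "H",
    "CB": "C", "HB2": "H", "HB3": "H", "CG": "C", "OD1": "O", "OD2": "O",
    "HD2": "H",
}
BOND_TEXT = (
    "H1-N H2-N H3-N H-N N-CA CA-HA CA-C C-O C-O' C-O'' O''-H'' CA-CB "
    "CB-HB2 CB-HB3 CB-CG CG-OD1 CG-OD2 OD2-HD2 "
    "prev_1-prev_2 N-prev_1 C-next_1"
)
BONDS = [tuple(pair.split("-")) for pair in BOND_TEXT.split()]
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def bonds_for_variant(atom_names):
    """Keep only the superset bonds whose two atoms are both in the subset."""
    chosen = set(atom_names)
    return [(a, b) for a, b in BONDS if a in chosen and b in chosen]


def molecular_weight(atom_names):
    """Sum atomic weights over real atoms; linkAtoms contribute nothing."""
    return sum(ATOMIC_WEIGHT[ELEMENT[n]] for n in atom_names if n in ELEMENT)


# N-terminal variant: protonated N-terminus, deprotonated sidechain.
START_ASP = "H1 H2 H3 N CA HA C O next_1 CB HB2 HB3 CG OD1 OD2".split()
var_bonds = bonds_for_variant(START_ASP)
print(len(var_bonds))  # 14 bonds, the same set as listed above
print(round(molecular_weight(START_ASP), 3))
```

Nothing variant-specific is stored apart from the atom subset; the bond list and the weight are both recomputed from it.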


Linking and descriptor

The 'state' of a ChemCompVar is contained in the 'linking' and 'descriptor' attributes. For this particular aspartic acid variant these are:

linking: 'start', descriptor: 'prot:H3;deprot:HD2'

The linking states for a linear polymer can be 'none', 'start', 'middle' and 'end' (corresponding, respectively, to the '_LFOH', '_LSN3', '_LL' and '_LEO2' suffixes in the MSD database).

The descriptor contains protonation state information, but can also indicate other bonds formed by the ChemCompVar. Cysteine, for example, can have the following linking/descriptor state:

linking: 'middle', descriptor: 'link:SG'

It also contains the following linkAtom:


which generically describes the atom bound to the SG atom.

Note that such 'link:' information is placed in the descriptor only for linear polymers - for non-linear molecules it is held directly in the linking attribute. For example, a glucose carbohydrate can have the following linking/descriptor state:

linking: 'link:C1,O3,O4', descriptor: 'neutral'
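
The example strings above suggest a simple structure: items separated by ';', a keyword and its atom names separated by ':', and atom names separated by ','. A minimal parsing sketch follows, assuming exactly that syntax (it is inferred from the examples only; `parse_state_string` is a hypothetical helper, not a CCPN function, and storing bare words such as 'neutral' under a 'flags' key is a choice made here):

```python
def parse_state_string(text):
    """Split a linking/descriptor string into {keyword: [atom names]}.

    Assumed syntax, inferred from the examples in the text only:
    items separated by ';', keyword and names by ':', names by ','.
    Bare words such as 'neutral' or 'start' are collected under 'flags'.
    """
    result = {}
    for part in text.split(";"):
        if not part:
            continue  # tolerate a trailing ';'
        if ":" not in part:
            result.setdefault("flags", []).append(part)
            continue
        keyword, names = part.split(":", 1)
        result.setdefault(keyword, []).extend(names.split(","))
    return result


print(parse_state_string("prot:H3;deprot:HD2"))
print(parse_state_string("link:SG"))
print(parse_state_string("link:C1,O3,O4"))
print(parse_state_string("neutral"))
```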

Generally speaking, we try to restrict the superset of information for a particular chemComp to a defined heavy atom frame, with only protons and/or 'easy' leaving groups (like OH in the case of the amino acid main chain) allowed to be removed from the superset of atoms. Heavy atom frames with different stereochemistry (D-Ala and Ala, or alpha-Glc and beta-Glc) are stored in different chemComps.


Atom subtypes

There is a problem with the above description on a chemical level. In the aspartic acid case again, the CG-OD1 and CG-OD2 bonds have, in principle, a variable bond type depending on the protonation state of the OD2, so using the same OD1 and OD2 atoms for the deprotonated and protonated state is not correct. This problem was solved by introducing the atom 'subtype'. In reality, the 'key' for an atom is the combination of its name and its subtype, so for the carboxy atoms in aspartic acid these would be:

(CG, 1), (OD1, 1), (OD1, 2), (OD2, 1), (OD2, 2), (HD2, 1)

This gives the following atoms and bonds, depending on the protonation state:

deprot:HD2    atoms (CG, 1) (OD1, 2) (OD2, 2)
              bond (CG, 1)-(OD1, 2) singleplanar
              bond (CG, 1)-(OD2, 2) singleplanar
prot:HD2      atoms (CG, 1) (OD1, 1) (OD2, 1) (HD2, 1)
              bond (CG, 1)-(OD1, 1) double
              bond (CG, 1)-(OD2, 1) single
              bond (OD2, 1)-(HD2, 1) single
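
The effect of the (name, subtype) key can be sketched as follows. This is plain Python and not the CCPN API; `CARBOXYL`, `bond_type` and `atoms_in` are hypothetical names. The HD2 proton is attached to OD2 here, as in the OD2-HD2 bond listed at the ChemComp level.

```python
# Bonds of the aspartic acid carboxyl group, keyed by (name, subtype) pairs.
CARBOXYL = {
    "deprot:HD2": [
        (("CG", 1), ("OD1", 2), "singleplanar"),
        (("CG", 1), ("OD2", 2), "singleplanar"),
    ],
    "prot:HD2": [
        (("CG", 1), ("OD1", 1), "double"),
        (("CG", 1), ("OD2", 1), "single"),
        # The HD2 proton sits on OD2, as in the ChemComp-level bond list.
        (("OD2", 1), ("HD2", 1), "single"),
    ],
}


def bond_type(state, name_a, name_b):
    """Find a bond by atom names alone; the subtypes in use fix its type."""
    for (a, _sub_a), (b, _sub_b), kind in CARBOXYL[state]:
        if {a, b} == {name_a, name_b}:
            return kind
    return None


def atoms_in(state):
    """Collect the (name, subtype) keys used by the bonds of one state."""
    return sorted({atom for a, b, _kind in CARBOXYL[state] for atom in (a, b)})


print(bond_type("deprot:HD2", "CG", "OD1"))  # singleplanar
print(bond_type("prot:HD2", "CG", "OD1"))    # double
print(atoms_in("deprot:HD2"))
```

The atom names stay the same in both states, which is what users and naming systems see; only the subtype changes which bond description applies.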

This system is also used in cases where there can be ambiguity about atom names or the definition of atom sets (see below). For the HH11 and HH12 atoms for arginine, for example, there is an HH1* 'atomSet' in the prot:HH12 case, and no 'atomSet' in the deprot:HH12 case. The HH11 atom therefore has (HH11, 1) and (HH11, 2) subtypes. Another option in this case would be to use a different atom name (e.g. HH1) for the deprotonated variant.

Note that not all of the subtype information is contained within the current set of chemComps - this will be fully introduced during the next chemComp phase (end of 2006).


Atom sets

Another set of information provided with the CCPN chemComps is the atomSets. These describe groups of atoms that are handled together, for example because they are equivalent or prochiral. A leucine, for example, has the following ChemAtomSet objects:

HB*   linked to atoms HB2, HB3              isProchiral
HD1*  linked to atoms HD11, HD12, HD13      isEquivalent
HD2*  linked to atoms HD21, HD22, HD23      isEquivalent
HD*   linked to atom sets HD1*, HD2*        isProchiral
H*    linked to atoms H1, H2, H3            isEquivalent

This information can then be used for NMR assignments, NMR constraint handling, ... .
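
Because an atomSet can contain other atomSets (HD* contains HD1* and HD2*), resolving a set to its atoms is recursive. A minimal sketch in plain Python, not the CCPN API (`ATOM_SETS` and `expand` are hypothetical names, with the leucine data taken from the list above):

```python
# Leucine ChemAtomSets: name -> (members, kind). Members may be atoms or sets.
ATOM_SETS = {
    "HB*": (["HB2", "HB3"], "isProchiral"),
    "HD1*": (["HD11", "HD12", "HD13"], "isEquivalent"),
    "HD2*": (["HD21", "HD22", "HD23"], "isEquivalent"),
    "HD*": (["HD1*", "HD2*"], "isProchiral"),
    "H*": (["H1", "H2", "H3"], "isEquivalent"),
}


def expand(name):
    """Return the atom names behind a set name; a plain atom expands to itself."""
    if name not in ATOM_SETS:
        return [name]
    members, _kind = ATOM_SETS[name]
    atoms = []
    for member in members:
        atoms.extend(expand(member))
    return atoms


print(expand("HD*"))  # the six methyl protons of both HD groups
print(expand("HB2"))  # ['HB2']
```

An NMR assignment to 'HD*' can in this way be mapped onto all six methyl protons, while the isProchiral flag records that HD1* and HD2* are not interchangeable in general.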


Naming systems

Both the ChemAtoms and ChemAtomSets are linked to 'SysNames' objects. These describe the atom names used for the particular atom(Set) in a particular naming system (e.g. 'PDB', 'MSD', 'XPLOR', ...).


What happens on the molecule level?

For an overview of the ChemComp/Molecule/MolSystem organisation discussed below, click here.

Molecules themselves are defined as a set of ChemComps with explicit links. The Molecule does describe the linking and descriptor for every 'molResidue', but the Residues that occur in a Chain in a MolSystem (and any coordinates corresponding to a molecule) can have descriptors that differ from those in the Molecule. This means that you can describe a sequence, derive a formula molecular weight, and still use the same sequence even if the actual protonation state in some particular context turns out to be different.
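
The separation between the descriptor in the Molecule and the descriptor of a Residue in a Chain can be pictured with two small classes. This is a sketch under stated assumptions, not the CCPN data model: the `MolResidue` and `Residue` classes, the `make_residue` helper and the 'prot:H3;prot:HD2' string are illustrative stand-ins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MolResidue:
    """Residue template inside a Molecule (hypothetical stand-in class)."""
    ccp_code: str
    linking: str
    descriptor: str


@dataclass
class Residue:
    """Residue inside a Chain; keeps its own descriptor."""
    mol_residue: MolResidue
    descriptor: str  # free to differ from mol_residue.descriptor


def make_residue(mol_residue, descriptor: Optional[str] = None):
    """Create a Residue, defaulting to the descriptor of the template."""
    if descriptor is None:
        descriptor = mol_residue.descriptor
    return Residue(mol_residue, descriptor)


template = MolResidue("ASP", "start", "prot:H3;deprot:HD2")
as_sequenced = make_residue(template)
# Illustrative descriptor for a context where the sidechain is protonated.
in_context = make_residue(template, "prot:H3;prot:HD2")

print(as_sequenced.descriptor == template.descriptor)  # True
print(in_context.descriptor == template.descriptor)    # False
print(in_context.mol_residue is template)              # True
```

Both residues still point at the same template, so the sequence definition is shared while the protonation state is recorded per context.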