FormatExchange, July 6, 2011

More proposals for FormatConverter

Program flow

On reading, the program should first determine the residue codes and map them to the official ccpCodes. For each ccpCode it should determine the atom names and map them to the official atom names (this could help determine the types as well). It should then determine the sequence. If a sequence has already been read in, it should map the two sequences to each other. The program should then perform LinkResonances automatically from these data. Only at this point should the data be validated and displayed. People can then edit the general mappings (as discussed here) and re-run LinkResonances. This will reset manual assignment changes to shifts and peaks, so we should probably only re-map the records that have actually changed. Finally they can edit individual shifts and peaks manually.
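The idea of re-mapping only the records whose mapping has actually changed could be sketched roughly as follows. This is a minimal illustration, not existing FormatConverter code: the record layout (dicts with a 'resCode' key), the function names, and the example codes are all assumptions made for the example.

```python
def changed_keys(old_map, new_map):
    """Return the input codes whose mapped value differs between the two maps."""
    keys = set(old_map) | set(new_map)
    return {key for key in keys if old_map.get(key) != new_map.get(key)}


def records_to_relink(records, old_map, new_map):
    """Select only the records whose residue code mapping was edited.

    `records` is assumed to be a list of dicts with a 'resCode' key holding
    the code as read from the file (hypothetical layout).
    """
    changed = changed_keys(old_map, new_map)
    return [record for record in records if record["resCode"] in changed]


# Hypothetical edit: the user maps a previously unmapped code 'AL1' to 'Ala'.
old_map = {"ALA": "Ala", "VAL": "Val"}
new_map = {"ALA": "Ala", "VAL": "Val", "AL1": "Ala"}
records = [
    {"id": 1, "resCode": "ALA"},
    {"id": 2, "resCode": "AL1"},
    {"id": 3, "resCode": "VAL"},
]
affected = records_to_relink(records, old_map, new_map)
```

Only record 2 is selected here, so manual changes made to records 1 and 3 would be left alone when LinkResonances is re-run.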

 

LinkResonances

Data Structures


The internal data structures and the storage in the XML are independent of the user interface (and of each other). For the best results they should reflect the built-in constraints on our data. That reduces the need for editing and the scope for mistakes. We do not need to be as complex as the CCPN data model, which was designed for a different situation.

I think we should build our thinking here around the Assignment. An 'Assignment' is a combination of chainCode, resNum, resCode, atomCode, and seqOffset. For most cases we can ignore the seqOffset part in the user interface. I think we have the following constraints on the data:

  • A given resNum can correspond to only one resCode
  • Each LinkResonance corresponds to one Assignment. All shifts, peakDims etc. with the same Assignment - and only those - are linked to the LinkResonance.
  • An Assignment that matches with one or more atoms (stereospecifically, ambiguously, or whatever) is linked to those atoms.
  • Assignments that do not match one or more atoms cannot be linked to atoms, but can be linked to shifts and peaks.
  • Assignments with the same chainCode and resNum (and only those) belong to the same spin system; if they match an existing residue, the spin system can be linked to that residue.
  • In each chemshiftlist there can be at most one shift with a given Assignment.
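Some of these constraints can be checked mechanically. The sketch below is only an illustration under stated assumptions: the field names follow the Assignment definition above, but the class itself, the helper function names, and the choice to key residues on (chainCode, resNum) while ignoring seqOffset are ours, not part of any existing code.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    """Hypothetical record; frozen so it can be used as a dictionary key."""
    chainCode: str
    resNum: int
    resCode: str
    atomCode: str
    seqOffset: int = 0


def conflicting_residues(assignments):
    """Return (chainCode, resNum) keys that carry more than one resCode."""
    codes = defaultdict(set)
    for a in assignments:
        codes[(a.chainCode, a.resNum)].add(a.resCode)
    return sorted(key for key, found in codes.items() if len(found) > 1)


def spin_systems(assignments):
    """Group Assignments by (chainCode, resNum), one group per spin system."""
    groups = defaultdict(list)
    for a in assignments:
        groups[(a.chainCode, a.resNum)].append(a)
    return dict(groups)


def duplicate_shifts(shift_list):
    """Return Assignments that occur more than once in one shift list.

    `shift_list` is assumed to be an iterable of (Assignment, value) pairs.
    """
    counts = Counter(assignment for assignment, _ in shift_list)
    return [assignment for assignment, n in counts.items() if n > 1]


ala_h = Assignment("A", 10, "Ala", "H")
ala_ha = Assignment("A", 10, "Ala", "HA")
val_h = Assignment("A", 11, "Val", "H")
leu_ha = Assignment("A", 11, "Leu", "HA")  # conflicts with val_h on resCode
all_assignments = [ala_h, ala_ha, val_h, leu_ha]
```

With this input, `conflicting_residues(all_assignments)` reports residue A 11, and `spin_systems(all_assignments)` yields two groups.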

It is possible to represent data with these constraints without any explicit LinkResonance record; you just say that each Assignment corresponds to a future Resonance and put additional data in the shifts. With more than one chemshiftlist it is probably not the best way to do things, though.

Instead we should do as follows, I think - a draft of the equivalent XML can be found here.

We have a LinkResonance, which contains the Assignment, the nucleus, and the resOffset.

Shifts and peakDims have a LinkResonance ID that connects them to the LinkResonance.

Shifts and peakDims store the original Assignments (as read from the files), while the Assignment to use is stored *only* in the LinkResonance, to guarantee consistency. For links to atoms we can use the chainCode and resNum from the Assignment. The precise set of atoms could be deduced from the atomName (for valine methyl groups, for example, we would have HGa*, HGb*, HG1*, HG2*, and HG*). Alternatively we could have a separate list of atom names and a text field to distinguish stereospecific from ambiguous assignments. In either case we would need to calculate names and sets of atoms from each other, so that they are guaranteed to be consistent.
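Deducing atom sets from names such as HGa*, HG1* and HG* might look something like the sketch below. It is a rough illustration only: the valine hydrogen list is written out by hand, the function name and return format are invented here, and the rule that a lower-case 'a' or 'b' marks a non-stereospecific branch is our reading of the names listed above.

```python
import re
from fnmatch import fnmatchcase

# Hand-written hydrogen names for valine, used only for this example.
VAL_HYDROGENS = ["H", "HA", "HB",
                 "HG11", "HG12", "HG13",
                 "HG21", "HG22", "HG23"]


def expand_atom_name(name, atom_names):
    """Return (candidate_sets, ambiguous) for an assignment atom name.

    A name like 'HGa*' is treated as non-stereospecific: each branch
    digit that matches some atoms gives one candidate set. Any other
    name is matched as a plain wildcard and gives a single set.
    """
    match = re.fullmatch(r"([A-Z]+)([ab])(\*?)", name)
    if match:
        prefix, _, star = match.groups()
        candidates = []
        for branch in "123":
            pattern = prefix + branch + star
            hits = [atom for atom in atom_names if fnmatchcase(atom, pattern)]
            if hits:
                candidates.append(frozenset(hits))
        return candidates, True
    hits = [atom for atom in atom_names if fnmatchcase(atom, name)]
    return [frozenset(hits)], False


stereo_sets, stereo_ambiguous = expand_atom_name("HG1*", VAL_HYDROGENS)
branch_sets, branch_ambiguous = expand_atom_name("HGa*", VAL_HYDROGENS)
all_sets, all_ambiguous = expand_atom_name("HG*", VAL_HYDROGENS)
```

Here 'HG1*' gives one set of three atoms, 'HGa*' gives two alternative sets (either methyl group), and 'HG*' gives one set of all six methyl hydrogens. Going the other way, from atom sets back to names, would need the inverse function so the two stay consistent.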

User Interface

The Chemical Shifts table gets three new columns: 'Original Assignment', 'Residue Offset' and 'Isotope'. Each can be turned on and off with a toggle ('Show Original Assignments' etc.). The columns should be turned on automatically if they are relevant, but could be off by default. 'Residue Offset' will probably be 0 (and the column off) in most cases, but 'Original Assignment' might actually be better on. The 'Residue Code' column should not be editable; that should be done in the Molecules tab. Editing the Assignment for a shift should change the current assignment only for that shift. If other records share the same LinkResonance, they should be disconnected from it when it is modified. We might consider a column with the number of peaks that contain the same Assignment. Editing the isotope or sequence offset necessarily affects the LinkResonance itself, which will have consequences in other shift lists and peak lists.
There should be an extra shift list, called 'Peaks Only' or something similar, where we put Assignments that are found in peaks but are not in any shift list. Here we would put calculated mean shifts and standard deviations. Alternatively we could call this shift list 'Peak Assignments' and put all Assignments found for any peak.
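Filling such a 'Peaks Only' list could work roughly as in the sketch below. The input layout (pairs of assignment key and peak position in ppm), the string keys, and the function name are assumptions made for the example; the sample standard deviation is used, and it is left as None when there is only one observation.

```python
from collections import defaultdict
from statistics import mean, stdev


def peaks_only_shifts(peak_dims, shift_list_assignments):
    """Return {assignment: (mean_shift, std_dev)} for peak-only assignments.

    `peak_dims` is assumed to be an iterable of (assignment, position) pairs
    and `shift_list_assignments` the set of assignments that already occur
    in some shift list.
    """
    positions = defaultdict(list)
    for assignment, position in peak_dims:
        if assignment not in shift_list_assignments:
            positions[assignment].append(position)
    result = {}
    for assignment, values in positions.items():
        deviation = stdev(values) if len(values) > 1 else None
        result[assignment] = (mean(values), deviation)
    return result


peak_dims = [
    ("A.10.Ala.H", 8.0),
    ("A.10.Ala.H", 8.2),
    ("A.11.Val.H", 7.5),
    ("A.12.Gly.H", 8.5),  # already present in a shift list, so skipped
]
peaks_only = peaks_only_shifts(peak_dims, {"A.12.Gly.H"})
```

For the 'Peak Assignments' variant one would simply pass an empty set as the second argument, so that every Assignment found in a peak is included.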

The Peak List Details table should be unchanged (no room for more). It should show the actual rather than the original assignment. The Edit Peak Assignment popup should have an extra column with 'Original Assignment'.

There would be no need for a 'Resonance Links' tab. The Atoms corresponding to a given Assignment should be calculated automatically.

 

Molecules, Atoms and Coordinates

Mappings

As discussed here we need to have editable maps of residueCodes to ccpCodes, atom names to official atom names (globally, by residue code, and for unmapped residue codes), and input sequence to official sequence. We would need different maps for reading NMR assignment data and coordinates. The tables would have to be there like the other tabs, so that warnings and errors can show up. In the fortunately common case where they are all straightforward they will all be green and can be easily skipped.
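The layered atom name maps could be looked up along the lines sketched below. Everything here is an assumption for illustration: the class, the precedence (per-residue map before global map, and a separate map for residues whose code could not be mapped), the 'ok'/'unmapped' status values that a table could colour green or flag, and the example entries.

```python
from dataclasses import dataclass, field


@dataclass
class AtomNameMaps:
    """Hypothetical container for the three kinds of atom name map."""
    global_map: dict = field(default_factory=dict)
    by_residue: dict = field(default_factory=dict)  # ccpCode -> {name: official}
    unmapped_residue: dict = field(default_factory=dict)

    def lookup(self, atom_name, ccp_code):
        """Return (official_name, status); status is 'ok' or 'unmapped'.

        `ccp_code` is None when the residue code itself could not be mapped.
        """
        if ccp_code is None:
            tables = [self.unmapped_residue, self.global_map]
        else:
            tables = [self.by_residue.get(ccp_code, {}), self.global_map]
        for table in tables:
            if atom_name in table:
                return table[atom_name], "ok"
        # Nothing matched: keep the original name and flag it for the user.
        return atom_name, "unmapped"


# Illustrative entries only, not a real naming table.
maps = AtomNameMaps(
    global_map={"HN": "H"},
    by_residue={"Val": {"HG1#": "HG1*"}},
    unmapped_residue={"QX": "HX*"},
)
```

A row whose lookup returns 'unmapped' would be the kind of entry that shows up as a warning in the mapping table, while a file where every lookup returns 'ok' could be skipped over.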

User Interface

The Molecules tab should be used only for sequences, and a separate tab should be used for coordinates. In Molecules we would have Chain Code, Residue Number, Residue Code (ccpCode, please) and an 'Original Data' column that could be toggled off. There is no need to edit the Atom codes - they follow directly from the sequence. If desired the atom names can be put in a single column as a comma-separated list. One might want to edit disulfide links and protonation states at some point.

Reading coordinates would need a special tab. We would need the same setup with mappings for residue codes, atom names, and sequence as for assignments. We cannot assume that the correct sequence is always read with the coordinates (I just had a case where the PDB file had missing loops). Most coordinate reading would probably be handled by the mappings. I would propose a Coordinates tab with columns for chainCode, resNum, resCode, atomName, and an Original Data column that can be turned off. We should probably show the atom number from the file, but it should not be editable. We can choose whether to show the actual coordinates, B factors, and occupancies, but maybe they need not be editable; this is a data exchange tool, not a data editor.
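Mapping a coordinate sequence with missing loops onto the official sequence could be prototyped with the standard library, as in the sketch below. This uses difflib's generic longest-matching-block heuristic rather than a proper sequence alignment, so it is a stand-in under that assumption; the function name and the example sequences are invented.

```python
from difflib import SequenceMatcher


def map_sequences(input_seq, official_seq):
    """Return {input_index: official_index} for residues in matching blocks.

    Both arguments are lists of residue codes. Residues that fall outside
    every matching block are left out of the mapping and would need to be
    resolved by the user.
    """
    matcher = SequenceMatcher(None, input_seq, official_seq, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping


official = ["Met", "Ala", "Val", "Gly", "Ser", "Leu", "Lys"]
from_pdb = ["Met", "Ala", "Val", "Leu", "Lys"]  # Gly-Ser loop missing
seq_map = map_sequences(from_pdb, official)
```

In this example the first three residues map one-to-one and the last two are shifted past the missing loop. Short or repetitive fragments can match in more than one place, which is one reason the resulting sequence map should stay editable.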

 

Problems

We have only the integer resNum to distinguish residues. That may trouble us in two ways.

  • There are no insertion codes in NMR assignment, but they may appear in PDB files. The data model supports keeping the original residue numbering and insertion code (with an alternate seqId that starts at one). Currently we cannot preserve the authors' sequence numbering if they use insertion codes. We could change that, but it might not be worth the hassle.
  • We also use the resNum as the only identifier for unidentified spin systems. If we ever get something assigned to 'impurity2 Ala, HB*' we would have to convert 'impurity2' to a random integer. Probably this problem is too exotic to care about.