
New Implementation package draft.

A proposal for the new Implementation package, with reasons and explanations. IMPLEMENTED as variant. Rasmus Fogh 2007

Description
Justification
Alternatives


NB A new alternative proposal has been added at the bottom of the file

 

Description

The package would look as follows:

Implementation Package Proposal draft - June 2006


I have taken some liberties with the UML rules. The diagram shows inheritance for both the file version (green) and the database version (blue) of the Implementation package, but in any given implementation one of the two sets would be removed. If we need to treat Hibernate as a special case, we could have a third alternative. DbTopObject might have some content; I just do not know what it would be. Note that MemopsRoot would be a FileTopObject in the database implementation as well. The Nmr and ChemComp classes are examples of TopObjects: every package would have a TopObject with a parent link and a 'current' link from MemopsRoot. The 'current' link would point to the last accessed TopObject for the package and would be stored in the Implementation package.

As an aside, a 'file-based' implementation can actually store some of its data in a database. The crucial point is that data are loaded into (and saved from) memory one package extent at a time, and all modifications are done in memory. A 'database-based' implementation, on the other hand, continuously keeps its objects synchronised with the contents of a database.
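The distinction can be illustrated with a minimal sketch. The class names, method names, and the use of a plain dictionary as the persistent backend are all assumptions made for illustration; nothing here is part of the proposed API.

```python
class LoadByPackageStore:
    """'File-based' style: load one package extent, edit in memory, save it back."""

    def __init__(self, backend):
        # Hypothetical persistent storage: {extent_name: {key: value}}
        self.backend = backend
        self.memory = {}  # extents currently loaded into memory

    def load(self, extent):
        # Copy the whole extent into memory in one go
        self.memory[extent] = dict(self.backend.get(extent, {}))

    def set(self, extent, key, value):
        # Modifications touch memory only
        self.memory[extent][key] = value

    def save(self, extent):
        # Write the whole extent back in one go
        self.backend[extent] = dict(self.memory[extent])


class WriteThroughStore:
    """'Database-based' style: every change is synchronised immediately."""

    def __init__(self, backend):
        self.backend = backend

    def set(self, extent, key, value):
        self.backend.setdefault(extent, {})[key] = value
```

In the first class the backend sees nothing until save() is called; in the second it is never out of date. Where the backend physically lives (files or a database) is a separate question, which is the point of the aside above.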

The Repository class represents either a directory or a database. Note that you could in theory use a mixture of databases and files for the 'file-based' (or load-by-package) implementation.

The LocationInfo class could have an object for every package, but might replace several (or all) of them with a LocationInfo marked 'default'. Every LocationInfo would give a search path to look for data to load and a single backup location. In a database implementation you could have a single LocationInfo object (default) and a single Repository object.
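A rough sketch of the per-package lookup with a 'default' fallback might look as follows. The attribute names, the find_location helper, and the package names and paths in the usage lines are hypothetical; only the Repository and LocationInfo class names come from the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Repository:
    name: str
    url: str  # a directory path or a database address (assumed representation)


@dataclass
class LocationInfo:
    package_name: str  # 'default' marks the fallback entry
    search_path: list  # Repository objects, searched in order when loading
    backup: Optional[Repository] = None  # the single backup location


def find_location(infos, package_name):
    """Return the LocationInfo for a package, falling back to 'default'."""
    by_name = {info.package_name: info for info in infos}
    return by_name.get(package_name, by_name.get("default"))


# Usage: one package-specific entry, everything else handled by 'default'
main = Repository("main", "/data/project")
nmr_repo = Repository("nmr", "/data/nmr")
infos = [
    LocationInfo("default", [main], backup=main),
    LocationInfo("ccp.nmr.Nmr", [nmr_repo, main]),
]
```

In the database case described above, the list would shrink to the single 'default' entry pointing at a single Repository.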

There would be a class subclassing AbstractTopObject in every package, and all classes in the package would be (grand)children of this class. This is a new requirement, and somewhat controversial with the database people. The 'dump' and 'restore' functions would dump or restore the TopObject and its children to the backup location (if set) or the general backup location for the package type (found from the LocationInfo object). This allows you to do limited dumps (typically in XML) without messing around with the storage pointers.
AbstractTopObjects would have a guid (a generated permanent universal identifier) as mainkey. They would also have a 'pseudokey', which would be unique and work like the current key except that it would be changeable. The implementation would be set up so that you could get a given AbstractTopObject from either the guid or the pseudokey without knowing where on the Repository search path it was stored. For DB implementations this is obvious - for file implementations it requires you to use the pseudokey in the file names, and to store an external guid<->pseudokey mapping as part of the storage.
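The two-key lookup could be sketched roughly as below. TopObjectIndex and its methods are hypothetical names, and using uuid4 to generate the guid is an assumption; the proposal only says the guid is generated, permanent and universal.

```python
import uuid


class TopObjectIndex:
    """Find a TopObject by permanent guid or by changeable pseudokey."""

    def __init__(self):
        self._by_guid = {}            # guid -> TopObject
        self._guid_by_pseudokey = {}  # pseudokey -> guid (the external mapping)

    def register(self, pseudokey, top_object, guid=None):
        if pseudokey in self._guid_by_pseudokey:
            raise ValueError("pseudokey already in use: %r" % pseudokey)
        guid = guid or str(uuid.uuid4())  # assumed guid scheme
        self._by_guid[guid] = top_object
        self._guid_by_pseudokey[pseudokey] = guid
        return guid

    def rename(self, old, new):
        """Change the pseudokey; the guid stays the same."""
        if new != old and new in self._guid_by_pseudokey:
            raise ValueError("pseudokey already in use: %r" % new)
        self._guid_by_pseudokey[new] = self._guid_by_pseudokey.pop(old)

    def by_guid(self, guid):
        return self._by_guid[guid]

    def by_pseudokey(self, pseudokey):
        return self._by_guid[self._guid_by_pseudokey[pseudokey]]
```

For a file implementation the pseudokey-to-guid dictionary stands in for the external mapping that would have to be stored alongside the files.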
It would be illegal to link two objects from the same package that belonged under different TopObjects (unless the links were derived). This would divide the data for each package into different extents that were isolated from each other. There need be no limits on links between objects in different packages. We are, however, considering adding links between the TopObjects so that you could enforce, for example, that peaks from a given NMR extent could be linked only to a few specific MolSystems.
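The basic link rule amounts to a small validity check. This is only a sketch: the DataObj class, the link_allowed function, and the representation of a TopObject by its guid string are assumptions, and the possible extra restrictions between TopObjects are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObj:
    package: str     # name of the package the object's class belongs to
    top_object: str  # guid of the owning TopObject (assumed representation)


def link_allowed(a, b, derived=False):
    """Apply the extent rule to a proposed link between a and b."""
    if derived:
        return True  # derived links are exempt
    if a.package != b.package:
        return True  # no limits between different packages
    # Same package: both ends must sit under the same TopObject
    return a.top_object == b.top_object
```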

Storing the Implementation package

Implementation Package Storage draft - June 2006


Most of the Implementation package is taken up by abstract classes and datatypes. The only objects that actually need to be stored are shown on the diagram above. These would always be stored in a text file, and would not have any place in the database. Note that the AbstractMemopsRoot will be identical to the readable/writableVersion. In a database implementation you need have only five objects: a MemopsRoot, a Repository and LocationInfo for the Implementation file itself, and a Repository and LocationInfo for everything else (default) that would point to the database. If it is preferred not to use this mechanism to store the database access information, you could create the necessary objects on the fly every time and simply not store them.

DataTypes

New ImplementationPackageProposal - June 2006

The Implementation package would also contain data types. The diagram shows a subset of the necessary primitive data types, and two of the new DataObjTypes that would be needed in the Implementation package and elsewhere.

Justification


Namespacing

Having a single object at the top of every package means there will be only a single link from MemopsRoot to each package. This avoids name clashes in links or factory functions from top level classes that happen to have the same name (like Nmr.Experiment and Experiment.Experiment).
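The clash can be made concrete with a small sketch that derives the accessor names MemopsRoot would carry under each scheme. The naming patterns follow the getProjects() and getCurrentTarget() forms used elsewhere in this note; the helper functions and the example class list are made up for illustration.

```python
from collections import Counter

# (package, top-level class) pairs; the first two share a class name
top_level_classes = [
    ("Nmr", "Experiment"),
    ("Experiment", "Experiment"),
    ("Target", "Project"),
]


def flat_root_accessors(classes):
    """Old style: one MemopsRoot accessor per top-level class."""
    return ["get%ss" % cls for _pkg, cls in classes]


def topobject_root_accessors(classes):
    """Proposed style: one MemopsRoot accessor per package TopObject."""
    packages = sorted({pkg for pkg, _cls in classes})
    return ["getCurrent%s" % pkg for pkg in packages]


def clashes(names):
    """Return the accessor names that occur more than once."""
    return sorted(name for name, count in Counter(names).items() if count > 1)
```

With the flat scheme both Experiment classes map to getExperiments on MemopsRoot; with one TopObject per package the names on MemopsRoot are unique because package names are.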

Control of Extents

In a number of packages you need to divide the data into separate extents with no links between them. An example is the Nmr package. There has to be a TopObject to track which objects belong to which extent, even for packages like Nmr where no natural TopObject has existed so far. It also makes sense to control e.g. which MolSystems are allowed for a given Nmr project. Although it is not yet decided how to handle this in detail, the TopObjects would be necessary in order to implement it.
The extent mechanism is a bit rigid, but it could have various uses. For instance, it would allow you to provide rigid separation between, say, samples belonging to different organisations but sharing the same database.

Merges and GUIDs

The system of object keys makes it hard to import and merge large blocks of data. You may have to change the unchangeable keys, and if you do you lose track of which object you are dealing with. Providing globally unique permanent IDs for all objects would be one solution, but it would be rather cumbersome. If all objects are within a TopObject, you can give the TopObject a guid and obtain many of the same advantages. Where you do have several TopObjects you can import the entire object tree, change the pseudokey to fit your local namespace, and still have the guid to track which objects are which. Even if you want to have only a single TopObject in daily use, having multiple extents allows you to import an entire block of data from a different database without name clashes or problems with uniqueness. You can then do statistics over the different extents, and can keep the objects in the same database while considering if and how you want to copy them across to a different TopObject.
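An import along these lines might be sketched as follows. The import_extent function, the dictionary layout of an extent, and the numeric-suffix renaming are all assumptions; the proposal does not say how a clashing pseudokey should be changed.

```python
def import_extent(local, incoming):
    """Add an incoming TopObject tree to the local set of extents.

    local:    dict mapping guid -> extent dict (with a 'pseudokey' entry)
    incoming: extent dict with 'guid' and 'pseudokey' entries
    Returns the pseudokey the imported extent ends up with locally.
    """
    guid = incoming["guid"]
    if guid in local:
        # Same guid means the same extent; merging contents is out of scope here
        raise ValueError("extent %r is already present" % guid)

    taken = {extent["pseudokey"] for extent in local.values()}
    name = incoming["pseudokey"]
    suffix = 1
    while name in taken:
        # Assumed renaming scheme: append a counter until the name is free
        name = "%s_%d" % (incoming["pseudokey"], suffix)
        suffix += 1

    # The guid is kept unchanged, so the origin of the data stays traceable
    local[guid] = dict(incoming, pseudokey=name)
    return name
```

Objects inside the extent keep their ordinary keys untouched, since those only need to be unique within their own TopObject.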

Cleaning up the model

The old Implementation package needed a storage object for every extent, e.g. for every ChemComp. This required a lot of gymnastics when keeping track of several thousand ChemComps. There were also 'HeadObjects' that did nothing except point to the storages, which made for pointless extra links and extra tables. The new Implementation package may well have fewer pointless objects than the old one.

File Implementations

File implementations do need to store the information now given in the FileTopObject class once for every file, so there must be an object to store it in. Previously this was done in the Storage class, which was not on the parent-child path from MemopsRoot to the actual objects. Handling of backup and file search paths can also be done rather better in the new system.

Alternatives

The problems come, of course, for packages that do not have a natural TopObject, or rather, that do not have a natural division in extents. One solution would be to add an essentially empty class, and live with the fact that it was a bit superfluous.

If you decided that certain packages could only ever have one extent, you would lose the advantages of extent control, and the possibility of dumping and merging blocks of objects that belong under a single TopObject. TopObjects or something like them are indispensable in file-based implementations, but you might want them implemented so as to be invisible in DB-based implementations. This would require having the links and factory functions go directly from MemopsRoot to the actual objects (e.g. Samples) in your packages, so you would either lose the advantage of namespacing or (more realistically) have to rename your links. Take as an example the class Target.Project. In the old model you accessed it as memopsRoot.getProjects(). Renaming it to protect the namespaces would make it memopsRoot.getTargetProjects(), whereas the proposed new model might require memopsRoot.getCurrentTarget().getProjects().
Finally, it is not clear whether you could still have the GUID on the topmost objects and how that would be handled - certainly the heterogeneity of the system would complicate matters.

Still, it might be possible to avoid the extra TopObjects. You could have two kinds of packages. Those without TopObjects would have the topmost links going to MemopsRoot directly, and would have a one-way link to a kind of Storage object instead of to the TopObject. This would preserve interoperability between file and DB implementations. The question is whether not adding those classes is worth the greater complexity of the model and coding machinery, the reduced capabilities, and the less uniform structure of the API.

Alternative diagram - October 2006

The diagram below should resolve some of the practical problems of the previous proposal. The general justification and advantages remain.

NewImplPackageOct

ImplementationObject is the superclass for all non-abstract classes in the Implementation package (and nothing elsewhere). MemopsDataObject is the superclass for all non-abstract classes outside the Implementation package (and nothing inside), while AbstractTopObject is the superclass for all TopObject classes except for MemopsRoot. All objects have a topObject link, and all topObjects have a repository link.

Compared to the previous proposal we now have:

  • The Access control link does not refer to Implementation objects, as indeed it should not.  GOOD
  • ApplicationData are specified in only one place. GOOD
  • The TopObject.repository links now appear in two places, which should have no effect on the code. NO WORSE
  • All objects still have a topObject link, but the topObject link now appears on both ImplementationObject and MemopsDataObject, with different valueTypes. The practical effect is that you need different typing for this link in the Implementation package and other packages. In a DB implementation the two kinds of TopObjects have nothing in common anyway. In a file implementation the autogenerated code will have no problems, at most there may be a need for casting in some places in some languages. At a pinch you could always type TopObjects as MemopsObject and cast as appropriate. UNFORTUNATE
  • The hack for distinguishing between DB and file implementation is different. FileTopObject and DbTopObject are not implemented as classes in any API. Instead their contents are merged with that of their 'subtypes' before code generation. The three generalisations marked in green (which strictly speaking are cases of multiple inheritance) are used only as a shorthand for this merging. MemopsRoot is always merged with FileTopObject, while AbstractTopObject is merged with either FileTopObject or DbTopObject depending on implementation. This works as long as neither DbTopObject nor FileTopObject takes part in any link. Strictly speaking MemopsRoot does not need the 'load' and 'unload' functions, but we would need to introduce an extra class to handle that properly (maybe we should). NO WORSE
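The merging step described in the last point could be sketched as below. The MODEL dictionary, the merge_for_codegen function, and the member lists are illustrative assumptions: 'load', 'unload', 'dump', 'restore', 'guid' and 'pseudokey' are taken from this note, the MemopsRoot entry is a placeholder, and DbTopObject is left empty because its content is undecided.

```python
# Each class is reduced to a set of member names for the purpose of the sketch
MODEL = {
    "FileTopObject": {"load", "unload"},
    "DbTopObject": set(),  # content not yet known
    "MemopsRoot": {"currentTopObjects"},  # hypothetical placeholder member
    "AbstractTopObject": {"guid", "pseudokey", "dump", "restore"},
}


def merge_for_codegen(model, implementation):
    """Fold the helper classes into their 'subtypes' before code generation.

    FileTopObject and DbTopObject never appear as classes in the output.
    """
    if implementation not in ("file", "db"):
        raise ValueError("implementation must be 'file' or 'db'")
    helper = "FileTopObject" if implementation == "file" else "DbTopObject"
    return {
        # MemopsRoot is always merged with FileTopObject
        "MemopsRoot": model["MemopsRoot"] | model["FileTopObject"],
        # AbstractTopObject gets whichever helper matches the implementation
        "AbstractTopObject": model["AbstractTopObject"] | model[helper],
    }
```

Because the helper classes are dissolved before generation, the approach only holds as long as no link in the model points at FileTopObject or DbTopObject, as noted above.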

In a database implementation, only MemopsDataObject (optional) and AbstractTopObject would correspond to tables in the database. The contents of MemopsObject and DbTopObject would be merged into those tables. The objects in the Implementation package would exist in memory and would be saved to disk in XML. Alternatively you could always create them on the fly. The implementation permits you to share your data between multiple databases, with alternative lookup paths for different packages. If you prefer to enforce that all data are within one database, you could perhaps simplify the practical use, if not the UML.