This background material describes the models behind the NMR prediction:
The current version of NMR Predictor combines two basic methods for chemical shift prediction: similarity search based on HOSE code technology, and QSPR modeling. After a concise introduction to the HOSE code technology, the QSPR descriptors and decision tree-based QSPR modeling, we show how the HOSE and QSPR approaches can be merged to obtain an accurate and robust chemical shift prediction model.
The HOSE code technology is often used to describe the chemical environment of a selected atom up to a certain radius. Atoms with the same HOSE code are assumed to have similar chemical shift values. The larger the radius of the common HOSE code, the more similar the chemical shifts are expected to be. Given a database containing HOSE codes as keys and the corresponding experimental chemical shifts as values, chemical shift values can be predicted by similarity search.
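The lookup described above can be sketched as a dictionary that maps HOSE code strings to lists of experimental shifts. This is only a minimal sketch: the keys and shift values below are made-up placeholders rather than real HOSE codes or database entries, and build_shift_table and predict_from_table are hypothetical helper names, not part of NMR Predictor.

```python
from collections import defaultdict
from statistics import mean


def build_shift_table(observations):
    """Group experimental shifts (ppm) by the HOSE code of the observed atom."""
    table = defaultdict(list)
    for hose_code, shift_ppm in observations:
        table[hose_code].append(shift_ppm)
    return dict(table)


def predict_from_table(table, hose_code):
    """Return the mean stored shift for this code, or None if the code is unknown."""
    shifts = table.get(hose_code)
    return mean(shifts) if shifts else None


# Placeholder keys and values, for illustration only.
observations = [("code-A", 7.0), ("code-A", 7.5), ("code-B", 1.0)]
table = build_shift_table(observations)
```

With this table, predict_from_table(table, "code-A") averages the two stored values, while an unknown code yields None, which signals that a less specific description of the environment is needed.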
In order to reach a better chemical shift prediction accuracy the following clusters have been introduced:
sp2 CH or CH2
protons attached to C
heteroatomic protons (X-H, where X is not C)
Each of these clusters has an M5P decision tree-based chemical shift prediction model. Decisions corresponding to the nodes of the tree are made based on the chemical shift descriptor values until one of the leaves is reached. Each leaf of the decision tree corresponds to a multilinear regression (MLR) model which is employed for the prediction of the chemical shift of the focus atom.
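The way such a model tree produces a prediction can be illustrated with a small sketch. The classes below are hypothetical and are not the M5P implementation itself: they assume that each internal node compares a single descriptor with a threshold (values less than or equal to the threshold go left) and that each leaf stores an intercept and one coefficient per descriptor. The thresholds and coefficients are invented for the example.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    """Leaf with a multilinear regression: intercept + sum(coef * descriptor)."""
    intercept: float
    coefficients: Sequence[float]

    def predict(self, descriptors):
        return self.intercept + sum(
            c * x for c, x in zip(self.coefficients, descriptors)
        )


@dataclass
class Node:
    """Internal node: test one descriptor against a threshold and descend."""
    index: int
    threshold: float
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]

    def predict(self, descriptors):
        # Assumed convention: values <= threshold follow the left branch.
        if descriptors[self.index] <= self.threshold:
            return self.left.predict(descriptors)
        return self.right.predict(descriptors)


# Toy tree with invented numbers: one split on descriptor 0, two MLR leaves.
tree = Node(
    index=0,
    threshold=0.5,
    left=Leaf(intercept=1.0, coefficients=[0.0, 0.5]),
    right=Leaf(intercept=2.0, coefficients=[1.0, 0.0]),
)
```

Calling tree.predict(descriptors) walks from the root to a leaf and evaluates that leaf's regression on the full descriptor vector, so different regions of descriptor space get different linear models.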
In order to predict chemical shifts accurately, we combined the decision tree-based and HOSE models as follows:
For 1H NMR:
Start with a HOSE radius of 6 and generate HOSE code for the focus atom
If there are shifts corresponding to this HOSE code, return their average
If not, go to HOSE radius of 5, ...
The minimal possible HOSE radius is 4
Invoke the M5P-based chemical shift model if there have not been any HOSE hits.
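The 1H procedure above can be sketched as a loop over decreasing HOSE radii with a model fallback. This is a sketch under stated assumptions, not the product's code: predict_1h_shift, toy_hose_code, toy_table and toy_model are hypothetical names, the HOSE code generator and the M5P model are passed in as plain callables, and the toy data are placeholders.

```python
from statistics import mean


def predict_1h_shift(atom, shift_table, hose_code, fallback_model,
                     max_radius=6, min_radius=4):
    """Try HOSE lookups from max_radius down to min_radius, then fall back.

    Returns (predicted shift, description of the method that produced it).
    """
    for radius in range(max_radius, min_radius - 1, -1):
        shifts = shift_table.get(hose_code(atom, radius))
        if shifts:
            return mean(shifts), f"HOSE radius {radius}"
    # No HOSE hit at any allowed radius: use the decision tree-based model.
    return fallback_model(atom), "M5P model"


# Stand-ins for a real HOSE code generator, shift database and trained model.
def toy_hose_code(atom, radius):
    return f"{atom}@{radius}"


def toy_model(atom):
    return 3.0


toy_table = {"H1@6": [7.0, 7.5], "H2@4": [1.0]}
```

In this toy setup, "H1" is resolved at radius 6, "H2" only at the minimal radius 4, and any other atom falls through to the fallback model. Returning the method alongside the value makes it easy to see which branch produced a given prediction.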
For 13C NMR:
The training and test chemical shift data were obtained from NMRShiftDB, see the link http://nmrshiftdb.nmr.uni-koeln.de/ for further details.
Two kinds of descriptor sets were implemented for chemical shift prediction: "sum" (for 1H NMR) and "detailed" (for 13C NMR). Both are based on a traversal of the molecule graph starting from a selected atom (the focus atom, that is, the atom whose chemical shift needs to be predicted). After identifying the neighbors of the focus atom in the different spheres (see the figure below), we count the occurrences of previously defined atom types in each sphere. The current implementation employs 6 spheres around the focus atom and an additional sphere containing the rest of the atoms. Currently 40 atom types are handled by the descriptor computation. The numbers of ring closures and hydrogen atoms in a given sphere are added to the 40 atom type counts for each sphere. The same procedure is repeated for atoms belonging to one of the pi-electron systems of the molecule. Thus the total number of chemical shift descriptors in the sum model equals 2*(6+1)*(40+2) = 588.
In addition to the descriptors of the sum model, the detailed model also utilizes 8 physicochemical descriptors (valence, period, electronegativity, van der Waals radius, hybridization, bond type to previous atom, number of protons attached, ring closure count) to characterize the atoms of the inner spheres (in our case, only the first sphere). The rest of the spheres are described by the previously outlined method. Thus the detailed model generates 2*6*(40+2) + 4*8 = 536 descriptors for 13C NMR chemical shift prediction.
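The sphere-based counting can be illustrated with a breadth-first traversal. The sketch below makes several assumptions that are not taken from the implementation: a sphere is the set of atoms at a given bond distance from the focus atom, element symbols stand in for the 40 atom types, and the ring closure, hydrogen and pi-system counts are omitted. sphere_counts is a hypothetical helper, and the molecule is a linear chain of eight carbons ending in an oxygen (heavy atoms only). The last two lines simply recompute the descriptor totals quoted above.

```python
from collections import Counter, deque


def sphere_counts(adjacency, atom_types, focus, n_spheres=6):
    """Count atom types per sphere (bond distance) around the focus atom.

    Entries 0..n_spheres-1 hold spheres 1..n_spheres; the final entry
    collects every atom that lies further away.
    """
    distance = {focus: 0}
    queue = deque([focus])
    while queue:  # breadth-first traversal of the molecule graph
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    counts = [Counter() for _ in range(n_spheres + 1)]
    for atom, d in distance.items():
        if d == 0:
            continue  # the focus atom itself is not counted
        counts[min(d, n_spheres + 1) - 1][atom_types[atom]] += 1
    return counts


# Linear chain of heavy atoms: C0-C1-...-C7-O8, with atom 0 as the focus atom.
adjacency = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 8] for i in range(9)}
atom_types = {i: "C" for i in range(8)}
atom_types[8] = "O"
counts = sphere_counts(adjacency, atom_types, focus=0)

# Descriptor totals from the text, recomputed as a check.
n_sum = 2 * (6 + 1) * (40 + 2)
n_detailed = 2 * 6 * (40 + 2) + 4 * 8
```

Here each of the six spheres contains one carbon, and the additional "rest" sphere collects the last carbon and the oxygen, which both lie more than six bonds from the focus atom.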