If you think your experimental data can improve the accuracy of the pKa calculation, you can use the supervised pKa learning method that is built into the pKa plugin. Unusual structural features can affect the pKa values calculated by the built-in method, so a correction library based on your own experimental data can help the pKa plugin predict more accurately.
First identify the ionization centers that are predicted inaccurately, then collect experimental data for them. Since the learning algorithm is based on linear regression analysis, you need to collect as much experimental pKa data as possible to obtain a reliable correlation. There are no hard-and-fast rules about the amount of data required. If you want to create a local model for only a certain type of ionization center, a few representative structures may be enough. A robust model, however, requires as many diverse structures and pKa values as possible.
The experimental data should be collected in an SD file. Then the training command has to be run in order to create a correction library. This will be stored on your local computer, in your user folder.
Preparing the input file
To create a training library, first prepare an input file in SDF or MRV format. This file can be compiled using either Instant JChem or JChem for Excel.
The SD file should contain the following pieces of information:
Structure of the molecule
pKa value 1 (field name: pKa1)
ID of the atom which has the pKa1 value (field name: ID1). It can be viewed by checking the Atom number option in MarvinView (View > Misc menu).
Additional fields of pKa values are optional (recommended for handling multiprotic compounds). For example pKa value 2 (pKa2), ID2, etc.
Definition of only one pKa value is enough to apply the training data, but more values in case of multiprotic compounds will enhance the reliability of the pKa training.
Fig. 1 Input for training library generation
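As a rough illustration of the record layout described above, the following Python sketch assembles one SD record from a molblock and a list of (atom ID, pKa) pairs. The function name sd_record and its arguments are hypothetical and not part of any ChemAxon tool. Only the field names pKa1, ID1, pKa2, ID2 are taken from the list above. The sketch assumes the molblock is already a valid MDL molfile ending in "M  END".

```python
def sd_record(molblock, assignments):
    """Return one SD record: the molblock followed by pKaN/IDN data fields.

    molblock    -- text of an MDL molfile (assumed valid, ends with "M  END")
    assignments -- list of (atom_id, pka) pairs, one per ionization center
    """
    if not assignments:
        raise ValueError("at least one (atom ID, pKa) pair is required")

    lines = [molblock.rstrip("\n")]
    for n, (atom_id, pka) in enumerate(assignments, start=1):
        # Each SD data item is a "> <NAME>" header, the value, and a blank line.
        lines += [f"> <pKa{n}>", str(pka), ""]
        lines += [f"> <ID{n}>", str(atom_id), ""]
    lines.append("$$$$")  # SD record terminator
    return "\n".join(lines) + "\n"
```

Concatenating the returned strings for all molecules and writing them to a file such as mydata.sdf gives an input file with the fields listed above. Files compiled with Instant JChem or JChem for Excel need no such script.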
Creating the training library
The training library can be created using the cxtrain command line tool from an input structural file:
cxtrain pka -i [library name] [training file]
cxtrain pka -i mypka mydata.sdf
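When training is scripted, it can help to check the input file before calling cxtrain. The Python sketch below is one possible pre-flight check, not a ChemAxon utility: records_missing_fields and cxtrain_command are hypothetical helper names, and the check assumes an SD input file whose records are separated by "$$$$" lines. The command list mirrors the cxtrain syntax shown above.

```python
import re

REQUIRED_FIELDS = ("pKa1", "ID1")  # the mandatory fields of the input file


def records_missing_fields(sdf_text, required=REQUIRED_FIELDS):
    """Return 1-based indices of SD records that lack a required data field."""
    # Records end with a line containing only "$$$$".
    parts = re.split(r"^\$\$\$\$[ \t]*\r?$", sdf_text, flags=re.M)
    records = [p for p in parts if p.strip()]

    incomplete = []
    for index, record in enumerate(records, start=1):
        # Data headers look like "> <NAME>"; collect every NAME in the record.
        names = set(re.findall(r"^>.*?<([^>]+)>", record, flags=re.M))
        if any(field not in names for field in required):
            incomplete.append(index)
    return incomplete


def cxtrain_command(library_name, training_file):
    """Build the argument list for: cxtrain pka -i [library name] [training file]"""
    return ["cxtrain", "pka", "-i", library_name, training_file]
```

If records_missing_fields returns an empty list, the list from cxtrain_command("mypka", "mydata.sdf") can be passed to subprocess.run, provided cxtrain is on the PATH.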
Applying the training library
Once the training library is generated, it can be applied to pKa calculations in different ChemAxon tools.
To apply the pre-generated training library in MarvinSketch, follow these steps:
Select MarvinSketch menu Tools > Protonation > pKa.
Set the Use correction library option to activate the training option (see figure below).
If you have created multiple training sets, choose the most accurate one from the dropdown list below the checkbox.
Fig. 2 Using the generated training library in MarvinSketch
The following figure shows the results with (I) and without (II) applying the correction library.
I. pKa calculation with training data
II. pKa calculation without training data
To include your correction library in a cxcalc pKa calculation, use the --correctionlibrary parameter or its short form -L:
cxcalc pKa --correctionlibrary [library name] [input file/string]
If you run the cxcalc pKa calculation without the correction library, the results are calculated with the built-in dataset. The two examples below show the same molecule with and without the correction library:
cxcalc pKa --correctionlibrary mypka "CSC1=NC2=C(N1)C=NC(O)=N2"
id apKa1 apKa2 bpKa1 bpKa2 atoms
1 11.19 16.01 2.34 -2.59 7,11,9,4
cxcalc pKa "CSC1=NC2=C(N1)C=NC(O)=N2"
id apKa1 apKa2 bpKa1 bpKa2 atoms
1 8.34 16.01 2.34 -2.59 7,11,9,4
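To see which values the correction library changed, the two outputs above can be compared programmatically. The Python sketch below is a minimal illustration with hypothetical helper names (parse_cxcalc_table, pka_shifts). It assumes the output has one header line followed by one row per molecule, with whitespace-separated columns and no empty cells, as in the examples above.

```python
def parse_cxcalc_table(text):
    """Parse cxcalc table output into a list of {column name: value} dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    # Assumption: every row has a value in every column (no empty cells).
    return [dict(zip(header, line.split())) for line in lines[1:]]


def pka_shifts(trained_row, untrained_row):
    """Return {column: trained - untrained} for pKa columns that differ."""
    shifts = {}
    for column, value in trained_row.items():
        if "pKa" not in column:
            continue  # skip non-pKa columns such as "id" and "atoms"
        delta = round(float(value) - float(untrained_row[column]), 2)
        if delta != 0:
            shifts[column] = delta
    return shifts
```

Applied to the two outputs above, pka_shifts reports a shift of 2.85 in apKa1 (11.19 with the correction library versus 8.34 without) and no change in the other columns.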
Chemical Terms are available from Chemical Terms Evaluator or from Instant JChem. Evaluator is designed to evaluate Chemical Terms expressions on molecules. Your correction library can be applied as follows:
evaluate -e "pKa('correctionlibrary:[library name]')" "[input file/string]"
evaluate -e "pKa('correctionlibrary:mypka')" "CSC1=NC2=C(N1)C=NC(O)=N2"
In Instant JChem, choose the 'New Chemical Terms Field' icon and type the Chemical Terms expression into the window, using the correctionlibrary:[library name] parameter. Do not forget to set the Name, the Type and the DB Column Name.
The following picture demonstrates the usage of pKa training in the 'New Chemical terms' window. The expression
specifies that the plugin uses the correction library named mypKa and calculates the strongest acidic pKa of the molecule(s).
Fig. 3 New Chemical terms window showing the options to be set for pKa training
The results of this calculation are shown in the figure below, with the untrained (Strongest acidic pKa column) and trained (Trained strongest acidic pKa column) pKa values.
Fig. 4 JChem table showing the trained and untrained pKa values