Training the pKa Plugin

    This manual gives you a walk-through on how to train the pKa Plugin.


    If you think your experimental data can improve the accuracy of the pKa calculation, you can take advantage of the supervised pKa learning method that is built into the pKa plugin. Special structural parts can affect the pKa values calculated by the built-in method, so a correction library based on your experimental data can help the pKa plugin increase its prediction accuracy.

    Inaccurately predicted ionization centers need to be identified, and experimental data for them have to be collected. Since the learning algorithm is based on linear regression analysis, you need to collect as much experimental pKa data as possible to obtain a meaningful correlation. There are no hard-and-fast rules about the amount of data required. If you want to create a local model for only a certain type of ionization center, a few representative structures may be enough. A robust model, however, requires as many diverse structures and pKa values as possible.

    The experimental data should be collected in an SD file. Then the training command has to be run in order to create a correction library. This will be stored on your local computer, in your user folder.

    Finally, this correction library can be applied via MarvinSketch, Chemical Terms, or the cxcalc command line tool.

    Training steps

    Preparing the input file

    To create a training library, first prepare a proper input file in SDF or MRV format. This file can be compiled using either Instant JChem or JChem for Excel.

    The SD file should contain the following pieces of information:

    • Structure of the molecule

    • pKa value 1 (field name: pKa1)

    • ID of the atom which has the pKa1 value (field name: ID1). Atom IDs can be displayed by checking the Atom number option in MarvinView (View > Misc menu).

    • Additional pKa fields are optional (recommended for handling multiprotic compounds). For example, pKa value 2 (pKa2), ID2, etc.

    • A single pKa value per structure is enough to use it as training data, but supplying more values for multiprotic compounds will enhance the reliability of the pKa training.

    A sample of a typical training set is shown in the picture (pKa_trainingset.sdf). ID1 is the index of the atom with the experimental pKa1 value.


    Fig. 1 Input for training library generation
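    The field layout described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the ChemAxon tooling: the function name sd_record is hypothetical, the molblock is assumed to come from an existing MOL export, and the atom IDs are assumed to be entered exactly as MarvinView displays them. Only the field names pKa1, ID1, pKa2, ID2, ... are taken from the list above.

```python
def sd_record(molblock, pka_by_atom):
    """Return one SD-file record: the molblock followed by pKaN/IDN data fields.

    molblock    -- text of a MOL block (ending with "M  END"), e.g. exported
                   from a structure editor; it is copied through unchanged.
    pka_by_atom -- list of (atom_id, pka_value) pairs; the first pair becomes
                   ID1/pKa1, the second ID2/pKa2, and so on.
    """
    lines = [molblock.rstrip("\n")]
    for n, (atom_id, pka) in enumerate(pka_by_atom, start=1):
        # Each SD data item is a ">  <NAME>" header, the value, and a blank line.
        lines += [f">  <pKa{n}>", str(pka), ""]
        lines += [f">  <ID{n}>", str(atom_id), ""]
    lines.append("$$$$")  # SD-file record terminator
    return "\n".join(lines) + "\n"
```

    Concatenating the records returned for each molecule and writing them to a file such as mydata.sdf would give an input of the shape shown in Fig. 1.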

    Creating the training library

    The training library can be created using the cxtrain command line tool from an input structural file:

    cxtrain pka -i [library name] [training file] 

    Example:

    cxtrain pka -i mypka mydata.sdf
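    If the training step needs to be scripted, the command above can be wrapped as in the following sketch. The helper names cxtrain_command and train are hypothetical, and the code assumes that cxtrain is available on the PATH; the arguments are exactly those shown in the example and no further options are assumed.

```python
import shutil
import subprocess


def cxtrain_command(library_name, training_file):
    """Build the argument list for: cxtrain pka -i <library name> <training file>."""
    return ["cxtrain", "pka", "-i", library_name, training_file]


def train(library_name, training_file):
    """Run cxtrain for the given library and file; fail early if it is missing."""
    if shutil.which("cxtrain") is None:
        raise FileNotFoundError("cxtrain was not found on PATH")
    # check=True raises CalledProcessError if cxtrain exits with an error code.
    return subprocess.run(cxtrain_command(library_name, training_file), check=True)
```

    Calling train("mypka", "mydata.sdf") would then be equivalent to the example command.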

    Applying the training library

    Once the training library is generated, it can be applied in different ChemAxon tools for pKa calculation.


    To apply the pre-generated training library in MarvinSketch, follow these steps:

    1. Select Tools > Protonation > pKa in the MarvinSketch menu.

    2. Check the Use correction library option to activate the training option (see figure below).

    3. If you have created multiple training sets, choose the most accurate one from the dropdown list below the checkbox.


    Fig. 2 Using the generated training library in MarvinSketch

    The following figure shows the results with (I) and without (II) applying the correction library.

    I. pKa calculation with training data    II. pKa calculation without training data


    To include your correction library in the cxcalc pKa calculation, use the parameter --correctionlibrary or its short form -L:

    cxcalc pKa --correctionlibrary [library name] [input file/string]

    Note: If you run the cxcalc pKa calculation without the correction library, the results will be calculated with the built-in dataset.

    Example #1:

    cxcalc pKa --correctionlibrary mypka "CSC1=NC2=C(N1)C=NC(O)=N2"


     id apKa1 apKa2 bpKa1 bpKa2 atoms
     1 11.19 16.01 2.34 -2.59 7,11,9,4

    Example #2:

    cxcalc pKa "CSC1=NC2=C(N1)C=NC(O)=N2"


     id apKa1 apKa2 bpKa1 bpKa2 atoms
     1 8.34 16.01 2.34 -2.59 7,11,9,4
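    To see which values the correction library changed, the two result tables can be compared programmatically. The sketch below is only an illustration: the function names are hypothetical, and it assumes the output is whitespace-separated with no empty columns, as in the two examples above.

```python
def parse_cxcalc_table(text):
    """Parse whitespace-separated cxcalc pKa output into a list of row dicts."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


def pka_shifts(trained, untrained):
    """Return {column: trained - untrained} for the pKa columns that differ."""
    shifts = {}
    for key, value in trained.items():
        if key in ("id", "atoms"):
            continue  # not pKa values
        delta = round(float(value) - float(untrained[key]), 2)
        if delta != 0:
            shifts[key] = delta
    return shifts


# Output of Example #1 (with correction library) and Example #2 (without).
TRAINED = """
id apKa1 apKa2 bpKa1 bpKa2 atoms
1 11.19 16.01 2.34 -2.59 7,11,9,4
"""
UNTRAINED = """
id apKa1 apKa2 bpKa1 bpKa2 atoms
1 8.34 16.01 2.34 -2.59 7,11,9,4
"""

trained_row = parse_cxcalc_table(TRAINED)[0]
untrained_row = parse_cxcalc_table(UNTRAINED)[0]
print(pka_shifts(trained_row, untrained_row))  # {'apKa1': 2.85}
```

    For this molecule only the strongest acidic pKa (apKa1, atom 7) is affected by the correction library; the other three values are identical in both runs.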

    Chemical Terms

    Chemical Terms are available from Chemical Terms Evaluator or from Instant JChem. Evaluator is designed to evaluate Chemical Terms expressions on molecules. Your correction library can be applied as follows:

    evaluate -e "pKa('correctionlibrary:[library name]')" "[input file/string]"

    Example:

    evaluate -e "pKa('correctionlibrary:mypka')" "CSC1=NC2=C(N1)C=NC(O)=N2"


    ;;;-2,59;;;11,19;;2,34;;16,01;
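    The Evaluator output above can be read as one semicolon-separated field per atom. This reading is an inference from the examples rather than documented behavior: the non-empty fields sit at positions 4, 7, 9 and 11, which are the same atoms that cxcalc reports in its atoms column. The decimal comma is assumed to come from locale-dependent number formatting. Under these assumptions, a short sketch (with a hypothetical function name) converts the string into a mapping from position to pKa value:

```python
def parse_per_atom_pka(output):
    """Map each 1-based field position to its pKa value, skipping empty fields."""
    values = {}
    for position, field in enumerate(output.strip().split(";"), start=1):
        if field:
            # Assumption: a decimal comma is used; normalize it before parsing.
            values[position] = float(field.replace(",", "."))
    return values


print(parse_per_atom_pka(";;;-2,59;;;11,19;;2,34;;16,01;"))
# {4: -2.59, 7: 11.19, 9: 2.34, 11: 16.01}
```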

    Instant JChem

    Click the 'New Chemical Terms Field' icon and type the Chemical Terms expression into the window, using the correctionlibrary:[library name] parameter. Do not forget to adjust the Name, the Type and the DB Column Name.


    The following picture demonstrates the usage of pKa training in the 'New Chemical terms' window. The expression

    pKa('correctionlibrary:mypKa type:acidic','1')

    specifies that the plugin uses the correction library named mypKa and calculates the strongest acidic pKa of the molecule(s).


    Fig. 3 New Chemical terms window showing the options to be set for pKa training

    The results of this calculation are shown in the figure below, with the untrained (Strongest acidic pKa column) and trained (Trained strongest acidic pKa column) pKa values.


    Fig. 4 JChem table showing the trained and untrained pKa values