The tautomerization models behind the JChem tautomer search

    This page describes the tautomerization models used in the JChem tautomer search.

    The JChem tautomer search decides whether a query molecule and a target molecule are tautomers of each other. It can use one of two tautomerization models for this: generic tautomerization or normal canonical tautomerization.

    To decide tautomer equivalence, the search algorithm first generates the relevant tautomer form of the query and of the target, using the selected model. It then runs a graph equivalence check on the two generated forms. If the two generated tautomer forms are identical, the search considers the query and the target molecules to be tautomers.
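    The decision procedure above can be sketched in a few lines. This is only an illustration, not JChem code: the name are_tautomers, the lookup table TOY_FORMS and the helper toy_form are hypothetical, and plain string equality stands in for the graph equivalence check.

```python
from typing import Callable

def are_tautomers(query: str, target: str,
                  tautomer_form: Callable[[str], str]) -> bool:
    """Return True when both inputs map to the same generated tautomer form.

    tautomer_form stands in for either model (generic or normal canonical).
    String equality replaces the graph equivalence check of the real search.
    """
    return tautomer_form(query) == tautomer_form(target)

# Hypothetical, hand-written stand-in for a tautomer generator:
# acetone and its enol share one form, ethanol has its own.
TOY_FORMS = {
    "CC(=O)C": "form-A",
    "CC(O)=C": "form-A",
    "CCO": "form-B",
}

def toy_form(smiles: str) -> str:
    return TOY_FORMS[smiles]

print(are_tautomers("CC(=O)C", "CC(O)=C", toy_form))  # same form
print(are_tautomers("CC(=O)C", "CCO", toy_form))      # different forms
```

    Passing the generator as a parameter mirrors the fact that the same comparison step is used with either tautomerization model.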

    The following description gives an overview of generic and normal canonical tautomerization.

    Generic tautomers

    The generic tautomer represents all theoretically possible tautomer forms of the input molecule. It is generated based on the following algorithm:

    • All possible H-bond donor and acceptor atoms in the molecule are identified.
      images/download/thumbnails/1803184/generic_tautomer_1.png
    • These atom sets are filtered using the Maximal Allowed Length of Tautomerization Paths option (default value is 4), resulting in narrower sets of donor and acceptor atoms taking part in tautomerization.
      images/download/thumbnails/1803184/generic_tautomer_2.png
    • Tautomer regions are identified by finding the longest connected paths of donor and acceptor atoms which have alternating single and double bonds.
      images/download/thumbnails/1803184/generic_tautomer_3.png
    • The identified regions are converted into a molecular representation by

      • replacing the original bonds within the region with ANY bonds and

      • attaching the number of bonding electrons and the numbers of D (deuterium) and T (tritium) atoms in the region to the region as a data string.

        images/download/thumbnails/1803184/generic_tautomer_4.png

    The output of this generation process is the generic tautomer form of the input molecule, showing the identified distinct tautomer regions.
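    The path search and the bond replacement described above can be illustrated with a small toy graph. This is a minimal sketch under stated assumptions, not the JChem implementation: the function names are hypothetical, donor and acceptor atoms are supplied by hand, the maximal path length is assumed to be counted in bonds, and the data string (bonding electrons, D and T counts) is omitted.

```python
from collections import deque

def alternating_paths(bonds, donors, acceptors, max_len=4):
    """Find donor-to-acceptor paths whose bond orders alternate 1, 2, 1, 2, ...

    bonds maps frozenset({a, b}) to a bond order (1 or 2).
    A path is accepted when it ends on an acceptor reached via a double bond
    and has at most max_len bonds (assumption: length is counted in bonds).
    """
    neighbours = {}
    for bond in bonds:
        a, b = tuple(bond)
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    found = []
    for donor in sorted(donors):
        queue = deque([[donor]])
        while queue:
            path = queue.popleft()
            n_bonds = len(path) - 1
            if n_bonds > 0 and n_bonds % 2 == 0 and path[-1] in acceptors:
                found.append(path)
                continue
            if n_bonds == max_len:
                continue
            expected = 1 if n_bonds % 2 == 0 else 2  # next bond order
            for nxt in neighbours.get(path[-1], []):
                if nxt not in path and bonds[frozenset((path[-1], nxt))] == expected:
                    queue.append(path + [nxt])
    return found

def generic_form(bonds, donors, acceptors, max_len=4):
    """Replace every bond on an accepted path with the marker "ANY"."""
    region = set()
    for path in alternating_paths(bonds, donors, acceptors, max_len):
        region.update(frozenset(pair) for pair in zip(path, path[1:]))
    return {bond: ("ANY" if bond in region else order)
            for bond, order in bonds.items()}

# Heavy-atom skeleton C0-C1(-O2)-C3 in its oxo and enol forms.
KETO = {frozenset((0, 1)): 1, frozenset((1, 2)): 2, frozenset((1, 3)): 1}
ENOL = {frozenset((0, 1)): 1, frozenset((1, 2)): 1, frozenset((1, 3)): 2}

keto_generic = generic_form(KETO, donors={0, 3}, acceptors={2})
enol_generic = generic_form(ENOL, donors={0, 2}, acceptors={3})
print(keto_generic == enol_generic)
```

    In this toy case both inputs end up with the same set of ANY bonds, which is the property the generic model relies on when two tautomers are compared.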

    Normal canonical tautomers

    In contrast to the generic form, the normal canonical form represents only a subset of all possible tautomers of the input structure.

    The normal canonical forms are generated based on the following algorithm:

    • All possible H-bond donor and acceptor atoms in the molecule are identified.

    • These atom sets are filtered using the Maximal Allowed Length of Tautomerization Paths option (default value is 4) and the built-in tautomerization rules of the normal canonical tautomerization model (e.g. aromaticity protection). This step results in narrower sets of donor and acceptor atoms taking part in tautomerization.

    • All possible tautomer forms are generated using these new donor and acceptor atom sets.

    • One final normal canonical form is selected from the generated forms using a scoring function.

    The output of this generation process is the normal canonical form of the molecule.
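    The final selection step can be sketched as follows. The source does not describe the scoring function, so toy_score below is an invented placeholder (it merely counts carbonyl-like "=O" fragments in a SMILES string), and select_canonical is a hypothetical name; only the idea of picking one form by score is taken from the description above.

```python
def select_canonical(forms, score):
    """Pick the highest-scoring form.

    Sorting first makes ties resolve to the lexicographically smallest
    string, so the result does not depend on generation order.
    """
    return max(sorted(forms), key=score)

def toy_score(smiles: str) -> int:
    # Invented placeholder, NOT the JChem scoring function.
    return smiles.count("=O")

generated = ["CC(O)=C", "CC(=O)C"]  # enol and oxo form of the same skeleton
print(select_canonical(generated, toy_score))
```

    A deterministic tie-break matters here because the search compares the selected forms of two molecules, so the same set of generated forms must always yield the same result.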

    Examples

    The following examples show how generic and normal canonical tautomerization behave for the five most common tautomerization types.

    Oxo-enol tautomerization

    Molecules | Generic tautomers | Normal canonical tautomers
    images/download/thumbnails/1803184/oxo_enol1.png | images/download/thumbnails/1803184/oxo_enol_g1.png | images/download/thumbnails/1803184/oxo_enol_c1.png
    images/download/thumbnails/1803184/oxo_enol2.png | images/download/thumbnails/1803184/oxo_enol_g2.png | images/download/thumbnails/1803184/oxo_enol_c2.png
    images/download/thumbnails/1803184/oxo_enol3.png | images/download/thumbnails/1803184/oxo_enol_g3.png | images/download/thumbnails/1803184/oxo_enol_c3.png
    images/download/thumbnails/1803184/oxo_enol4.png | images/download/thumbnails/1803184/oxo_enol_g4.png | images/download/thumbnails/1803184/oxo_enol_c4.png
    images/download/thumbnails/1803184/diffs1.png | images/download/thumbnails/1803184/diffs_g1.png | images/download/thumbnails/1803184/diffs_c1.png

    Amine-imine tautomerization

    Molecules | Generic tautomers | Normal canonical tautomers
    images/download/thumbnails/1803184/amine_imine1.png | images/download/thumbnails/1803184/amine_imine_g1.png | images/download/thumbnails/1803184/amine_imine_c1.png
    images/download/thumbnails/1803184/amine_imine2.png | images/download/thumbnails/1803184/amine_imine_g2.png | images/download/thumbnails/1803184/amine_imine_c2.png
    images/download/thumbnails/1803184/amine_imine4.png | images/download/thumbnails/1803184/amine_imine_g4.png | images/download/thumbnails/1803184/amine_imine_c4.png
    images/download/thumbnails/1803184/amine_imine5.png | images/download/thumbnails/1803184/amine_imine_g5.png | images/download/thumbnails/1803184/amine_imine_c5.png

    Amide-imide tautomerization

    Molecules | Generic tautomers | Normal canonical tautomers
    images/download/thumbnails/1803184/amide_imide1.png | images/download/thumbnails/1803184/amide_imide_g1.png | images/download/thumbnails/1803184/amide_imide_c1.png
    images/download/thumbnails/1803184/amide_imide2.png | images/download/thumbnails/1803184/amide_imide_g2.png | images/download/thumbnails/1803184/amide_imide_c2.png

    Lactam-lactim tautomerization

    Molecules | Generic tautomers | Normal canonical tautomers
    images/download/thumbnails/1803184/lactame_lactime1.png | images/download/thumbnails/1803184/lactame_lactime_g1.png | images/download/thumbnails/1803184/lactame_lactime_c1.png

    Nitroso-oxime tautomerization

    In the case of nitroso-oxime tautomerization, the generated generic tautomer forms are the same, while the normal canonical tautomers are different. The normal canonical model thus treats the two forms as distinct compounds, which reflects that both forms are stable and exist in water as separate species.

    Molecules | Generic tautomers | Normal canonical tautomers
    images/download/thumbnails/1803184/nitroso_oxime.png | images/download/thumbnails/1803184/nitroso_oxime_g.png | images/download/thumbnails/1803184/nitroso_oxime_c.png

    Counterexamples - differences between the two models

    The following examples show molecule pairs for which the generic forms are identical but the normal canonical forms are different. The generic tautomerization model therefore considers the two molecules a tautomer pair, while the normal canonical model does not and treats them as distinct molecules.

    Molecules | Generic tautomers | Normal canonical tautomers
    images/download/thumbnails/1803184/diffs0.png | images/download/thumbnails/1803184/diffs_g0.png | images/download/thumbnails/1803184/diffs_c0.png
    images/download/thumbnails/1803184/diffs2.png | images/download/thumbnails/1803184/diffs_g2.png | images/download/thumbnails/1803184/diffs_c2.png
    images/download/thumbnails/1803184/diffs_nci1.png | images/download/thumbnails/1803184/diffs_nci_g1.png | images/download/thumbnails/1803184/diffs_nci_c1.png
    images/download/thumbnails/1803184/diffs_nci2.png | images/download/thumbnails/1803184/diffs_nci_g2.png | images/download/thumbnails/1803184/diffs_nci_c2.png
    images/download/thumbnails/1803184/diffs_nci3.png | images/download/thumbnails/1803184/diffs_nci_g3.png | images/download/thumbnails/1803184/diffs_nci_c3.png
    images/download/thumbnails/1803184/diffs_nci4.png | images/download/thumbnails/1803184/diffs_nci_g4.png | images/download/thumbnails/1803184/diffs_nci_c4.png

    Speed

    Generic tautomer generation was measured to be roughly 5x faster than normal canonical generation (about 5.2 s versus 25.3 s of wall-clock time on the same input file). These small-scale speed tests were run on a MacBook Pro (2.7 GHz Intel Core i5, 8 GB DDR3).

    
    $ time cxcalc -N ih generictautomer nci_rnd_1000.smiles >nci_rnd_1000_generic.smiles 
    
    real    0m5.225s
    user    0m12.194s
    sys 0m0.573s
    
    $ time cxcalc -N ih canonicaltautomer --normal nci_rnd_1000.smiles >nci_rnd_1000_n_canonical.smiles 
    
    real    0m25.303s
    user    1m9.342s
    sys 0m1.683s
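    The speed-up factor can be recomputed from the timings above. The short sketch below only parses the "XmY.YYYs" values printed by time and divides them; the helper name to_seconds is chosen for illustration.

```python
import re

def to_seconds(stamp: str) -> float:
    """Convert a `time` value such as "1m9.342s" into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

# Values copied from the two runs above.
real_speedup = to_seconds("0m25.303s") / to_seconds("0m5.225s")
user_speedup = to_seconds("1m9.342s") / to_seconds("0m12.194s")

print(f"wall-clock speed-up: {real_speedup:.2f}x")  # 4.84x
print(f"user CPU speed-up:   {user_speedup:.2f}x")  # 5.69x
```

    Both ratios are consistent with the rounded 5x figure; the wall-clock ratio is slightly below it and the user CPU ratio slightly above.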