The tautomerization models behind the JChem tautomer search

This page describes the tautomerization models used in the JChem tautomer search.

The JChem tautomer search decides whether a query molecule and a target molecule are tautomers of each other. It can use either of two tautomerization models for this: generic tautomerization or normal canonical tautomerization.

To decide tautomer equivalence, the search algorithm first generates the relevant tautomer form of the query and of the target, using the selected model. It then checks the two generated forms for graph equivalence. If the two generated tautomer forms are identical, the search considers the query and the target molecules to be tautomers.
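The decision procedure can be sketched in a few lines. This is a minimal illustration rather than JChem code: `are_tautomers` and the `to_form` callable are hypothetical names, and the hand-written lookup table stands in for real tautomer generation and graph comparison.

```python
from typing import Callable, Hashable


def are_tautomers(query: str, target: str,
                  to_form: Callable[[str], Hashable]) -> bool:
    """Return True if query and target map to the same tautomer form.

    `to_form` is a hypothetical stand-in for the selected model (generic or
    normal canonical). Equality of its results is assumed to mean graph
    equivalence of the generated forms.
    """
    return to_form(query) == to_form(target)


# Toy stand-in: hand-written labels instead of really generated forms.
TOY_FORMS = {
    "CC=O": "form-A",   # acetaldehyde (oxo form)
    "C=CO": "form-A",   # vinyl alcohol (enol form)
    "CCO": "form-B",    # ethanol, not a tautomer of either
}

print(are_tautomers("CC=O", "C=CO", TOY_FORMS.__getitem__))  # True
print(are_tautomers("CC=O", "CCO", TOY_FORMS.__getitem__))   # False
```

Passing the model in as a callable mirrors the fact that the same comparison step is used with either tautomerization model.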

The following sections give an overview of generic and normal canonical tautomerization.

Generic tautomers

The generic tautomer represents all theoretically possible tautomer forms of the input molecule. It is generated based on the following algorithm:

1. All possible H-bond donor and acceptor atoms in the molecule are identified.
2. These atom sets are filtered using the Maximal Allowed Length of Tautomerization Paths option (default value: 4), which narrows them to the donor and acceptor atoms that take part in tautomerization.
3. Tautomer regions are identified by finding the longest connected paths of donor and acceptor atoms that have alternating single and double bonds.
4. The identified regions are converted into a molecular representation by
   - replacing the original bonds within the region with ANY bonds and
   - attaching the number of bonding electrons and the numbers of deuterium (D) and tritium (T) atoms in the region to the region as a data string.

The output of this generation process is the generic tautomer form of the input molecule, showing the identified distinct tautomer regions.
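The region-finding idea can be illustrated on a toy graph. The sketch below is a strong simplification and not the JChem implementation: the function names are hypothetical, donor and acceptor atoms are supplied by hand instead of being perceived, the path length is assumed to be counted in bonds, and the data string is omitted.

```python
from typing import Dict, FrozenSet, Set, Tuple, Union

Bond = Tuple[int, int]


def find_regions(bonds: Dict[Bond, int], donors: Set[int],
                 acceptors: Set[int], max_len: int = 4) -> Set[FrozenSet[int]]:
    """Atom sets on donor-to-acceptor paths whose bonds alternate
    single, double, ... and that span at most max_len bonds (assumption)."""
    neighbours: Dict[int, Dict[int, int]] = {}
    for (a, b), order in bonds.items():
        neighbours.setdefault(a, {})[b] = order
        neighbours.setdefault(b, {})[a] = order

    found: Set[FrozenSet[int]] = set()

    def walk(path: list, expected: int) -> None:
        atom = path[-1]
        # expected == 1 here means the bond just walked was a double bond.
        if len(path) > 1 and atom in acceptors and expected == 1:
            found.add(frozenset(path))
        if len(path) - 1 == max_len:
            return
        for nxt, order in neighbours.get(atom, {}).items():
            if nxt not in path and order == expected:
                walk(path + [nxt], 2 if expected == 1 else 1)

    for donor in donors:
        walk([donor], 1)  # a path leaves the donor through a single bond
    # Keep only the longest paths: drop regions contained in a larger one.
    return {r for r in found if not any(r < other for other in found)}


def to_generic(bonds: Dict[Bond, int],
               regions: Set[FrozenSet[int]]) -> Dict[Bond, Union[int, str]]:
    """Replace every bond inside a region with the string 'ANY'."""
    return {(a, b): "ANY" if any(a in r and b in r for r in regions) else order
            for (a, b), order in bonds.items()}


# Heavy atoms: 0 = C, 1 = C, 2 = O.
OXO = {(0, 1): 1, (1, 2): 2}    # acetaldehyde, CH3-CH=O
ENOL = {(0, 1): 2, (1, 2): 1}   # vinyl alcohol, CH2=CH-OH

generic_oxo = to_generic(OXO, find_regions(OXO, donors={0}, acceptors={2}))
generic_enol = to_generic(ENOL, find_regions(ENOL, donors={2}, acceptors={0}))
print(generic_oxo == generic_enol)  # True
```

Both toy inputs collapse to the same bond map, which is the property the search relies on when it compares generic forms.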

Normal canonical tautomers

In contrast to the generic form, the normal canonical form represents only a subset of all possible tautomers of the input structure.

The normal canonical forms are generated based on the following algorithm:

1. All possible H-bond donor and acceptor atoms in the molecule are identified.
2. These atom sets are filtered using the Maximal Allowed Length of Tautomerization Paths option (default value: 4) and the built-in tautomerization rules of the normal canonical tautomerization model (e.g. aromaticity protection). This step narrows the sets to the donor and acceptor atoms that take part in tautomerization.
3. All possible tautomer forms are generated using these narrowed donor and acceptor atom sets.
4. One final normal canonical form is selected from the generated forms using a scoring function.

The output of this generation process is the normal canonical form of the molecule.
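The final selection step can be pictured as picking the best-scoring member of the generated set. The sketch below only illustrates that idea: `select_canonical` is a hypothetical name, the scores are made-up values, and the actual JChem scoring function is not described on this page.

```python
from typing import Callable, Iterable


def select_canonical(forms: Iterable[str],
                     score: Callable[[str], float]) -> str:
    """Pick one representative form: highest score wins, and ties are broken
    by the string itself so the result does not depend on generation order."""
    return max(forms, key=lambda form: (score(form), form))


# Made-up scores for two tautomer forms, for illustration only.
TOY_SCORES = {"CC=O": 2.0, "C=CO": 1.0}
forms = ["CC=O", "C=CO"]

print(select_canonical(forms, TOY_SCORES.__getitem__))            # CC=O
print(select_canonical(reversed(forms), TOY_SCORES.__getitem__))  # CC=O
```

Because the selection is deterministic, every molecule that yields the same set of forms ends up with the same representative, which is what makes the result usable for an equivalence check.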

Examples

The following examples show how generic and normal canonical tautomerization behave for the five most common tautomerization types.

Oxo-enol tautomerization

Molecules	Generic tautomers	Normal canonical tautomers

Amine-imine tautomerization

Molecules	Generic tautomers	Normal canonical tautomers

Amide-imide tautomerization

Molecules	Generic tautomers	Normal canonical tautomers

Lactam-lactim tautomerization

Molecules	Generic tautomers	Normal canonical tautomers

Nitroso-oxime tautomerization

In the case of nitroso-oxime tautomerization, the generated generic tautomer forms are the same, while the normal canonical tautomers are different. This reflects that both forms are stable and exist in water as distinct compounds.

Molecules	Generic tautomers	Normal canonical tautomers

Counterexamples - differences between the two models

The following examples show molecule pairs for which the generic forms are identical but the normal canonical forms are different. The generic tautomerization model therefore treats the two molecules as a tautomer pair, while the normal canonical model treats them as distinct molecules.

Molecules	Generic tautomers	Normal canonical tautomers

Speed

In a small speed test, generic tautomer generation was measured to be about 5x faster than normal canonical generation (roughly 5 s versus 25 s of wall-clock time for the same input file). The tests were run on a MacBook Pro (2.7 GHz Intel Core i5, 8 GB DDR3).


$ time cxcalc -N ih generictautomer nci_rnd_1000.smiles >nci_rnd_1000_generic.smiles 

real    0m5.225s
user    0m12.194s
sys     0m0.573s


$ time cxcalc -N ih canonicaltautomer --normal nci_rnd_1000.smiles >nci_rnd_1000_n_canonical.smiles 

real    0m25.303s
user    1m9.342s
sys     0m1.683s
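The speed-up can be checked directly from the two timings above. The helper below is a small sketch (the function name `to_seconds` is our own) that converts the `time` output to seconds and prints the ratios for the wall-clock and user CPU times.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time` value such as '1m9.342s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Values copied from the two runs shown above.
generic = {"real": "0m5.225s", "user": "0m12.194s"}
normal = {"real": "0m25.303s", "user": "1m9.342s"}

for key in ("real", "user"):
    ratio = to_seconds(normal[key]) / to_seconds(generic[key])
    print(f"{key}: {ratio:.2f}x")
# real: 4.84x
# user: 5.69x
```

The user time exceeds the wall-clock time in both runs, which suggests that the work was spread over more than one CPU core.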