Similarity searching finds molecules that are similar to the query structure. The similarity is calculated on the basis of the molecular descriptors or fingerprints of the chemical structures being compared. A molecular descriptor is a set of values associated with the structure of a molecule. The term "molecular descriptor" covers all kinds of structural keys, hashed fingerprints, binary fingerprints, different types of pharmacophore fingerprints, and scalar values. Various metrics are available for the calculation.
JChem supports similarity search both in files and in databases, and provides several hit display options for visualizing hit sets.
The JChem Base product contains two types of built-in fingerprint methods: chemical hashed fingerprints for molecules and reaction fingerprints for reactions. It also allows the use of further molecular descriptors/fingerprints, such as extended connectivity fingerprints, pharmacophore fingerprints, and Burden eigenvalue descriptors, and it supports user-defined custom molecular descriptors.
Similarity search in files always runs on the basis of the built-in chemical hashed fingerprints for molecules and reaction fingerprints for reactions.
Similarity search in a database can be run not only on the basis of chemical hashed fingerprints, but also on the basis of other screen methods. Which other screen methods can be applied depends strongly on the platform used.
JChem Base product contains the following built-in fingerprint-based screen methods available for similarity search:
The Chemical Hashed Fingerprints and Reaction Fingerprints are automatically generated in the JChem tables during the data import process; similarity search uses these generated fingerprints by default.
The built-in fingerprints are generated on-the-fly if similarity search is executed in files.
The following descriptor-based screen methods are available for similarity search:
To generate the fingerprints/descriptors listed above, you have to add them to the tables concerned and let the tables be recalculated. See the Administration guide of JChem Manager on how to modify a JChem table by adding molecular descriptors. Molecular descriptors can be generated on the basis of molecular descriptor configuration files. See an example configuration file.
Further information about generating molecular descriptors in JChem Cartridge can be found in Generic Molecular Descriptor support in JChem Cartridge; the GenerateMD application is described in its documentation.
Molecules in a low-energy conformation serve as input for 3D similarity comparison. A 3D conformation may be obtained with ChemAxon's Clean3D tool.
Screen3D evaluates the 3D Tanimoto similarity between pairs of molecules by maximizing the intersection of their volumes. The volume may be colored by pharmacophoric properties or by atom types. During the calculation the structures are translated and rotated, and the rotatable bonds are tweaked. See example here.
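The idea of a volume-based Tanimoto score can be illustrated with a minimal sketch. It assumes the commonly used shape-Tanimoto definition (overlap volume divided by the volume of the union) and assumes that the molecular volumes stay fixed across the candidate alignments; the exact scoring used by Screen3D is not specified here, and the function names and numbers below are hypothetical.

```python
def volume_tanimoto(vol_a: float, vol_b: float, vol_overlap: float) -> float:
    """Shape-Tanimoto score: overlap volume divided by the union volume."""
    union = vol_a + vol_b - vol_overlap
    if union <= 0:
        raise ValueError("union volume must be positive")
    return vol_overlap / union


def best_alignment_score(vol_a: float, vol_b: float, overlaps: list) -> float:
    """Score of the best alignment among several candidate overlap volumes.

    With fixed molecular volumes the score grows with the overlap, so the
    alignment with the largest intersection also gives the largest score.
    """
    return max(volume_tanimoto(vol_a, vol_b, v) for v in overlaps)


# Hypothetical volumes and overlap volumes for three trial alignments.
score = best_alignment_score(100.0, 80.0, [30.0, 55.0, 60.0])
print(score)
```

In this sketch the translation, rotation, and bond tweaking are represented only by the list of overlap volumes they would produce.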
Descriptor-based similarity search can be sped up by caching the descriptor data. See details of caching.
JChem also allows user-defined custom descriptors/fingerprints to be applied. See the details in the Custom descriptor implementation section of the JChem developer's guide, in Generic Molecular Descriptor support in JChem Cartridge, and in the related documentation.
JChem provides various metrics to compute the value of similarity or dissimilarity. Some metrics (for example Tanimoto) provide similarity values, while others (for example Euclidean) provide dissimilarity values. The values calculated with the metrics listed in the table below (with the exception of Euclidean) range from 0 to 1. The similarity value (S) can be calculated from the dissimilarity value (D): S = 1 - D (with the exception of the Euclidean metric).
The larger the value of the dissimilarity coefficient, the greater the difference between the two structures.
Note: different molecules may have a dissimilarity of 0 if their descriptors are the same.
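The relation between similarity and dissimilarity can be illustrated with a minimal sketch. It assumes binary fingerprints represented as sets of on-bit positions, and it uses the textbook Tanimoto and Euclidean definitions, which may differ in detail from JChem's implementation. The fingerprints and function names are hypothetical.

```python
import math


def tanimoto_dissimilarity(fp_a: set, fp_b: set) -> float:
    """Tanimoto dissimilarity (1 - similarity) of two binary fingerprints,
    each given as the set of its on-bit positions."""
    union = fp_a | fp_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 0.0
    return 1.0 - len(fp_a & fp_b) / len(union)


def euclidean_distance(fp_a: set, fp_b: set) -> float:
    """Euclidean distance of two binary fingerprints: the square root of the
    number of bit positions in which they differ. Not limited to [0, 1]."""
    return math.sqrt(len(fp_a ^ fp_b))


# Hypothetical on-bit sets; they are not real JChem fingerprints.
fp_x = {1, 4, 7, 9}
fp_y = {1, 4, 7, 9}  # a different molecule that happens to set the same bits
fp_z = {1, 4, 8}

d_xy = tanimoto_dissimilarity(fp_x, fp_y)  # identical descriptors
d_xz = tanimoto_dissimilarity(fp_x, fp_z)  # 2 shared bits out of 5 distinct bits
s_xz = 1.0 - d_xz                          # S = 1 - D
e_xz = euclidean_distance(fp_x, fp_z)      # 3 differing bits

print(d_xy, round(d_xz, 3), round(s_xz, 3), round(e_xz, 3))
```

Here `d_xy` is 0 although the two fingerprints stand for different molecules, and `e_xz` exceeds 1, which is why S = 1 - D is not applied to the Euclidean metric.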
The table below lists the dissimilarity metrics and their general formulas; the representation of the formulas is based on finite length binary fingerprints.
number of bits set in the fingerprint of molecule A
number of bits set in the fingerprint of molecule B
number of bits set in the fingerprint of both molecules A and B
α: coefficient representing the weight of the properties of molecule A; its value is between 0 and 1
β: coefficient representing the weight of the properties of molecule B; its value is between 0 and 1
Substructure (extreme case of Tversky: α=0, β=1): molecule B as substructure of molecule A
Superstructure (extreme case of Tversky: α=1, β=0): molecule B as superstructure of molecule A
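The Tversky extreme cases can be checked with a minimal sketch based on the bit counts described above. It assumes the standard Tversky index; the names `a`, `b`, and `c` for the three bit counts are chosen for this illustration only, and the exact formula used by JChem should be taken from its metric table.

```python
def tversky_similarity(a: int, b: int, c: int, alpha: float, beta: float) -> float:
    """Standard Tversky index on bit counts.

    a: number of bits set in the fingerprint of molecule A
    b: number of bits set in the fingerprint of molecule B
    c: number of bits set in the fingerprints of both A and B
    alpha, beta: weights of the bits unique to A and to B
    """
    denominator = alpha * (a - c) + beta * (b - c) + c
    if denominator == 0:
        # Nothing shared and nothing weighted: defined as 0 in this sketch.
        return 0.0
    return c / denominator


# Hypothetical counts: every bit of B is also set in A (B "fits into" A).
a, b, c = 6, 3, 3

substructure = tversky_similarity(a, b, c, alpha=0.0, beta=1.0)    # ignores bits unique to A
superstructure = tversky_similarity(a, b, c, alpha=1.0, beta=0.0)  # ignores bits unique to B
tanimoto = tversky_similarity(a, b, c, alpha=1.0, beta=1.0)        # symmetric case

print(substructure, superstructure, tanimoto)
```

With α=0 and β=1 the score reaches 1 whenever all bits of B are contained in A, which matches the reading "molecule B as substructure of molecule A"; with α=β=1 the index reduces to the Tanimoto coefficient.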
Two types of reaction similarity calculation have been introduced: structural and transformational. Structural similarity distinguishes between the reactant and the product sides, while transformational similarity is defined at three levels of coarseness. This gives five categories of reaction similarity (two structural and three transformational), and one metric is needed to estimate each of them. These metrics are as follows:
See details of Reaction fingerprint metrics
The visualization of the search results is a very important feature of the JChem products. Hit display options in similarity search involve coloring of maximum common substructures and the selection of what to display (similarity or dissimilarity score, query structure, other labels and boxes). See the description of hit display options available in similarity search here.
By default, similarity search is performed on the basis of the built-in molecular descriptors (chemical hashed fingerprints for molecules and reaction fingerprints for reactions). As described above, if you want similarity search to be performed on the basis of other molecular descriptors/fingerprints, you have to add these descriptors to the target tables concerned before the search.
For the execution of similarity search you have to specify:
Example of running similarity search in command line with jcsearch
The table below illustrates a similarity search performed with the default search options, the built-in Tanimoto metric, and a dissimilarity threshold of 0.9. The similarity score is shown, and the maximum common substructures are highlighted in the hit structures. The search was performed in a file with jcsearch:
Similarity score based on CFP/ Hits