Comparing libraries with Compr
This manual page describes how to compare libraries with the Compr tool.
Introduction
Compr compares two sets of objects (such as compound libraries) using diversity and dissimilarity calculations. Typical uses:
- Comparing all individual compounds of a set with a library.
- Comparing two libraries.
- Library self-dissimilarity test: comparing all individual compounds of a set with the rest of the compounds.
This document mentions molecules as the entities to be compared, but the software can also be used for other types of objects.
The algorithm applies nearest neighbor searching that finds molecules similar to the query object. The calculation applies the *Tanimoto (or Jaccard) coefficient*, which is calculated by the following formula in the case of binary fingerprint (bit string) input:
$$T(A,B) = \frac{N_{A \& B}}{N_A + N_B - N_{A \& B}}$$
where $N_A$ and $N_B$ are the numbers of bits set in the bit strings of molecules A and B, respectively, and $N_{A \& B}$ is the number of bits that are set in both.
When only binary fingerprints are used to calculate the dissimilarity between molecules, the dissimilarity of molecules A and B is
$$D(A,B) = 1 - T(A,B)$$
where $T(A,B)$ is the Tanimoto coefficient of molecules A and B.
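The two formulas above can be sketched in a few lines of Python. This is only an illustration, not Compr's implementation: the function names are made up here, fingerprints are held as non-negative Python integers, and treating two empty fingerprints as identical is an assumption.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit strings stored as non-negative ints."""
    n_a = bin(fp_a).count("1")          # bits set in A
    n_b = bin(fp_b).count("1")          # bits set in B
    n_ab = bin(fp_a & fp_b).count("1")  # bits set in both A and B
    union = n_a + n_b - n_ab
    # Assumption: two empty fingerprints count as identical (T = 1).
    return 1.0 if union == 0 else n_ab / union


def dissimilarity(fp_a: int, fp_b: int) -> float:
    """D(A,B) = 1 - T(A,B) for binary fingerprints only."""
    return 1.0 - tanimoto(fp_a, fp_b)


a = 0b10110  # 3 bits set
b = 0b00111  # 3 bits set, 2 of them shared with a
print(tanimoto(a, b))       # 2 / (3 + 3 - 2) = 0.5
print(dissimilarity(a, b))  # 0.5
```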
When other columns are also used, a weighted Euclidean distance calculation is applied:
$$D(A,B) = \sqrt{[1 - T(A,B)] + w_1 [C_1(A) - C_1(B)]^2 + w_2 [C_2(A) - C_2(B)]^2 + \dots}$$
where
- $w_1, w_2, \dots$ are weights
- $T(A,B)$ is the Tanimoto coefficient of molecules A and B
- $C_i(A)$ is the value of descriptor $i$ of molecule A
- $\sqrt{\cdot}$ is the square root function.
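As a rough sketch of the weighted formula, the following Python function combines the Tanimoto term with the weighted squared descriptor differences exactly as the formula is written above. The function and parameter names are hypothetical, and the handling of empty fingerprints is an assumption.

```python
import math


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit strings stored as non-negative ints."""
    n_ab = bin(fp_a & fp_b).count("1")
    union = bin(fp_a).count("1") + bin(fp_b).count("1") - n_ab
    return 1.0 if union == 0 else n_ab / union  # assumption for empty inputs


def combined_dissimilarity(fp_a, fp_b, desc_a, desc_b, weights):
    """sqrt{[1 - T(A,B)] + sum_i w_i * [C_i(A) - C_i(B)]^2}"""
    total = 1.0 - tanimoto(fp_a, fp_b)  # fingerprint term, implicit weight 1.0
    for w, c_a, c_b in zip(weights, desc_a, desc_b):
        total += w * (c_a - c_b) ** 2   # weighted squared descriptor difference
    return math.sqrt(total)


# Identical fingerprints, so only the descriptor terms contribute: sqrt(9 + 16).
d = combined_dissimilarity(0b1011, 0b1011, [1.0, 2.0], [4.0, 6.0], [1.0, 1.0])
print(d)  # 5.0
```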
Instead of using a brute-force method, Compr applies heuristics to avoid calculating all pairwise dissimilarity values and comparing all neighbor lists.
Usage
You can run Compr with the following command:
Prepare the usage of the compr script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts, or call the Compare class directly. This can be done in the following ways.
- Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
- Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
Because the utility has many parameters, it may be reasonable to create a shell script or a batch file for calling the software.
Options
Without a valid license key, the software runs in demo mode and a maximum of 1000 structures can be retrieved from the database.
Loading/Saving of Settings
It would be inconvenient to enter all of the parameters of the compr script at each run. To overcome this problem, settings that are not changed frequently can be saved in the .jchem file stored in the user's home directory. Use the --saveconf option to store the following settings:
- JDBC driver's class name (--driver)
- JDBC URL of database (--dburl)
- Login name (--login)
- Password (--password)
- Fingerprint size (--fingerprint-size)
The settings needed for the database connection are also modified and saved by JChemManager. If you have already logged in to the database successfully using JChemManager, you don't need to set the connection parameters for Compr manually.
Database Connections
For more information on setting the connection parameters:
- JDBC driver's class name (--driver)
- JDBC URL of database (--dburl)
- Login name (--login)
- Password (--password)
- Property table (--proptable)

please visit the Administration Guide of JChem.
Input
Compr reads two data sources, which contain the molecule libraries to be compared. The software can import data from either text files (--input) or database tables (--query). Each input data source must contain the following columns:
| Columns | Type | Content |
|---|---|---|
| Id | Integer numbers | Id of compounds (Optional in text files) |
| fp1, fp2, fp3, ... | Integer numbers | Binary fingerprints in integer number blocks. The number of fp columns is the fingerprint length / 32. (Optional) |
| d1, d2, d3, ... | Floating point numbers | Other descriptors (Optional) |
Comments:
- Pharmacophore fingerprints can be generated using the GenerateMD tool. These fingerprints are not binary, so they have to be specified as other descriptors. See an example for combining GenerateMD and Jarp using pharmacophore fingerprints.
- At least one binary fingerprint column or descriptor column is required.
- Use the --generate-id option if the id column is missing from the input data.
- Text input files can be created using the GenerateMD application. For example
  A sample XML configuration file (cfp.xml) can be found in the config directory under the examples directory.
- In the case of text input, the delimiter between two numbers should be space or tab (comma is not allowed).
- The cd_id and cd_fpi columns in JChem's structure tables are appropriate as input.
- In the case of database input, an SQL select statement is needed to retrieve the columns. For example
  (For the sake of readability, only 6 fp columns are used in the above example, but usually this number is much higher.)
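To make the text input layout more concrete, here is a small Python sketch that parses one whitespace-delimited row into an id, a fingerprint, and a descriptor list. It is not part of Compr: the function name, the column order (id, then fp blocks, then descriptors, as in the table above), and the assumption that fingerprint blocks may appear as signed 32-bit integers are all illustrative guesses.

```python
def parse_row(line, fingerprint_size, dimensions, has_id=True):
    """Split one text row into (id, fingerprint as int, descriptor list)."""
    fields = line.split()            # space or tab delimited; commas not allowed
    n_fp = fingerprint_size // 32    # number of 32-bit fingerprint columns
    pos = 0
    row_id = None
    if has_id:
        row_id = int(fields[0])
        pos = 1
    # Assumption: blocks may be signed 32-bit values; map them to unsigned.
    blocks = [int(v) & 0xFFFFFFFF for v in fields[pos:pos + n_fp]]
    pos += n_fp
    descriptors = [float(v) for v in fields[pos:pos + dimensions]]
    fingerprint = 0
    for block in blocks:             # concatenate the blocks into one bit string
        fingerprint = (fingerprint << 32) | block
    return row_id, fingerprint, descriptors


row_id, fp, desc = parse_row("17\t-1 5 0.25 3.5", fingerprint_size=64, dimensions=2)
print(row_id, bin(fp).count("1"), desc)  # 17 34 [0.25, 3.5]
```

The block order does not affect the Tanimoto coefficient, since only the bit counts matter.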
Output
The software can write the results of the calculation into either a text file (--output) or a database table (--statement).
Each row of the output belongs to a compound in the second set. The following symbols are used in the description of columns:
| Symbol | Meaning |
|---|---|
| $L_1$, $L_2$ | The compound libraries, in the order in which they are specified in the call of Compr |
| $C$ | The compound from $L_2$ to which the specific row belongs |
| $D(C, A_i)$ | The dissimilarity between $C$ and compound $i$ from $L_1$ |
The exported data contains a subset of the following columns:
| Columns | Type | Content |
|---|---|---|
| Id | Integer numbers | The identifier of $C$ |
| minD | Floating point numbers | The dissimilarity of $C$ and its nearest neighbor from $L_1$: $\min(D(C, A_i))$ |
| nneib | Integer numbers | The nearest neighbor of $C$ in $L_1$ |
| simcnt | Integer numbers | The number of objects in $L_1$ that are similar to $C$. (Their dissimilarity is lower than or equal to the specified threshold value.) |
| avgD | Floating point numbers | The average dissimilarity of $C$ and all compounds from $L_1$: $\sum_i D(C, A_i) / N(L_1)$ |
| maxD | Floating point numbers | The maximum dissimilarity between $C$ and all compounds from $L_1$: $\max(D(C, A_i))$ |
| list_of_similar_objects | Integer numbers in several columns | The list of molecules with a dissimilarity below the threshold (neighbors) in one row. |
Only the column "Id" is printed if the --different-ids option is specified. "avgD" and "maxD" columns are written only if either the --statistics or the --only-statistics option is given. The "list_of_similar_objects" columns are printed if the --list-similar option is specified.
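The meaning of the output columns can be illustrated with a brute-force Python sketch that computes minD, nneib, simcnt, avgD, and maxD for a single compound C against a small first library. Compr itself uses heuristics rather than this exhaustive loop; the function name, the dictionary layout, and the sample fingerprints are assumptions made for the example.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit strings stored as non-negative ints."""
    n_ab = bin(fp_a & fp_b).count("1")
    union = bin(fp_a).count("1") + bin(fp_b).count("1") - n_ab
    return 1.0 if union == 0 else n_ab / union  # assumption for empty inputs


def compare_row(c_fp, library1, threshold):
    """Brute-force output row for compound C against library1 (id -> fingerprint)."""
    distances = {a_id: 1.0 - tanimoto(c_fp, a_fp) for a_id, a_fp in library1.items()}
    nneib = min(distances, key=distances.get)  # id of the nearest neighbor
    values = list(distances.values())
    return {
        "minD": distances[nneib],
        "nneib": nneib,
        "simcnt": sum(d <= threshold for d in values),  # similar objects in L1
        "avgD": sum(values) / len(values),
        "maxD": max(values),
    }


library1 = {1: 0b001111, 2: 0b000011, 3: 0b110000}
row = compare_row(0b000111, library1, threshold=0.5)
print(row["nneib"], row["minD"], row["simcnt"], row["maxD"])  # 1 0.25 2 1.0
```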
Comments for database output:
- A precondition of database output is the existence of a database table that contains the above columns. Create the database table before starting the calculation.
  Examples for table creation:
  - If the result will contain the id values of dissimilar objects (the --different-ids option is specified):
    ```
    CREATE TABLE compr_result (
        cd_id INTEGER NOT NULL PRIMARY KEY)
    ```
  - If the result will contain objects of the second library that are similar to objects in the first library (none of --different-ids, --statistics, and --only-statistics is specified):
    ```
    CREATE TABLE compr_result (
        cd_id INTEGER NOT NULL PRIMARY KEY,
        minD FLOAT,
        nneib INTEGER,
        simcnt INTEGER)
    ```
  - If the result will contain all objects with all details (--statistics is specified):
    ```
    CREATE TABLE compr_result (
        cd_id INTEGER NOT NULL PRIMARY KEY,
        minD FLOAT,
        nneib INTEGER,
        simcnt INTEGER,
        avgD FLOAT,
        maxD FLOAT)
    ```
- Before starting the calculation, make sure that the table is empty. The SQL DELETE statement can be used to delete the rows of a database table.
  Example for deleting all rows:
- In the case of database output, an SQL statement that inserts the rows containing the results must be specified for Compr (-a option).
  Example:
  The "?" symbols will be substituted with the corresponding values.
- Once the table is filled with the results, the rows can be retrieved using SQL SELECT statements.
  Example:
- It is important to place the SQL statement between quotes because it contains spaces.
- In the case of database output, don't use the --list-similar option, because then the number of columns is not fixed.
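The database workflow described in these comments (create the table, empty it, insert result rows through a statement with "?" placeholders, then query it) can be tried out with Python's built-in sqlite3 module. This is a stand-in sketch only: Compr connects through JDBC and performs the substitution itself, and the exact statements as well as the sample result values below are invented for illustration. Only the compr_result table layout is taken from the table creation examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Table layout for the default output (no statistics columns).
conn.execute(
    "CREATE TABLE compr_result ("
    " cd_id INTEGER NOT NULL PRIMARY KEY,"
    " minD FLOAT, nneib INTEGER, simcnt INTEGER)"
)

# Make sure the table is empty before a run.
conn.execute("DELETE FROM compr_result")

# Insert statement with one '?' placeholder per output column.
insert_sql = "INSERT INTO compr_result (cd_id, minD, nneib, simcnt) VALUES (?, ?, ?, ?)"
results = [(101, 0.12, 7, 3), (102, 0.40, 9, 1)]  # made-up result rows
conn.executemany(insert_sql, results)
conn.commit()

# Retrieve the compounds whose nearest neighbor is closer than 0.2.
rows = conn.execute(
    "SELECT cd_id, minD FROM compr_result WHERE minD < 0.2 ORDER BY cd_id"
).fetchall()
print(rows)  # [(101, 0.12)]
```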
Diversity Statistics
Optionally, Compr can print diversity statistics to the standard output or to the given output file. Statistics printing is enabled by the --statistics or the --only-statistics option. (The latter suppresses the information on individual compounds.) The following data are printed:
- Number of objects in set 1
- Number of objects in set 2
- Diversity measures:
  - minimum dissimilarity between sets
  - average dissimilarity between sets
  - maximum dissimilarity between sets

The calculation is significantly slower if statistics are enabled, since all pairwise dissimilarity values have to be calculated. (Heuristics cannot be applied.)
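The cost of statistics can be seen in a brute-force sketch. The Python function below takes one plausible reading of the three diversity measures, namely the minimum, average, and maximum over all pairs of one compound from each set; the exact definitions used by Compr are not stated here, so treat this as an assumption. The function name and the sample fingerprints are hypothetical.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit strings stored as non-negative ints."""
    n_ab = bin(fp_a & fp_b).count("1")
    union = bin(fp_a).count("1") + bin(fp_b).count("1") - n_ab
    return 1.0 if union == 0 else n_ab / union  # assumption for empty inputs


def set_statistics(library1, library2):
    """Min, average, and max dissimilarity over ALL pairs (assumed definition)."""
    values = [1.0 - tanimoto(a, c) for c in library2 for a in library1]
    return min(values), sum(values) / len(values), max(values), len(values)


library1 = [0b1111, 0b0011]
library2 = [0b0111, 0b1100]
min_d, avg_d, max_d, n_pairs = set_statistics(library1, library2)
print(min_d, max_d, n_pairs)  # 0.25 1.0 4
```

The number of evaluated pairs is the product of the two set sizes, which is why no shortcut is available once every pairwise value is needed.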
Comments on Some Parameters
--fingerprint-size
The number of binary fingerprint columns multiplied by 32 (because the bit-length of integer numbers is 32 in Java).
--dimensions
Specifies the number of other columns. If only binary fingerprints are used in the calculation, then this parameter doesn't have to be set.
--weights
When other columns are used, a weighted Euclidean distance calculation may be applied. If there are also binary fingerprint columns, weights are relative to the Tanimoto coefficient calculated from the binary fingerprints (the Tanimoto coefficient has a weight of 1.0).
--threshold
Compounds with a dissimilarity below the threshold will be considered similar.
--different-ids
Compounds with the same id will not be compared. This is useful in any of the following cases:
- The two compound sets are not disjoint (some of the compounds are the same).
- Self-dissimilarity is tested, that is, the two compound sets are the same.

This option requires that identical compounds have identical id values in both sets.
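The effect of skipping compounds with the same id can be sketched with a brute-force nearest neighbor search in Python. The function and its different_ids parameter are hypothetical names that only mimic the behavior described for --different-ids; Compr's own search is heuristic.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit strings stored as non-negative ints."""
    n_ab = bin(fp_a & fp_b).count("1")
    union = bin(fp_a).count("1") + bin(fp_b).count("1") - n_ab
    return 1.0 if union == 0 else n_ab / union  # assumption for empty inputs


def nearest_neighbors(library1, library2, different_ids=False):
    """For each compound of library2, return (minD, nneib) found in library1."""
    result = {}
    for c_id, c_fp in library2.items():
        best = None
        for a_id, a_fp in library1.items():
            if different_ids and a_id == c_id:
                continue  # do not compare a compound with itself
            d = 1.0 - tanimoto(c_fp, a_fp)
            if best is None or d < best[0]:
                best = (d, a_id)
        result[c_id] = best
    return result


library = {1: 0b001111, 2: 0b000111, 3: 0b110000}
print(nearest_neighbors(library, library)[1])                      # (0.0, 1)
print(nearest_neighbors(library, library, different_ids=True)[1])  # (0.25, 2)
```

Without the id check, a self-dissimilarity test is uninformative, because every compound finds itself as its nearest neighbor at distance 0.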
By default, the heap size in some Java runtime environments is limited to 64MB, so you may run out of memory easily. See the FAQ on increasing the heap size.
Examples
The examples assume that, when needed, all connection parameters have been set and stored by JChemManager (or saved by a previous run of Compr).
- A batch file (Windows) for reading from a database and writing the ids of all compounds in the commercial_library table that are similar to the structures in the home_library table to the standard output:
- A UNIX shell script for reading binary fingerprints from two database tables (home_library and commercial_library) and inserting the results into another table. It collects ids and calculates dissimilarity information on structures in the commercial_library that are similar to some structures in the home_library:
  Make sure that the simil table exists and is empty before running the script.
- Full statistics calculation using the output of GenerateMD (generated id values are needed):
  Example for the result file:
- Self-dissimilarity test of the home library:
- Displaying the structures and the results of a full statistics calculation using the CreateView and MarvinView applications:
  - Creating an SDfile containing the calculated data (the minD, avgD, and simcnt columns) and the structures:
    ```
    crview -i id -d "minD:avgD:simcnt" -s commercial_library.sdf -t compr_result.tab > compr_result.sdf
    ```
