The chemical hashed fingerprint of a molecule is bit string (a sequence of 0 and 1 digits) that contains information on the structure. Chemical hashed fingerprints are mostly used in the following areas:
At both applications a proper configuration of the fingerprint is very important:
ChemAxon provides the GenerateMD program for generating binary fingerprints that can be processed further. This program can also be applied to fine-tune fingerprint parameters for JChem.
The process of fingerprint generation goes as follows:
Fig. 1 Chemical Hashed Fingerprint generation process
The number of bits in the bit string.
The maximum length of atoms in the linear paths that are considered during the fragmentation of the molecule. (The length of cyclic patterns is limited to a fixed ring size.)
After detecting a pattern, some bits of the bit string are set to "1". The number of bits used to code patterns is constant.
The percentage of 1 digits in the bit string. We consider fingerprints with more ones "darker" than those with less ones.
In addition, it decreases fingerprint darkness, and therefore the probability of bit collisions also, which is beneficial.
A too long bit string may decrease the efficiency of information storage. We found that a length of 512 bits (64 bytes) worked well for small and huge databases as well. However, in similarity calculations longer fingerprints exhibit better performance in terms of selectivity (that is, distinguishing similar but not identical compounds). This is important in similarity based virtual screening as well as in similarity based clustering. In such applications 1024 bits usually provide better results.
The effects are:
Substructure searching performs well with 5-6 long patterns. In similarity searching, however, longer patterns may be required, 7 is usually good value, and path longer than 8 seldom improve results. Also bear in mind that longer paths necessitates longer fingerprint to avoid too dark fingerprints.
The effects are:
Typical value is 2. Database pre-filtering for substructure searching does not require larger bit-count than 2, and this allows shorter fingerprints that is usually beneficial both in terms of storage space requirement and retrieval time. Though higher values could enable better separation of similar but not identical compounds thus leading to less frequent call of atom-by-atom matching in substructure searching, but only with the expense of doubled storage space and thus slower retrieval and more time consuming fingerprint comparison which is significantly more frequent procedure than the atom-by-atom searching.
Again, the situation is somewhat different in similarity searching, yet values higher than 2 rarely increase the amount of information represented by the fingerprint significantly (as the 3rd, 4th etc bits are more correlated with the other two, while 1st and 2nd are highly independent).
To choose optimal parameters for your compounds, running GenerateMD with the --stat option is recommended, or the use of JChem table or index statistics. (See more information in the JChem Manager command line usage (s command), at the Cartridge index statistics function and the Statistics tab at Instant JChem Schema editor.) These tools provide some practical information on the database (average/minimum/maximum "darkness", distribution, etc.).
Maximum darkness should not be higher than 80% (other sources/users say 2/3, ie. 67%). Otherwise, the information content of the individual fingerprint is decreased, and thus in similarity searching, for instance, similar though not identical compounds cannot be distinguished. Even a few too dark fingerprints also decrease screening efficiency at structure searching and consequently atom-by-atom search is unnecessarily often performed on the records with the dark fingerprints, even when target structures do not contain the given query structure.
The average darkness highly depends on the application and the particular data set (e.g. total diversity highly influences fingerprint darkness). In theory the information content is optimal at an average darkness of 50%, though in general, darkness should not exceed 40% to be on the safe side (to avoid frequent collisions).
The following statistics output shows an optimal fingerprint configuration, as generated by JChem table statistics function:
Statistics for table: APP.SCREENING_COLLECTION -------------------- Row count: 58850 NULL SMILES count: 0 Average SMILES length: 40.09 Average compressed SMILES length: 20.34 Markush structure count: 0 (0.0%) Fingerprint settings: Length (bits): 768 Pattern length: 6 Bits set per pattern: 2 Min. CFP darkness: 4.03% cd_id: 26456 Max. CFP darkness: 69.66% cd_id: 20757 Avg. CFP darkness: 33.57% Chemical Fingerpint distribution: -------------------------------- 0% - 10% : 0.6 % 10% - 20% : 8.54 % 20% - 30% : 28.8 % 30% - 40% : 34.92 % 40% - 50% : 21.42 % 50% - 60% : 5.27 % 60% - 70% : 0.42 % 70% - 80% : 0.0 % 80% - 90% : 0.0 % 90% - 100% : 0.0 %
The following graphs show the dependence of