Performance Tips

Improving Performance with Large Data Sets

Here are some tips for maximizing performance with large data sets:

Import smiles format files rather than Marvin or SDF, as smiles is much more compact.
Avoid sorting large data sets. Specify a better query to reduce the size first!
Close any unused database connections and restart IJC - databases that have been opened will be using up valuable memory.
Close any unused views of other database tables from the database connection you are using, then restart IJC. Added views can also slow performance by using up memory.
Close all data windows before you shutdown IJC. That way they will not need to be restored when you restart, and startup time will be better.
Use a remote database (Oracle of MySQL) rather than a local database (this requires an Instant JChem license). For information on purchasing a license, please email: sales at chemaxon dot com.
Use Oracle cartridge tables if you are running Oracle and have a license.

Database tuning

Serious databases need serious skills to optimise. This is really beyond scope of these documents, and if you have needs here you should employ a good DBA. But here are a few tips that we have come across:

MySQL

Buffer Pool Size

For large databases you may need to adjust some settings for the MySQL server. Most important of these is probably the Buffer Pool Size parameter for InnoDB tables. This affects the amount of information that is held in memory by MySQL. Ideally all the indexes that are in common use should fit inside this buffer. For large databases (many millions of rows) you may benefit from increasing this parameter, and for a dedicated database server you may want a significant amount of you available RAM set aside to this.

Use cursors

By default MySQL retrieves the complete result set before sending it to the client. It does this to optimise performance by avoiding the need to go back to the database for more data. Most other databases use a cursor, where only a small number of rows are actually retrieved, and to get all the rows means going back to the database for more, which can result in many round trips to the database. Which approach is faster depends on the exact usage and there are no hard and fast rules.

IJC generally breaks data retrieval into small chunks anyway, so the default MySQL approach is usually fine. However there is one exception, and that is retrieval of the hit list when retrieving initial data or running queries. For very large tables (millions of rows) the hit list can be large, and MySQL can take a long time to build it all. In these cases using a cursor can improve performance significantly. MySQL can be told to use a cursor by changing the JDBC parameters. This is done by changing the JDBC URL that is defined as part of the connection parameters (r-click on the IJC schema node in the projects window and choose 'Schema settings...'). Change the JDBC URL parameter and append the appropriate parameter to it. An example is:


jdbc:mysql://mydb.mycompany.com/chembl_06?useCursorFetch=true&defaultFetchSize=1000

The


useCursorFetch=true

parameter tells MySQL to use a server based cursor, and the


defaultFetchSize=1000

parameter tells it to fetch 1000 rows at a time. You may want to experiment with this value to see if higher or lower values work best for you.

See here for more details about the MySQL JDBC driver properties.

If you have any other database tuning tips that you think would be useful then please let us know and we will add them here.

Running Java in server mode

Some versions of Java can operate in a "server" mode which can improve performance, but at the expense of start up time. This is mainly designed for long running server process which start once and then run for a long time.

Tests with Instant JChem have shown that running in server mode can provide a significant performance gain, particularly for structure searches and chemical terms calculations (approximately 20% faster), but this comes at the expense of slower startup time (approx 2-3 times slower).

If you wish to try this you can edit the IJC startup script to use a version of Java that supports server mode and specify the -server flag :

{primary} The version of Java that is installed with IJC does not support server mode

Close Instant JChem
Find where Instant JChem was installed, and find the file [instant-jchem-home]/etc/instantjchem.conf
Make a backup of this file!
Open this file with a text editor
Find the line that defines the 'default_options' property. Edit this line, adding the flag -J-server
Uncomment the line with the jdkhome= definition and edit this to point to a Java 1.6 JDK that supports the server mode (e.g. a standard Java 1.6 JDK).
Save the changes and restart Instant JChem.

Indexes and performance

In the right circumstances adding an index to your database table can provide a significant performance benefit, but if added inappropriately this can actually slow things down. Some indexes are added automatically for you where they are necessary. Others you need to add yourself. Generally IJC performs very well without additional indexes so you should not add any unless you know you have a performance problem that can be solved by adding an index, and you should test the performance with and without the index for a range of common queries to test that adding and index really does provide a benefit. Indexes will typically be of little benefit with small tables. Also, bear in mind that what works for one database type may not work for a different database type e.g. what works for a local Derby database may not work for an Oracle database.

For information on how to add and delete indexes in IJC see the schema editor documentation

1. Situations where adding an index is strongly recommended

In these situations adding an index will always be expected to be of big benefit:

When setting the 'Distinct values' setting for a field. This will dramatically improve the retrieval of these distinct values e.g. for display in the query panel.
When merging data using a field that does not already have an index. Merging needs the value to be looked up for each imported entry, and an index will dramatically speed this up.

2. Situations where adding an index may be beneficial

In these situations adding an index may be of benefit (but be careful to consider the potential downside listed in the next section):

When regularly using a field in a large table for querying or sorting adding an index to the columns used by the field will usually improve performance.

3. Situations where adding an index should be avoided

In these situations adding an index may have a detrimental effect:

When the data in the table is frequently being inserted, updated or deleted. When this happens the index also has to be updated and this takes additional time.
For data where the number of distinct values present is not much less that the total number of rows. e.g. for continuously variable numeric data such as molecular weight.

Covering indexes

The nature of typical searches in IJC warrants a special mention of what are sometimes called 'covering indexes'. Most searches on non-structure fields in IJC end up as looking like this at the database level:


select cd_id from jchem_table where cd_molweight < 300

(we assume a JChem table here for simplicity, and assume that we are querying on the molweight field, but the same applies to many searches). In this type of search there are a very large number of different values for the molweight, almost as many as there are rows. Adding an index to this type of field can make many searches slower not faster, especially for queries that return lots of rows (like the one above). The reason is that the index will be consulted for the values that match, but the table has then to be consulted for the cd_id value, and this results in an excessive amount of IO operations, and leads to worse performance than if the index was not present at all.

A solution that can remedy this situation is to add the cd_id column to the index as a second column. Indexes can be composed of multiple columns, and in this case you would create an index with the cd_molweight column as the first column and the cd_id column as the second column. This is often referred to as a 'covering index'. With this type of index the cd_id values can be obtained directly from the index, and the main table does not need to be consulted, resulted in significantly less IO and much better performance.

However it must be stressed that whilst this solution sometimes works well it should be tested in every case to check that it is a good solution for the typical queries that will be run.

Do not use the 'covering index' approach if the column you are indexing has a small number of distinct values (e.g. a supplier name or id). Only use it when most values in the column are different. For columns having a small number of distinct values try using a normal index with just the single column.