The following software versions were used at testing:
Put your valid Chemaxon license file to the /etc/chemaxon/
folder on each node.
sudo service jchem-psql init
sudo service jchem-psql start
Create the file pg_worker_list.conf
in the master node’s PostgreSQL data directory (the directory declared in the postgresql.conf
file as data_directory
) and add the worker’s hostname and PostgreSQL port setups to this file, like:
worker-host1 5432
worker-host2 5421
(more workers)
SELECT * FROM master_get_active_worker_nodes();
In the example below the table is created using hash distribution, it has four shards and has one replica for each shard.
CREATE TABLE mol_table(id int, mol molecule('sample'));
SELECT master_create_distributed_table('mol_table', 'id', 'hash');
SELECT master_create_worker_shards('mol_table', 4, 1);
In a command line shell:
cat -n nci-pubchem_1m_unique.smiles | sed -e 's/^[ ]*//' | sed -e 's/^[0-9]*/&,/' | sed -e 's/[ ]*//g' > nci1m_with_id.smiles
The script below cuts the original CSV file to 64 pieces and imports them to the created table parallely. This script has to be executed from command line as postgres
user.
split -n l/64 nci1m_with_id.smiles chunks/
find chunks/ -type f | xargs -n 1 -P 64 sh -c 'echo $0 `copy_to_distributed_table -C $0 mol_table`'
The Citus 5.1 documentation states that the PostgreSQL COPY
command is also supported and inserts rows into tables parallely, but it failed in our tests.
CREATE INDEX mol_table_index ON mol_table USING chemindex(mol);
SELECT id from mol_table WHERE 'c1ccccc1N' |<| mol;
These limitations are not invoked by JChem PostgreSQL Cartridge.
Import can be done only with a limited set of PostgreSQL methods. Only a single insert can be performed using SQL , bulk insert can be performed with a command-line toolor COPY
, described here, which has a much better performance.
No subselects are allowed in a modification statement (e.g. insert, delete, update). For example INSERT INTO table2 SELECT * FROM table1 WHERE 'C' |<| mol is not supported.
Only distributed tables can be joint in one SELECT statement. A distributed and non-distributed table join is not supported.
Only the postgres
user can have distributed tables.
Distributed tables can not be renamed.
In Citus version 5.0 explain plans are not available, but they are already available in version 5.1.
Since the Chemaxon PostgreSQL Cartridge does not contain an equality operator for Molecule
type, tables can not be distributed by hashing the Molecule
type column. Tables containing molecules have to be sharded by another column.