How To¶

Run Queries¶

The Downstream database schemas are available for any database operations: query, insert, update and delete,
This can be performed by direct SQL interactions or using a separate Data Viewer/Database Client application, like Chemaxon's Plexus Connect or Instant JChem applications.
This section shows commonly used sample PostgreSQL queries which can be run interactively or as part of a wider integration.

Search for a specific registry ID (PCN, CN, or LN):¶

-- Search by Parent ID (PCN)
SELECT * FROM parent WHERE pcn = 'target_parent_id';

-- Search by Version ID (CN)
SELECT * FROM version WHERE cn = 'target_version_id';

-- Search by Preparation ID (LN)
SELECT * FROM preparation WHERE ln = 'target_preparation_id';

-- Search across all levels
SELECT
    'parent' AS level,
    p.pcn,
    p.init_date,
    p.mod_date
FROM parent p
WHERE p.pcn = 'target_id'
UNION ALL
SELECT
    'version' AS level,
    v.cn,
    v.init_date,
    v.mod_date
FROM version v
WHERE v.cn = 'target_id'
UNION ALL
SELECT
    'preparation' AS level,
    prep.ln,
    prep.init_date,
    prep.mod_date
FROM preparation prep
WHERE prep.ln = 'target_id';

Search for Job status¶

SELECT * FROM http_message_job ORDER BY id DESC LIMIT 5;

Finding CompReg changes within a time window¶

To find changes within a specific time period, query the modification date fields:

  SELECT 
      p.parent_id,
      p.mod_date AS parent_modified,
      v.version_id,
      v.mod_date AS version_modified,
      prep.preparation_id,
      prep.mod_date AS preparation_modified
  FROM parent p
  LEFT JOIN version v ON v.parent_id = p.parent_id
  LEFT JOIN preparation prep ON prep.version_id = v.version_id
  WHERE 
      p.mod_date >= CURRENT_TIMESTAMP - INTERVAL '30 days' OR
      v.mod_date >= CURRENT_TIMESTAMP - INTERVAL '30 days' OR
      prep.mod_date >= CURRENT_TIMESTAMP - INTERVAL '30 days';

Complementing SQL queries with supplementary data¶

An Entity-Attribute-Value (EAV) model can be used for storing key:value data (ad_name: ad_value) for a given molecule with its associated parent, version, and/or preparation ID in the ADDITIONAL_DATA table. This data can be joined with any table to supplement queries.

  SELECT 
      p.parent_id,
      v.version_id,
      prep.preparation_id,
      ad.ad_name,
      ad.ad_value
  FROM parent p
  JOIN version v ON v.parent_id = p.parent_id
  JOIN preparation prep ON prep.version_id = v.version_id
  LEFT JOIN additional_data ad ON (
      ad.parent_id = p.parent_id OR 
      ad.version_id = v.version_id OR 
      ad.preparation_id = prep.preparation_id
  )
  WHERE ad.ad_name = 'comment';

This flexible approach to dynamic data storing can easily be pivoted, if needed, to provide a more accessible, standard relational table format.

Search for Version-Salt relationship¶

SELECT * FROM http_message_job ORDER BY id DESC LIMIT 5;
SELECT v.cn, s.name, vs.multiplicity
FROM version v
JOIN version_saltsolvate vs ON v.version_id = vs.version_id
JOIN saltsolvate s ON vs.saltsolvate_id = s.saltsolvate_id;

Perform Chemical Structural Searches¶

Chemical structure searching of downstream data is possible by simply configuring the DSClient to use a database with JChem Cartridge installed.

Comprehensive installation guides and API documentation are available online for Oracle, Postgres and other database systems.

JChem PostgreSQL Cartridge (JPC) Overview¶

The DSClient structure table stores molecules in the text column cd_structure.
Chemical structure searches can only be made when values are cast to a MOLECULE type, e.g., MOLECULE('sample'). Chemical indexes can only be performed on MOLECULE type columns.

MOLECULE formats

JPC accepts SMILES, CXSMILES, SMARTS, Molfile, SDF, MRV, etc. Strings are auto-recognized but can also be explicitly defined, e.g., molecule(string, 'smiles').

JPC structure search operators¶

Operators work on MOLECULE-typed columns or expressions. Cast strings with ::MOLECULE('sample') (or your molecule type).

Search type	Operator
Substructure	`\|<\|`
Superstructure	`\|>\|`
Duplicate (exact)	`\|=\|`
Full fragment	`\|<=\|` / `\|=>\|`
Similarity (new)	`\|<~\|` / `\|~>\|` with `sim_filter` / `sim_order`

Common JPC examples¶

Substructure search for two text molecular representations:

SELECT cd_id, cd_structure
FROM structure
WHERE 'c1ccccc1'::MOLECULE('sample') |<| cd_structure::MOLECULE('sample');

Creating a MOLECULE index

JPC can create chemical indexes only on columns of type MOLECULE, not on text.

Add a MOLECULE column (molecule type name must exist in /etc/chemaxon/types/):
```
ALTER TABLE structure ADD COLUMN mol MOLECULE('sample');
```

Backfill from cd_structure:

UPDATE structure SET mol = molecule(cd_structure);

Create a single chemical index on the MOLECULE column

- standard index- for returning all hits:
```
CREATE INDEX structure_chem_idx ON structure USING chemindex(mol);
```
- sorted index - for substructure searches with relevance-sorted results and ORDER BY mol LIMIT n:
```
CREATE INDEX structure_chem_idx ON structure USING sortedchemindex(mol);
```

Duplicate searching an indexed mol column:

SELECT cd_id, cd_structure FROM structure
WHERE 'c1ccccc1'::MOLECULE('sample') |=| mol;

Similarity search (JPC 2.5+)

Use sim_filter (unsorted) or sim_order (sorted by similarity). Similarity value is between 0 and 1; |<~| means similarity ≥ threshold, |~>| means dissimilarity (≤ threshold) and |~| means a comparable similarity.

-- Molecules with similarity to 'CCC' >= 0.8 (unsorted)
SELECT cd_id, mol, mol |~| 'CCC' AS similarity
FROM structure
WHERE ('CCC', 0.8)::sim_filter |<~| mol;

-- Top 20 most similar (use sortedchemindex for best performance)
SELECT cd_id, mol, mol |~| 'CCC' AS similarity
FROM structure
WHERE ('CCC', 0.8)::sim_order |<~| mol
LIMIT 20;

Check validity of structure strings:

SELECT cd_id FROM structure WHERE NOT is_valid_molecule(cd_structure);

Convert to another chemical format e.g. SDF, SMILES:

SELECT molconvert(mol, 'sdf') FROM structure LIMIT 1;
SELECT molconvert(mol, 'smiles') FROM structure LIMIT 1;

Explicit query format (SMILES vs SMARTS):

SELECT cd_id FROM structure
WHERE molecule('CC', 'smiles') |<| mol;
SELECT cd_id FROM structure
WHERE molecule('CC', 'smarts') |<| mol;

Query including parent IDs:

SELECT s.cd_id, s.cd_structure, p.parent_id, p.pcn
FROM structure s
JOIN parent p ON p.structure_id = s.cd_id
WHERE 'c1ccccc1'::MOLECULE('sample') |<| s.cd_structure::MOLECULE('sample');