Introduction

    This documentation gives a short introduction to Chemaxon's Trainer Engine.

    What is Trainer Engine?

    Chemaxon's Trainer Engine is a tool that predicts molecular properties by training machine learning models on input data sets. It supports model life cycle management through the following stages:

    Data preparation

    • Normalization of chemical structures (standardization)
    • Transformation of molecules to descriptors (feature generation)

    Model training

    • Fitting various model types on the input descriptors and labeled data
    • Calculation of statistics to measure model accuracy

    Optimization and validation

    • Visualization and comparison of model details
    • Central repository of training and prediction runs for reproducibility

    Deployment

    • Integration endpoints for model building and inference
    • Prediction on novel molecules, interactively or in batches
    • Enrichment of predicted values with applicability domain information and error prediction
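
    To illustrate what applicability domain information can look like, the sketch below flags a query molecule as inside or outside the domain based on its Tanimoto similarity to the nearest training molecule. This is one common approach and an assumption made for illustration, not a description of how Trainer Engine computes it. The function names, the toy fingerprints and the 0.5 threshold are all hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_training_similarity(query, training_set):
    """Highest similarity between the query and any training molecule."""
    return max(tanimoto(query, fp) for fp in training_set)


def in_domain(query, training_set, threshold=0.5):
    """Treat a prediction as in-domain if a sufficiently similar
    training molecule exists (threshold is an illustrative assumption)."""
    return nearest_training_similarity(query, training_set) >= threshold


# Toy fingerprints: each molecule is the set of bit positions that are set.
training = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5, 8})]
similar_query = frozenset({1, 2, 3, 9})
distant_query = frozenset({20, 21, 22})

print(in_domain(similar_query, training))
print(in_domain(distant_query, training))
```

    A prediction for a molecule that falls outside the domain can still be reported, but it should be read with more caution.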

    What is the benefit of using the Trainer Engine?

    Trainer Engine translates input data into executable predictions. It has been used to build successful models for a wide range of measured data types, including:

    • phys-chem properties (e.g. boiling point, vapor pressure, logP, logD)
    • analytical chemistry data (retention time)
    • ADMET endpoints (PAMPA, Caco2 permeability, hERG, MetStab, BBB penetration, CYP inhibition, PAINS)
    • on-target assay endpoints (target families including receptors such as GPCRs, enzymes such as kinases, and transporters)

    Trainer Engine predictions help medicinal and analytical chemists, toxicologists and drug discovery project team members assess the risks and opportunities of their compound collections. Computational chemists can use it to experiment with different model types, explain model behavior and seamlessly provide access to high-quality models for a larger audience.

    Overview of the components

    Trainer Engine is a service application that is accessed through:

    • Trainer web user interface (GUI) for model building, assessment, optimization, analysis and management workflows
    • REST API endpoints with Swagger API documentation for integration
    • Playground, a web interface for interactive single or batch prediction that is integrated by default

    The Trainer service includes the Standardizer, descriptor generation and ML model building libraries. It calculates the statistics and implements multi-step workflows such as conformal prediction. All input molecules, configurations, calculated values and statistics are persisted in a PostgreSQL database.
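
    As an illustration of why conformal prediction is a multi-step workflow, the sketch below shows the textbook split conformal procedure for regression: a model is trained first, its absolute errors on a held-out calibration set are collected, and a quantile of those errors is used as the half-width of the prediction interval. This is a generic sketch, not Trainer Engine's implementation. All function names and numbers are made up for the example.

```python
import math


def conformal_quantile(calibration_errors, alpha):
    """Return the conformal quantile of absolute calibration errors.

    With n calibration points, the ceil((n + 1) * (1 - alpha))-th smallest
    error yields intervals with at least (1 - alpha) coverage, assuming
    the calibration and future data are exchangeable.
    """
    n = len(calibration_errors)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_errors)[rank - 1]


def prediction_interval(point_prediction, quantile):
    """Symmetric interval around a point prediction."""
    return point_prediction - quantile, point_prediction + quantile


# Toy calibration set: measured values and model predictions (made up).
measured = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
predicted = [1.5, 1.75, 3.0, 5.0, 4.5, 6.25, 9.0]
errors = [abs(m - p) for m, p in zip(measured, predicted)]

q = conformal_quantile(errors, alpha=0.25)
low, high = prediction_interval(3.0, q)
print(q, low, high)
```

    The interval width depends only on the calibration errors here. Variants that scale the width per molecule need an additional error model, which adds a further step to the workflow.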

    Overview of the general workflow

    The general machine learning workflow consists of the following steps:

    1. Upload of the input file (SDF) containing molecules and labeled data
    2. Configuration of standardization actions and descriptor generation
    3. Configuration of training (model building) and validation via split
    4. Model building, which can be extended with configurable applicability domain assessment and error prediction
    5. Automatic model assessment on the training set and test set
    6. Visualization of the accuracy metrics, model details (feature importance values) and relationships between chemical structures and prediction accuracy
    7. Optimization of descriptor generation and training parameters (iteration from step 2)
    8. Selection of the best-scoring models and setting the "In production" flag
    9. Prediction of new molecules in single or batch mode with the Playground interface
    10. Integrate production models into design workflows using the REST interface
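
    Steps 3 to 5 above can be sketched in a few lines of plain Python. The example assumes a single numeric descriptor per molecule and a simple least-squares line as the model; both are simplifications chosen to keep the sketch self-contained and do not reflect the model types or descriptors that Trainer Engine offers. The data are toy values that lie exactly on a line.

```python
import math
import random


def train_test_split(rows, test_fraction, seed):
    """Shuffle the rows reproducibly and split off a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least squares for label = slope * descriptor + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(rows, model):
    """Root mean squared error of the model on the given rows."""
    slope, intercept = model
    squared = sum((slope * x + intercept - y) ** 2 for x, y in rows)
    return math.sqrt(squared / len(rows))


# (descriptor value, measured label) pairs; toy data.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]

train, test = train_test_split(data, test_fraction=0.2, seed=42)  # step 3
model = fit_line(train)                                           # step 4
print("train RMSE:", rmse(train, model))                          # step 5
print("test RMSE:", rmse(test, model))
```

    Comparing the training-set and test-set errors is what makes the split useful: a model that scores much better on the training set than on the test set is likely overfitted.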

    Programmatic interaction

    The REST API endpoints offer all the utilities needed to ingest data, train models and retrieve statistical assessment results. This makes it possible to automate parameter optimization and the selection of the best-performing models. API documentation is provided via Swagger, and proof-of-concept Jupyter notebooks covering larger workflows are available on demand. Contact us for further details: calculators@chemaxon.com.
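
    The kind of automation this enables can be sketched as a simple grid search. In the example below, train_and_score is a hypothetical stand-in for a client function that would start a training run through the REST API and return its validation score. The actual endpoints are not shown here and should be taken from the Swagger documentation. The parameter names tree_count and max_depth are likewise illustrative assumptions.

```python
from itertools import product


def grid_search(train_and_score, grid):
    """Try every parameter combination and keep the best-scoring one.

    train_and_score: callable that takes a dict of parameters and returns
    a validation score where higher is better. It is a hypothetical
    stand-in for a training run triggered through the REST API.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_train_and_score(params):
    # Placeholder scoring function so the sketch runs without a server.
    # A real client would submit the run and read back its statistics.
    return -abs(params["tree_count"] - 200) - 10 * abs(params["max_depth"] - 8)


grid = {"tree_count": [100, 200, 400], "max_depth": [4, 8, 16]}
best_params, best_score = grid_search(fake_train_and_score, grid)
print(best_params, best_score)
```

    Keeping the search loop separate from the scoring callable means the same loop works with any client implementation, including one that polls for asynchronously built models.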

    Design ecosystem

    Integrating Trainer Engine with Chemaxon's Design Hub provides a seamless ecosystem for early-stage discovery project and hypothesis management. In this setup, the built models are made available as Design Hub plugins that support the triage of hypothetical molecules and foster multi-parameter optimization, where ideas are assessed on the most important attributes delivered by Trainer Engine and by the other plugins available in Design Hub. Through the Trainer Engine API, models can be built or retrained asynchronously and deployed into Design Hub.