This page describes the configuration of the Trainer Engine.
The descriptors to be generated have to be specified in an hJSON configuration file. The file defines the set of descriptors as an array of descriptor objects, each having a type key and, where applicable, a descriptors key listing the individual descriptors of that type.
The following configuration file is a template showing such an example set of descriptors.
{
    descriptorGenerator: [
        {
            type: PHYSCHEM
        }
        {
            // Any topology descriptor having a scalar value from the TopologyAnalyserPlugin can be used.
            // https://apidocs.chemaxon.com/jchem/doc/dev/java/api/chemaxon/marvin/calculations/TopologyAnalyserPlugin.html
            type: TOPOLOGY
            descriptors: [
                atomCount
                fsp3
                heteroRingCount
                stericEffectIndex
            ]
        }
        {
            // Any Chemical Terms function that returns a scalar can be used as a descriptor.
            // https://docs.chemaxon.com/display/docs/chemical-terms-functions-in-alphabetic-order.md
            type: CHEMTERM
            descriptors: [
                atomCount('6')
                atomCount('7')
                formalCharge(majorms('7.4'))
                max(pka())
                min(pka())
                logP()
            ]
        }
        {
            // The default ECFP fingerprint is 1024 bits long and has a diameter of 4.
            // The following ECFP fingerprints are also available:
            // ECFP4_256, ECFP4_512, ECFP4_1024, ECFP4_2048
            // ECFP6_256, ECFP6_512, ECFP6_1024, ECFP6_2048
            type: ECFP4_1024
        }
        {
            type: MACCS
        }
        {
            // Any property defined in an SD file tag of the training file can be
            // used as a descriptor.
            type: SDFTAG
            descriptors: [
                DESCRIPTOR1
                DESCRIPTOR2
            ]
        }
        {
            // The kNN (k Nearest Neighbour) regression descriptor is the weighted average of
            // the training property (e.g. hERG) of the 5 most similar molecules in
            // the training set. The weights are the similarity values between the
            // molecule and the training set molecules.
            type: KNN_REGRESSION
        }
    ]
}
Note: The classification-type kNN descriptor can also be used by putting type: KNN_CLASSIFICATION in the config file.
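For example, a descriptor block using the classification-type kNN descriptor might look like the following minimal sketch (assuming the same block structure as the template above):
{
    descriptorGenerator: [
        {
            // Classification counterpart of KNN_REGRESSION: the descriptor is
            // derived from the class labels of the most similar training set
            // molecules instead of their numeric property values.
            type: KNN_CLASSIFICATION
        }
    ]
}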
Molecules of the training set can be standardized before descriptor generation using ChemAxon's Standardizer. This is done by inserting the line standardizer: Std_Action_String before the descriptorGenerator: block in the hJSON file, where Std_Action_String defines the Standardizer action string.
The action string contains the sequence of Standardizer actions to be performed; consecutive actions must be separated by "..".
In the following example we define a standardization step that neutralizes, aromatizes, and finally tautomerizes the molecules of the training set. Its action string is neutralize..aromatize..tautomerize.
This string can then be put into the following example hJSON config file:
{
    standardizer: neutralize..aromatize..tautomerize
    descriptorGenerator: [
        {
            type: PHYSCHEM
        }
        {
            // The default ECFP fingerprint is 1024 bits long and has a diameter of 4.
            // The following ECFP fingerprints are also available:
            // ECFP4_256, ECFP4_512, ECFP4_1024, ECFP4_2048
            // ECFP6_256, ECFP6_512, ECFP6_1024, ECFP6_2048
            type: ECFP6_1024
        }
        {
            type: MACCS
        }
    ]
}
The settings of a model have to be specified in a configuration file. The trainer configuration file is an hJSON file that defines a trainer object with trainerWrapper, method, algorithm, and params keys.
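For instance, a trainer object using all four keys might look like the following minimal sketch (the values are placeholders taken from the examples below, not recommendations):
{
    trainer: {
        // Optional outer wrapper for the training type.
        trainerWrapper: CONFORMAL_PREDICTION
        method: REGRESSION
        algorithm: RANDOM_FOREST
        params: {
            // Algorithm-specific parameters go here.
            ntrees: 300
        }
    }
}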
Below are some example hJSON configuration files for the currently available training models.
{
    trainer: {
        method: REGRESSION
        algorithm: LINEAR_REGRESSION
        params: {
            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            // Scaler - scales all numeric variables into the range [0, 1]
            // Standardizer - standardizes numeric features to 0 mean and unit variance
            // MaxAbsScaler - scales each feature by its maximum absolute value
            featureTransformer: None
        }
    }
}
{
    trainer: {
        method: CLASSIFICATION
        algorithm: LOGISTIC_REGRESSION
        params: {
            // lambda > 0 gives a regularized estimate of linear weights which
            // often has superior generalization performance, especially when
            // the dimensionality is high.
            lambda: 0.1
            // The tolerance for stopping iterations.
            tol: 1e-5
            // The maximum number of iterations.
            maxIter: 500
            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            // Scaler - scales all numeric variables into the range [0, 1]
            // Standardizer - standardizes numeric features to 0 mean and unit variance
            // MaxAbsScaler - scales each feature by its maximum absolute value
            featureTransformer: None
        }
    }
}
{
    trainer: {
        // This is an optional key which defines an outer wrapper for the
        // training type. Currently only CONFORMAL_PREDICTION is available,
        // which allows error bound prediction.
        // trainerWrapper: CONFORMAL_PREDICTION
        method: CLASSIFICATION
        // With the CONFORMAL_PREDICTION wrapper, only RANDOM_FOREST is
        // supported for CLASSIFICATION.
        algorithm: RANDOM_FOREST
        params: {
            // The number of trees.
            ntrees: 300
            // The number of input variables used to determine the decision at
            // a node of the tree. If p is the number of variables,
            // floor(sqrt(p)) generally gives good performance.
            mtry: 0
            // The ratio of input variables used to determine the decision at
            // a node of the tree. If p is the number of variables,
            // p / 3 usually gives good performance.
            // mtryRatio: 0.35
            // The maximum depth of the tree.
            maxDepth: 50
            // The maximum number of leaf nodes of the tree.
            // Default, if not specified: data size / 5
            // maxNodes: 50
            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 1
            // The sampling rate for the training tree. 1.0 means sampling with
            // replacement, while < 1.0 means sampling without replacement.
            subSample: 1.0
            // Priors of the classes. The weight of each class is roughly the
            // ratio of samples in each class. For example, if there are 400
            // positive samples and 100 negative samples, the weights should
            // be [1, 4] (assuming label 0 is negative and label 1 is positive).
            // weights: [1, 4]
        }
    }
}
{
    trainer: {
        // This is an optional key which defines an outer wrapper for the
        // training type. Currently only CONFORMAL_PREDICTION is available,
        // which allows error bound prediction.
        // trainerWrapper: CONFORMAL_PREDICTION
        method: REGRESSION
        algorithm: RANDOM_FOREST
        params: {
            // The number of trees.
            ntrees: 300
            // The number of input variables used to determine the decision at
            // a node of the tree. If p is the number of variables,
            // p / 3 usually gives good performance.
            mtry: 0
            // The ratio of input variables used to determine the decision at
            // a node of the tree. If p is the number of variables,
            // p / 3 usually gives good performance.
            // mtryRatio: 0.35
            // The maximum depth of the tree.
            maxDepth: 50
            // The maximum number of leaf nodes of the tree.
            // Default, if not specified: data size / 5
            // maxNodes: 50
            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 1
            // The sampling rate for the training tree. 1.0 means sampling with
            // replacement, while < 1.0 means sampling without replacement.
            subSample: 1.0
        }
    }
}
Note: Only one of the mtry and mtryRatio parameters can be used in a given config file. Setting both parameters at the same time results in an error.
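For example, a random forest configuration that uses mtryRatio in place of mtry might look like the following sketch (the value 0.35 is taken from the commented-out line above and is only illustrative):
{
    trainer: {
        method: REGRESSION
        algorithm: RANDOM_FOREST
        params: {
            ntrees: 300
            // mtryRatio replaces mtry; setting both would result in an error.
            mtryRatio: 0.35
            maxDepth: 50
            nodeSize: 1
            subSample: 1.0
        }
    }
}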
{
    trainer: {
        method: CLASSIFICATION
        algorithm: SUPPORT_VECTOR_MACHINE
        params: {
            // The soft margin penalty parameter.
            c: 0.5
            // The tolerance of the convergence test.
            tol: 0.1
            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            // Scaler - scales all numeric variables into the range [0, 1]
            // Standardizer - standardizes numeric features to 0 mean and unit variance
            // MaxAbsScaler - scales each feature by its maximum absolute value
            featureTransformer: Scaler
        }
    }
}
{
    trainer: {
        method: REGRESSION
        algorithm: SUPPORT_VECTOR_REGRESSION
        params: {
            // Threshold parameter. There is no penalty associated with samples
            // which are predicted within distance epsilon from the actual
            // value. Decreasing epsilon forces closer fitting to the
            // calibration/training data.
            eps: 1.0
            // The soft margin penalty parameter.
            c: 0.5
            // The tolerance of the convergence test.
            tol: 0.1
            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            // Scaler - scales all numeric variables into the range [0, 1]
            // Standardizer - standardizes numeric features to 0 mean and unit variance
            // MaxAbsScaler - scales each feature by its maximum absolute value
            featureTransformer: Scaler
        }
    }
}
{
    trainer: {
        method: CLASSIFICATION
        algorithm: GRADIENT_TREE_BOOST
        params: {
            // The number of trees.
            ntrees: 500
            // The maximum depth of the tree.
            maxDepth: 20
            // The maximum number of leaf nodes of the tree.
            maxNodes: 6
            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 5
            // The shrinkage parameter in (0, 1] controls the learning rate of
            // the procedure.
            shrinkage: 0.05
            // The sampling fraction for stochastic tree boosting.
            subSample: 0.7
        }
    }
}
{
    trainer: {
        method: REGRESSION
        algorithm: GRADIENT_TREE_BOOST
        params: {
            // Loss function for regression.
            // Available functions:
            // LeastSquares
            // LeastAbsoluteDeviation
            // Quantile(p) with p in [0, 1]
            // Huber(p) with p in [0, 1]
            lossFunction: LeastAbsoluteDeviation
            // The number of trees.
            ntrees: 500
            // The maximum depth of the tree.
            maxDepth: 20
            // The maximum number of leaf nodes of the tree.
            maxNodes: 6
            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 5
            // The shrinkage parameter in (0, 1] controls the learning rate of
            // the procedure.
            shrinkage: 0.05
            // The sampling fraction for stochastic tree boosting.
            subSample: 0.7
        }
    }
}
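The parameterized loss functions take their parameter in parentheses. For example, a quantile regression variant of the configuration above might be set up as in the following sketch (the value 0.9 is only illustrative):
{
    trainer: {
        method: REGRESSION
        algorithm: GRADIENT_TREE_BOOST
        params: {
            // Fit the 90th percentile instead of the median.
            lossFunction: Quantile(0.9)
            ntrees: 500
            maxDepth: 20
            maxNodes: 6
            nodeSize: 5
            shrinkage: 0.05
            subSample: 0.7
        }
    }
}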
Note: The parser of the Trainer Engine supports configuration files in hJSON format, a convenient extension of the original JSON format. Configuration files in plain JSON format are accepted as well.
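For example, a minimal descriptor configuration written as plain JSON (quoted keys and values, commas between items) would look like the following sketch, equivalent to its hJSON counterpart:
{
    "descriptorGenerator": [
        { "type": "PHYSCHEM" },
        { "type": "MACCS" }
    ]
}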