Trainer Engine Configuration

    This page describes the configuration of the Trainer Engine.

    Descriptor generation

    The descriptors to be generated have to be specified in an hJSON configuration file. It defines the set of descriptors as an array of objects, one per descriptor type, each containing a type key and, where applicable, a descriptors array.

    The following configuration file is a template showing such an example set of descriptors.

    {
        descriptorGenerator: [
        {
            type: PHYSCHEM
        }
        {
            // Any topology descriptor having a scalar value from the TopologyAnalyserPlugin can be used.
            //https://apidocs.chemaxon.com/jchem/doc/dev/java/api/chemaxon/marvin/calculations/TopologyAnalyserPlugin.html
            type: TOPOLOGY
            descriptors: [
                atomCount
                fsp3
                heteroRingCount
                stericEffectIndex
            ]
        }
        {
            // Any Chemical Terms function that returns a scalar can be used as a descriptor.
            // https://docs.chemaxon.com/display/docs/chemical-terms-functions-in-alphabetic-order.md
            type: CHEMTERM
            descriptors: [
                atomCount('6')
                atomCount('7')
                formalCharge(majorms('7.4'))
                max(pka())
                min(pka())
                logP()
            ]
        }
        {
            // The default ECFP fingerprint is 1024 bits long and has a radius of 4.
            // The following ECFP fingerprints are also available:
            //    ECFP4_256, ECFP4_512, ECFP4_1024, ECFP4_2048
            //    ECFP6_256, ECFP6_512, ECFP6_1024, ECFP6_2048
            type: ECFP4_1024
        }
        {
            type: MACCS
        }
        {
            // Any property defined in an SD file tag of the training file can be
            // used as a descriptor.
            type: SDFTAG
            descriptors: [
                DESCRIPTOR1
                DESCRIPTOR2
            ]
        }
        {
            // The kNN (k Nearest Neighbour) Regression descriptor is the weighted
            // average of the training property (e.g. hERG) of the 5 most similar
            // molecules in the training set. The weights are the similarity values
            // between the input molecule and the training set molecules.
            type: KNN_REGRESSION
        }]
    }
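    Once the hJSON file is parsed into a dictionary, the structure above can be checked programmatically. The following Python sketch is only an illustration (the function name and checks are ours, not part of the Trainer Engine API); it assumes that TOPOLOGY, CHEMTERM and SDFTAG entries must carry an explicit descriptors list, as in the template above.

```python
# Descriptor types that take an explicit "descriptors" list in the template above.
TYPES_WITH_DESCRIPTOR_LIST = {"TOPOLOGY", "CHEMTERM", "SDFTAG"}


def check_descriptor_config(config: dict) -> None:
    """Raise ValueError if a parsed descriptor config is malformed."""
    generators = config.get("descriptorGenerator")
    if not isinstance(generators, list) or not generators:
        raise ValueError("descriptorGenerator must be a non-empty array")
    for entry in generators:
        if "type" not in entry:
            raise ValueError("every descriptor entry needs a 'type' key")
        if entry["type"] in TYPES_WITH_DESCRIPTOR_LIST and not entry.get("descriptors"):
            raise ValueError(f"{entry['type']} requires a 'descriptors' list")


# Example: a minimal valid config passes without raising.
check_descriptor_config({
    "descriptorGenerator": [
        {"type": "PHYSCHEM"},
        {"type": "TOPOLOGY", "descriptors": ["atomCount", "fsp3"]},
    ]
})
```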

    Note: The classification-type kNN descriptor can also be used by putting type: KNN_CLASSIFICATION in the config file.

    Training set standardization

    Molecules of the training set can be standardized before descriptor generation using ChemAxon's Standardizer. This can be done by inserting a standardizer: <action string> line before the descriptorGenerator: block in the hJSON file, where <action string> is the Standardizer action string.

    The action string contains the sequence of Standardizer actions to be performed. In the action string, consecutive actions are separated from each other by "..".
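    For example, splitting on the ".." separator recovers the individual actions. A minimal Python sketch (the helper names are ours, not part of the Trainer Engine):

```python
def split_action_string(action_string: str) -> list[str]:
    """Split a Standardizer action string into its individual actions.

    Actions are separated by ".." in the action string, e.g.
    "neutralize..aromatize..tautomerize".
    """
    return [action for action in action_string.split("..") if action]


def join_actions(actions: list[str]) -> str:
    """Build an action string from a list of Standardizer actions."""
    return "..".join(actions)


print(split_action_string("neutralize..aromatize..tautomerize"))
# ['neutralize', 'aromatize', 'tautomerize']
```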

    Example

    In this example we define a standardization step that neutralizes, aromatizes and finally tautomerizes the molecules of the training set. The corresponding action string is neutralize..aromatize..tautomerize.

    The action string can then be put into the following example hJSON config file:

    {
        standardizer: neutralize..aromatize..tautomerize
        descriptorGenerator: [
        {
            type: PHYSCHEM
        }
        {
            // The default ECFP fingerprint is 1024 bits long and has a radius of 4.
            // The following ECFP fingerprints are also available:
            //    ECFP4_256, ECFP4_512, ECFP4_1024, ECFP4_2048
            //    ECFP6_256, ECFP6_512, ECFP6_1024, ECFP6_2048
            type: ECFP6_1024
        }
        {
            type: MACCS
        }]
    }

    Training (model building)

    The settings of a model have to be specified in a configuration file.

    The trainer configuration file is an hJSON file that defines a trainer object with method, algorithm and params keys, plus an optional trainerWrapper key.

    Below are some example hJSON configuration files for the currently available training models.

    Linear Regression

    {
        trainer: {
            method: REGRESSION
            algorithm: LINEAR_REGRESSION
            params: {
                // Feature transformation. In general, learning algorithms benefit
                // from standardization of the data set.
                // Available functions:
                //  Scaler       - Scales all numeric variables into the range [0, 1]
                //  Standardizer - Standardizes numeric feature to 0 mean and unit variance
                //  MaxAbsScaler - scales each feature by its maximum absolute value
                featureTransformer: None
            }
        }
    }
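    After parsing, a trainer config can be checked for the keys described above. This validator is our own sketch, not a Trainer Engine API; it treats trainerWrapper as optional and method, algorithm and params as required.

```python
def check_trainer_config(config: dict) -> None:
    """Raise ValueError if a parsed trainer config is missing required keys.

    'trainerWrapper' is optional; 'method', 'algorithm' and 'params'
    are required.
    """
    trainer = config.get("trainer")
    if trainer is None:
        raise ValueError("config must define a 'trainer' object")
    for key in ("method", "algorithm", "params"):
        if key not in trainer:
            raise ValueError(f"trainer is missing required key '{key}'")


# Example: the Linear Regression config above, as a parsed dictionary.
check_trainer_config({
    "trainer": {
        "method": "REGRESSION",
        "algorithm": "LINEAR_REGRESSION",
        "params": {"featureTransformer": "None"},
    }
})
```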

    Logistic Regression

    {
        trainer: {
            method: CLASSIFICATION
            algorithm: LOGISTIC_REGRESSION
            params: {
                // lambda > 0 gives a regularized estimate of linear weights which
                // often has superior generalization performance, especially when
                // the dimensionality is high.
                lambda: 0.1
    
                // The tolerance for stopping iterations.
                tol: 1e-5
    
                // The maximum number of iterations.
                maxIter: 500
    
                // Feature transformation. In general, learning algorithms benefit
                // from standardization of the data set.
                // Available functions:
                //    "Scaler"       - Scales all numeric variables into the range [0, 1]
                //    "Standardizer" - Standardizes numeric feature to 0 mean and unit variance
                //    "MaxAbsScaler" - scales each feature by its maximum absolute value
                featureTransformer: None
            }
        }
    }

    Random Forest Classification

    {
        trainer: {
            // This is an optional key which defines an outer wrapper for the
            // training type. Currently only CONFORMAL_PREDICTION is available,
            // which allows Error Bound prediction.
            // trainerWrapper: CONFORMAL_PREDICTION
    
            method: CLASSIFICATION
    
            // With the CONFORMAL_PREDICTION wrapper, only RANDOM_FOREST is
            // supported for CLASSIFICATION.
            algorithm: RANDOM_FOREST
    
            params: {
                // The number of trees.
                ntrees: 300
    
                // The number of input variables to be used to determine the
                // decision at a node of the tree. If p is the number of variables
                // floor(sqrt(p)) generally gives good performance.
                mtry: 0
    
                // The ratio of input variables to be used to determine the
                // decision at a node of the tree. If p is the number of variables
                // p / 3 usually gives good performance.
                // mtryRatio: 0.35
    
                // The maximum depth of the tree.
                maxDepth: 50
    
                // The maximum number of leaf nodes of the tree.
                // Default, if not specified: data size / 5
                // maxNodes: 50
    
                // The number of instances in a node below which the tree will not
                // split, nodeSize = 5 generally gives good results.
                nodeSize: 1
    
                // The sampling rate for the training tree. 1.0 means sampling with
                // replacement, while < 1.0 means sampling without replacement.
                subSample: 1.0
    
                // Priors of the classes. The weight of each class is roughly the
                // ratio of samples in each class. For example, if there are 400
                // positive samples and 100 negative samples, the weights should
                // be [1, 4] (assuming label 0 is negative and label 1 is
                // positive).
                // weights: [1, 4]
            }
        }
    }
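    The [1, 4] example above can be reproduced by normalizing the per-class sample counts so that the smallest class gets weight 1. The helper below is purely illustrative (not engine code) and assumes integer counts indexed by label:

```python
def class_weights(counts: list[int]) -> list[int]:
    """Derive class weights from per-label sample counts, normalized so
    that the smallest class has weight 1. counts[i] is the number of
    training samples with label i.
    """
    smallest = min(counts)
    return [count // smallest for count in counts]


# 100 negative samples (label 0) and 400 positive samples (label 1):
print(class_weights([100, 400]))  # [1, 4]
```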

    Random Forest Regression

    {
        trainer: {
    
            // This is an optional key which defines an outer wrapper for the
            // training type. Currently only CONFORMAL_PREDICTION is available,
            // which allows Error Bound prediction.
            // trainerWrapper: CONFORMAL_PREDICTION
            method: REGRESSION
            algorithm: RANDOM_FOREST
            params: {
                // The number of trees.
                ntrees: 300
    
                // The number of input variables to be used to determine the
                // decision at a node of the tree. If p is the number of variables
                // p / 3 usually gives good performance.
                mtry: 0
    
                // The ratio of input variables to be used to determine the
                // decision at a node of the tree. If p is the number of variables
                // p / 3 usually gives good performance.
                // mtryRatio: 0.35
    
                // The maximum depth of the tree.
                maxDepth: 50
    
                // The maximum number of leaf nodes of the tree.
                // Default, if not specified: data size / 5
                // maxNodes: 50
    
                // The number of instances in a node below which the tree will not
                // split, nodeSize = 5 generally gives good results.
                nodeSize: 1
    
                // The sampling rate for the training tree. 1.0 means sampling with
                // replacement, while < 1.0 means sampling without replacement.
                subSample: 1.0
            }
        }
    }

    Note: Only one of the mtry and mtryRatio parameters can be used in a given config file. Setting both parameters at the same time results in an error.
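    This mutual exclusion can be expressed as a simple check on the parsed params block (our sketch, not part of the engine):

```python
def check_mtry_params(params: dict) -> None:
    """Reject params blocks that set both mtry and mtryRatio, per the
    note above: only one of the two may appear in a given config file.
    """
    if "mtry" in params and "mtryRatio" in params:
        raise ValueError("only one of 'mtry' and 'mtryRatio' may be set")


check_mtry_params({"mtry": 0, "ntrees": 300})          # fine
check_mtry_params({"mtryRatio": 0.35, "ntrees": 300})  # fine
```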

    Support Vector Machine Classification

    {
        trainer: {
            method: CLASSIFICATION
            algorithm: SUPPORT_VECTOR_MACHINE
            params: {
    
                // The soft margin penalty parameter.
                c: 0.5
    
                // The tolerance of convergence test.
                tol: 0.1
    
                // Feature transformation. In general, learning algorithms benefit
                // from standardization of the data set.
                // Available functions:
                //  Scaler       - Scales all numeric variables into the range [0, 1]
                //  Standardizer - Standardizes numeric feature to 0 mean and unit variance
                //  MaxAbsScaler - scales each feature by its maximum absolute value
                featureTransformer: Scaler
            }
        }
    }

    Support Vector Machine Regression

    {
        trainer: {
            method: REGRESSION
            algorithm: SUPPORT_VECTOR_REGRESSION
            params: {
    
                // Threshold parameter. There is no penalty associated with samples
                // which are predicted within distance epsilon from the actual
                // value. Decreasing epsilon forces closer fitting to the
                // calibration/training data.
                eps: 1.0
    
                // The soft margin penalty parameter.
                c: 0.5
    
                // The tolerance of convergence test.
                tol: 0.1
    
                // Feature transformation. In general, learning algorithms benefit
                // from standardization of the data set.
                // Available functions:
                //  Scaler       - Scales all numeric variables into the range [0, 1]
                //  Standardizer - Standardizes numeric feature to 0 mean and unit variance
                //  MaxAbsScaler - scales each feature by its maximum absolute value
                featureTransformer: Scaler
            }
        }
    }

    Gradient Tree Boost Classification

    {
        trainer: {
    
            method: CLASSIFICATION
    
            algorithm: GRADIENT_TREE_BOOST
    
            params: {
    
                // The number of trees.
                ntrees: 500
    
                // The maximum depth of the tree.
                maxDepth: 20
    
                // The maximum number of leaf nodes of the tree.
                maxNodes: 6
    
                // The number of instances in a node below which the tree will not
                // split, nodeSize = 5 generally gives good results.
                nodeSize: 5
    
                // The shrinkage parameter in (0, 1] controls the learning rate
                // of the procedure.
                shrinkage: 0.05
    
                // The sampling fraction for stochastic tree boosting.
                subSample: 0.7
    
            }
        }
    }

    Gradient Tree Boost Regression

    {
        trainer: {
    
            method: REGRESSION
    
            algorithm: GRADIENT_TREE_BOOST
    
            params: {
    
                // Loss function for regression.
                // Available functions:
                //     LeastSquares
                //     LeastAbsoluteDeviation
                //     Quantile(p)              p = [0, 1]
                //     Huber(p)                 p = [0, 1]
                lossFunction: LeastAbsoluteDeviation
    
                // The number of trees.
                ntrees: 500
    
                // The maximum depth of the tree.
                maxDepth: 20
    
                // The maximum number of leaf nodes of the tree.
                maxNodes: 6
    
                // The number of instances in a node below which the tree will not
                // split, nodeSize = 5 generally gives good results.
                nodeSize: 5
    
                // The shrinkage parameter in (0, 1] controls the learning rate
                // of the procedure.
                shrinkage: 0.05
    
                // The sampling fraction for stochastic tree boosting.
                subSample: 0.7
    
            }
        }
    }

    Note: The parser of the Trainer Engine supports configuration files in hJSON format, a convenient extension of the original JSON format. Since every valid JSON file is also valid hJSON, the parser reads configuration files in plain JSON format as well.
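    For instance, the Support Vector Machine Classification example above can be rewritten as strict JSON (quoted keys and string values, comma-separated entries, no comments) and parsed with any standard JSON parser:

```python
import json

# The SVM classification trainer config from above, as strict JSON.
svm_config = json.loads("""
{
    "trainer": {
        "method": "CLASSIFICATION",
        "algorithm": "SUPPORT_VECTOR_MACHINE",
        "params": {
            "c": 0.5,
            "tol": 0.1,
            "featureTransformer": "Scaler"
        }
    }
}
""")

print(svm_config["trainer"]["algorithm"])  # SUPPORT_VECTOR_MACHINE
```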