This page describes the configuration of the Trainer Engine.
The descriptors to be generated have to be specified in an hJSON configuration file. The file defines the set of descriptors as an array of descriptor objects, each having a type key and, where applicable, a descriptors key listing the individual descriptors of that type.
The following configuration file is a template showing such an example set of descriptors.
{
    descriptorGenerator: [
        {
            type: PHYSCHEM
        }
        {
            // Any topology descriptor having a scalar value from the TopologyAnalyserPlugin can be used.
            // https://apidocs.chemaxon.com/jchem/doc/dev/java/api/chemaxon/marvin/calculations/TopologyAnalyserPlugin.html
            type: TOPOLOGY
            descriptors: [
                atomCount
                fsp3
                heteroRingCount
                stericEffectIndex
            ]
        }
        {
            // Any Chemical Terms function that returns a scalar can be used as a descriptor.
            // https://docs.chemaxon.com/display/docs/chemical-terms-functions-in-alphabetic-order.md
            type: CHEMTERM
            descriptors: [
                atomCount('6')
                atomCount('7')
                formalCharge(majorms('7.4'))
                max(pka())
                min(pka())
                logP()
            ]
        }
        {
            // The default ECFP fingerprint is 1024 bits long and has a diameter of 4.
            // The following ECFP fingerprints are also available:
            // ECFP4_256, ECFP4_512, ECFP4_1024, ECFP4_2048
            // ECFP6_256, ECFP6_512, ECFP6_1024, ECFP6_2048
            type: ECFP4_1024
        }
        {
            type: MACCS
        }
        {
            // Any property defined in an SD file tag of the training file can be
            // used as a descriptor.
            type: SDFTAG
            descriptors: [
                DESCRIPTOR1
                DESCRIPTOR2
            ]
        }
        {
            // The kNN (k Nearest Neighbour) regression descriptor is the weighted average of
            // the training property (e.g. hERG) of the 5 most similar molecules in
            // the training set. The weights are the similarity values between the
            // molecule and the training set molecules.
            type: KNN_REGRESSION
        }
    ]
}
Note: The classification-type kNN descriptor can also be used by putting type: KNN_CLASSIFICATION in the config file.
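For example, a descriptor block using the classification-type kNN descriptor might look like the following minimal sketch (assuming the same block structure as the template above):
{
    descriptorGenerator: [
        {
            // Classification counterpart of KNN_REGRESSION: the descriptor is
            // derived from the class labels of the most similar training set
            // molecules instead of their numeric property values.
            type: KNN_CLASSIFICATION
        }
    ]
}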
Molecules of the training set can be standardized before descriptor generation using ChemAxon's Standardizer. This is done by inserting the line standardizer: Std_Action_String before the descriptorGenerator: block in the hJSON file, where Std_Action_String defines the Standardizer action string.
The action string contains the sequence of Standardizer actions to be performed; consecutive actions must be separated by "..".
In the following example we define a standardization step that neutralizes, aromatizes, and finally tautomerizes the molecules of the training set. Its action string is neutralize..aromatize..tautomerize.
This string can then be put into the following example hJSON config file:
{
    standardizer: neutralize..aromatize..tautomerize
    descriptorGenerator: [
        {
            type: PHYSCHEM
        }
        {
            // The default ECFP fingerprint is 1024 bits long and has a diameter of 4.
            // The following ECFP fingerprints are also available:
            // ECFP4_256, ECFP4_512, ECFP4_1024, ECFP4_2048
            // ECFP6_256, ECFP6_512, ECFP6_1024, ECFP6_2048
            type: ECFP6_1024
        }
        {
            type: MACCS
        }
    ]
}
The settings of a model have to be specified in a configuration file. The trainer configuration file is an hJSON file that defines a trainer object with trainerWrapper, method, algorithm, and params keys.
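For instance, a trainer object using all four keys might look like the following minimal sketch (the values are placeholders taken from the examples below, not recommendations):
{
    trainer: {
        // Optional outer wrapper for the training type.
        trainerWrapper: CONFORMAL_PREDICTION
        method: REGRESSION
        algorithm: RANDOM_FOREST
        params: {
            // Algorithm-specific parameters go here.
            ntrees: 300
        }
    }
}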
Below are some example hJSON configuration files for the currently available training models.
{
    trainer: {
        method: REGRESSION
        algorithm: LINEAR_REGRESSION
        params: {
            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            // Scaler - scales all numeric variables into the range [0, 1]
            // Standardizer - standardizes numeric features to 0 mean and unit variance
            // MaxAbsScaler - scales each feature by its maximum absolute value
            featureTransformer: None
        }
    }
}
{
    trainer: {
        method: CLASSIFICATION
        algorithm: LOGISTIC_REGRESSION
        params: {
            // lambda > 0 gives a regularized estimate of linear weights which
            // often has superior generalization performance, especially when
            // the dimensionality is high.
            lambda: 0.1
            // The tolerance for stopping iterations.
            tol: 1e-5
            // The maximum number of iterations.
            maxIter: 500
            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            // Scaler - scales all numeric variables into the range [0, 1]
            // Standardizer - standardizes numeric features to 0 mean and unit variance
            // MaxAbsScaler - scales each feature by its maximum absolute value
            featureTransformer: None
        }
    }
}
{
    trainer: {
        // This is an optional key which defines an outer wrapper for the
        // training type. Currently only CONFORMAL_PREDICTION is available,
        // which allows error bound prediction.
        // trainerWrapper: CONFORMAL_PREDICTION
        method: CLASSIFICATION
        // With the CONFORMAL_PREDICTION wrapper, only RANDOM_FOREST is
        // supported for CLASSIFICATION.
        algorithm: RANDOM_FOREST
        params: {
            // The number of trees.
            ntrees: 300
            // The number of input variables used to determine the decision at
            // a node of the tree. If p is the number of variables,
            // floor(sqrt(p)) generally gives good performance.
            mtry: 0
            // The ratio of input variables used to determine the decision at
            // a node of the tree. If p is the number of variables,
            // p / 3 usually gives good performance.
            // mtryRatio: 0.35
            // The maximum depth of the tree.
            maxDepth: 50
            // The maximum number of leaf nodes of the tree.
            // Default, if not specified: data size / 5
            // maxNodes: 50
            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 1
            // The sampling rate for the training tree. 1.0 means sampling with
            // replacement, while < 1.0 means sampling without replacement.
            subSample: 1.0
            // Priors of the classes. The weight of each class is roughly the
            // ratio of samples in each class. For example, if there are 400
            // positive samples and 100 negative samples, the weights should
            // be [1, 4] (assuming label 0 is negative and label 1 is positive).
            // weights: [1, 4]
        }
    }
}
{
    trainer: {
        // This is an optional key which defines an outer wrapper for the
        // training type. Currently only CONFORMAL_PREDICTION is available,
        // which allows error bound prediction.
        // trainerWrapper: CONFORMAL_PREDICTION
        method: REGRESSION
        algorithm: RANDOM_FOREST
        params: {
            // The number of trees.
            ntrees: 300
            // The number of input variables used to determine the decision at
            // a node of the tree. If p is the number of variables,
            // p / 3 usually gives good performance.
            mtry: 0
            // The ratio of input variables used to determine the decision at
            // a node of the tree. If p is the number of variables,
            // p / 3 usually gives good performance.
            // mtryRatio: 0.35
            // The maximum depth of the tree.
            maxDepth: 50
            // The maximum number of leaf nodes of the tree.
            // Default, if not specified: data size / 5
            // maxNodes: 50
            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 1
            // The sampling rate for the training tree. 1.0 means sampling with
            // replacement, while < 1.0 means sampling without replacement.
            subSample: 1.0
        }
    }
}
Note: Only one of the mtry and mtryRatio parameters can be used in a given config file. Setting both parameters at the same time results in an error.
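For example, a random forest configuration that uses mtryRatio in place of mtry might look like the following sketch (the value 0.35 is taken from the commented-out line above and is only illustrative):
{
    trainer: {
        method: REGRESSION
        algorithm: RANDOM_FOREST
        params: {
            ntrees: 300
            // mtryRatio replaces mtry; setting both would result in an error.
            mtryRatio: 0.35
            maxDepth: 50
            nodeSize: 1
            subSample: 1.0
        }
    }
}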
{
    trainer: {
        method: CLASSIFICATION
        algorithm: SUPPORT_VECTOR_MACHINE
        params: {
            // The soft margin penalty parameter.
            c: 0.5
            // The tolerance of the convergence test.
            tol: 0.1
            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            // Scaler - scales all numeric variables into the range [0, 1]
            // Standardizer - standardizes numeric features to 0 mean and unit variance
            // MaxAbsScaler - scales each feature by its maximum absolute value
            featureTransformer: Scaler
        }
    }
}
{
    trainer: {
        method: REGRESSION
        algorithm: SUPPORT_VECTOR_REGRESSION
        params: {
            // Threshold parameter. There is no penalty associated with samples
            // which are predicted within distance epsilon from the actual
            // value. Decreasing epsilon forces closer fitting to the
            // calibration/training data.
            eps: 1.0
            // The soft margin penalty parameter.
            c: 0.5
            // The tolerance of the convergence test.
            tol: 0.1
            // Feature transformation. In general, learning algorithms benefit
            // from standardization of the data set.
            // Available functions:
            // Scaler - scales all numeric variables into the range [0, 1]
            // Standardizer - standardizes numeric features to 0 mean and unit variance
            // MaxAbsScaler - scales each feature by its maximum absolute value
            featureTransformer: Scaler
        }
    }
}
{
    trainer: {
        method: CLASSIFICATION
        algorithm: GRADIENT_TREE_BOOST
        params: {
            // The number of trees.
            ntrees: 500
            // The maximum depth of the tree.
            maxDepth: 20
            // The maximum number of leaf nodes of the tree.
            maxNodes: 6
            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 5
            // The shrinkage parameter in (0, 1] controls the learning rate of
            // the procedure.
            shrinkage: 0.05
            // The sampling fraction for stochastic tree boosting.
            subSample: 0.7
        }
    }
}
{
    trainer: {
        method: REGRESSION
        algorithm: GRADIENT_TREE_BOOST
        params: {
            // Loss function for regression.
            // Available functions:
            // LeastSquares
            // LeastAbsoluteDeviation
            // Quantile(p) with p in [0, 1]
            // Huber(p) with p in [0, 1]
            lossFunction: LeastAbsoluteDeviation
            // The number of trees.
            ntrees: 500
            // The maximum depth of the tree.
            maxDepth: 20
            // The maximum number of leaf nodes of the tree.
            maxNodes: 6
            // The number of instances in a node below which the tree will not
            // split; nodeSize = 5 generally gives good results.
            nodeSize: 5
            // The shrinkage parameter in (0, 1] controls the learning rate of
            // the procedure.
            shrinkage: 0.05
            // The sampling fraction for stochastic tree boosting.
            subSample: 0.7
        }
    }
}
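The parameterized loss functions take their parameter in parentheses. For example, a quantile regression variant of the configuration above might be set up as in the following sketch (the value 0.9 is only illustrative):
{
    trainer: {
        method: REGRESSION
        algorithm: GRADIENT_TREE_BOOST
        params: {
            // Fit the 90th percentile instead of the median.
            lossFunction: Quantile(0.9)
            ntrees: 500
            maxDepth: 20
            maxNodes: 6
            nodeSize: 5
            shrinkage: 0.05
            subSample: 0.7
        }
    }
}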
Note: The parser of the Trainer Engine supports configuration files in hJSON format, a convenient extension of the original JSON format. Configuration files in plain JSON format are accepted as well.
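For example, a minimal descriptor configuration written as plain JSON (quoted keys and values, commas between items) would look like the following sketch, equivalent to its hJSON counterpart:
{
    "descriptorGenerator": [
        { "type": "PHYSCHEM" },
        { "type": "MACCS" }
    ]
}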