Model Training Step
To run locally: model_training
This step fits a machine learning model to training data. First, an existing model is loaded from cloud storage or a new model is built. The training data is then loaded and checked for compatibility with the model metadata. If it is compatible, the model is trained (or re-trained) on the data. The trained model is stored locally and, optionally, registered and uploaded to cloud storage together with its metadata. The user specifies the training data, the type of model to use, and whether to train a new model or re-train an existing one.
The following is an example of a config file together with descriptions of its parts.
Step config example
    name: model_training

    input_data_specs:
      datastore_type: influxdb
      query_type: dataframe
      query_template_path: ./configs/data/influx_query.txt
      query_values:
        start: "2020-05-20T13:30:00.000Z"
        stop: "2020-05-20T13:31:00.000Z"
        bucket: sensors-raw
        measurement: cat
        tags: {}
      data_converter: {}

    output_data_specs:
      - datastore_type: influxdb
        settings:
          bucket: test-bucket-1
          measurement: testv1

    model_specs:
      type: ridge_reg
      load_existing: false
      name: pytest_RF
      model_load_specs:
        version: '000'
      params:
        alpha: 2.0

    run_specs:
      save_results: True
      target_label: accel_x
      grid_search:
        do_grid_search: false
        grid_definition: ./configs/model_grids/ridge_reg_grid.yml
        metric: 'mse'
Input and Output Data Specs
input_data_specs and output_data_specs follow a standard format for all the pipeline steps; see Octaipipe Steps.
Model Specs
    model_specs:
      type: random_forest_reg
      load_existing: false
      name: pytest_RF
      model_load_specs:
        name: pytest_RF
        version: '000'
      params:
        n_jobs: 2
        min_samples_split: 4
This section specifies the model to be trained.
type
This is a unique designation of the model type, as follows:
ex_lin_reg: Linear Regression
ridge_reg: Ridge Regression
enet_reg: Elastic Net Regression
random_forest_reg: Random Forest Regression
tweedie_reg: Tweedie Regression
quantile_reg: Quantile Regression
svr: Support Vector Regression
sgd_reg: Stochastic Gradient Descent Regression
bayes_ridge_reg: Bayesian Ridge Regression
ard_reg: ARD Regression
gbm_reg: Gradient Boosting Regression
xgb_reg: Extreme Gradient Boosting Regression
skopt_lgbm_reg: Skopt LightGBM Model
lgb_reg_sk: LightGBM Regression
log_reg: Logistic Regression
knn_class: k-Nearest Neighbors Classification
svc: Support Vector Classifier
random_forest_class: Random Forest Classification
xgb_class: Extreme Gradient Boosting Classification
lgb_class_sk: LightGBM Classification
kmeans: KMeans Clustering
exp_smooth: Exponential Smoothing
pca: Principal Component Analysis
More information on model types can be found under OctaiPipe Models.
load_existing
Boolean. If set to True, an existing trained model is loaded from cloud storage; otherwise a new model is built.
name
The trained model will be registered under this name in the storage account. We suggest that the model name relates to the use-case. For example, a name like machine_degradation will help organize models.
model_load_specs
name: name of the trained model to load from the storage account. If not provided, the name of model_specs (see above) is used instead.
version: version of the model of a given name to load.
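The name fallback described above can be sketched in a few lines of Python. This is only an illustration of the documented rule, not OctaiPipe's own code: the function name resolve_model_to_load is hypothetical, and the config is assumed to have already been parsed into a dictionary.

```python
def resolve_model_to_load(model_specs: dict) -> tuple:
    """Return the (name, version) pair identifying the model to load.

    Hypothetical helper: mirrors the documented rule that the name in
    model_load_specs is optional and defaults to the model_specs name.
    """
    load_specs = model_specs.get("model_load_specs") or {}
    # Fall back to the top-level model name when no load-specific name is given.
    name = load_specs.get("name", model_specs["name"])
    version = load_specs.get("version")
    return name, version


# model_specs as it appears in the step config example (no name under model_load_specs)
model_specs = {
    "type": "ridge_reg",
    "load_existing": True,
    "name": "pytest_RF",
    "model_load_specs": {"version": "000"},
}
print(resolve_model_to_load(model_specs))
```

Here the loaded model would be pytest_RF at version '000', because model_load_specs does not set its own name.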
params
These are the model hyperparameters, where each key-value pair in the dictionary is a parameter and its corresponding value, e.g. n_jobs=2 for a random forest regressor.
Run Specs
    run_specs:
      save_results: True
      target_label: accel_x
      grid_search:
        do_grid_search: false
        grid_definition: ./configs/model_grids/random_forest_grid.yml
        metric: 'mse'
This section specifies the configuration of the training run.
save_results
If set to True, the trained model will be saved to an Azure Storage Account.
target_label
Name of the target variable. This column is removed from the input data and used as the output (target) set for supervised learning. It is not required for unsupervised learning such as clustering.
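As a rough illustration of what removing the target column means, the sketch below splits a few records into features and target using plain Python. The helper name split_features_target and the columns accel_y and temperature are made up for the example; only accel_x comes from the config above, and the actual step may well operate on a dataframe instead.

```python
def split_features_target(rows, target_label):
    """Split each record into its features (all other columns) and its target value."""
    features = [
        {key: value for key, value in row.items() if key != target_label}
        for row in rows
    ]
    target = [row[target_label] for row in rows]
    return features, target


# Hypothetical sensor records; accel_x is the target_label from the config.
rows = [
    {"accel_x": 0.12, "accel_y": 0.03, "temperature": 21.5},
    {"accel_x": 0.15, "accel_y": 0.01, "temperature": 21.7},
]
X, y = split_features_target(rows, "accel_x")
print(X)  # feature records without the accel_x column
print(y)  # the accel_x values, in the same row order
```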
grid_search
Specifications for grid search of model hyperparameters.
do_grid_search: boolean indicating whether to run a grid search.
grid_definition: the path to the file specifying the search space of the model hyperparameters. See example YAML for LightGBM below:
    num_leaves:
      - 50
      - 100
      - 500
      - 1000
    max_depth:
      - 10
      - 15
    min_data_in_leaf:
      - 10
      - 50
      - 200
    learning_rate:
      - 0.01
      - 0.05
      - 0.1
    n_estimators:
      - 10
      - 50
metric: the evaluation metric to use for the search. Can take values:
  mse: mean squared error
  mae: mean absolute error
  f1_score: f1 score
  roc_auc: ROC-AUC
  recall: recall
  precision: precision
  accuracy: accuracy
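To get a feel for the size of a search space, the LightGBM grid above can be expanded into individual hyperparameter combinations. The sketch below assumes an exhaustive search over the Cartesian product of the listed values, which is the usual meaning of grid search; whether OctaiPipe evaluates every combination is not stated here. The expand_grid helper is hypothetical.

```python
from itertools import product

# The LightGBM grid from the example YAML, written as a Python dictionary.
grid = {
    "num_leaves": [50, 100, 500, 1000],
    "max_depth": [10, 15],
    "min_data_in_leaf": [10, 50, 200],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [10, 50],
}


def expand_grid(grid):
    """Return one dict per combination of hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(grid)
print(len(candidates))  # 4 * 2 * 3 * 3 * 2 = 144 combinations
print(candidates[0])    # first combination: the first value of every list
```

Under that assumption the search would cover 144 candidate models, so each extra value in a list multiplies the number of fits.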
Notes for multiclass classification:
When the metric is f1_score, recall or precision and more than two classes are detected in the target variable, the scikit-learn metric function is called with average set to weighted. With two classes, the default average, binary, is used.
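The rule for choosing the averaging mode can be written as a small function. This is a minimal sketch of the behaviour described in the note, not the library's implementation; the name choose_average is hypothetical, and counting distinct labels is an assumption about how classes are detected.

```python
def choose_average(y_true):
    """Pick the value for scikit-learn's `average` argument from the target labels."""
    n_classes = len(set(y_true))
    # More than two classes: weight each class's score by its support.
    # Otherwise keep scikit-learn's default for binary targets.
    return "weighted" if n_classes > 2 else "binary"


print(choose_average([0, 1, 1, 0]))          # two classes
print(choose_average(["a", "b", "c", "a"]))  # three classes
```

The returned string would then be passed on to the metric, for example as the average argument of sklearn.metrics.f1_score.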