Model Training Step#

To run locally: model_training

This step fits a machine learning model to training data. First, an existing model is loaded from cloud storage, or a new model is built. Training data is then loaded and checked against the model metadata for compatibility. If compatible, the model is (re-)trained on the training data. The trained model is then stored locally and, optionally, registered and uploaded to cloud storage together with its model metadata. The user specifies the training data, the type of model to use, and whether to train a new model or re-train an existing one.
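The load-or-build, check, fit, and store sequence above can be sketched in plain Python. This is an illustrative sketch only, not the OctaiPipe implementation: scikit-learn's Ridge stands in for the model, joblib for local storage, and the metadata compatibility check is reduced to a feature-count comparison.

```python
import os

import joblib
import numpy as np
from sklearn.linear_model import Ridge


def train_step(X, y, model_path="model.joblib", load_existing=False):
    """Sketch of the training step: load-or-build, check, fit, store."""
    # Load an existing model from storage, or build a new one
    if load_existing and os.path.exists(model_path):
        model = joblib.load(model_path)
    else:
        model = Ridge(alpha=2.0)
    # Compatibility check: a previously fitted model must expect the
    # same number of features as the new training data provides
    if hasattr(model, "n_features_in_") and model.n_features_in_ != X.shape[1]:
        raise ValueError("Training data incompatible with stored model")
    model.fit(X, y)                  # (re-)train
    joblib.dump(model, model_path)   # store locally
    return model


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
model = train_step(X, y)
```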

The following is an example of a config file together with descriptions of its parts.

Step config example#

name: model_training

input_data_specs:
  datastore_type: influxdb
  query_type: dataframe
  query_template_path: ./configs/data/influx_query.txt
  query_values:
    start: "2020-05-20T13:30:00.000Z"
    stop: "2020-05-20T13:31:00.000Z"
    bucket: sensors-raw
    measurement: cat
    tags: {}
  data_converter: {}

output_data_specs:
  - datastore_type: influxdb
    settings:
      bucket: test-bucket-1
      measurement: testv1

model_specs:
  type: ridge_reg
  load_existing: false
  name: pytest_RF
  model_load_specs:
    version: '000'
  params:
    alpha: 2.0

run_specs:
  save_results: True
  target_label: accel_x
  grid_search:
    do_grid_search: false
    grid_definition: ./configs/model_grids/ridge_reg_grid.yml
    metric: 'mse'
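The grid_definition path in run_specs points to a YAML file holding the hyperparameter grid to search when do_grid_search is true. The exact schema is defined by OctaiPipe; purely as an illustration, a grid for ridge_reg might map each hyperparameter to a list of candidate values:

```yaml
# Illustrative sketch only -- the actual schema of
# ./configs/model_grids/ridge_reg_grid.yml may differ
alpha: [0.1, 1.0, 10.0]
fit_intercept: [true, false]
```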

Input and Output Data Specs#

input_data_specs and output_data_specs follow a standard format for all the pipeline steps; see Octaipipe Steps.

Model Specs#

model_specs:
  type: random_forest_reg
  load_existing: false
  name: pytest_RF
  model_load_specs:
    name: pytest_RF
    version: '000'
  params:
    n_jobs: 2
    min_samples_split: 4

This section specifies the model to be trained.

type#

A unique designation of the model type, as follows:

  • ex_lin_reg: Linear Regression

  • ridge_reg: Ridge Regression

  • enet_reg: Elastic Net Regression

  • random_forest_reg: Random Forest Regression

  • tweedie_reg: Tweedie Regression

  • quantile_reg: Quantile Regression

  • svr: Support Vector Regression

  • sgd_reg: Stochastic Gradient Descent Regression

  • bayes_ridge_reg: Bayesian Ridge Regression

  • ard_reg: ARD Regression

  • gbm_reg: Gradient Boosting Regression

  • xgb_reg: Extreme Gradient Boosting Regression

  • skopt_lgbm_reg: Skopt LightGBM Model

  • lgb_reg_sk: LightGBM Regression

  • log_reg: Logistic Regression

  • knn_class: k-Nearest Neighbors Classification

  • svc: Support Vector Classifier

  • random_forest_class: Random Forest Classification

  • xgb_class: Extreme Gradient Boosting Classification

  • lgb_class_sk: LightGBM Classification

  • kmeans: KMeans Clustering

  • exp_smooth: Exponential Smoothing

  • pca: Principal Component Analysis

More information on model types can be found under OctaiPipe Models.


load_existing#

A boolean variable: if set to True, an existing trained model is loaded instead of a new one being built.


name#

The trained model will be registered under this name in the storage account. We suggest that the model name relate to the use case; for example, a name like machine_degradation will help organize models.


model_load_specs#

  • name: name of the trained model to load from the storage account. If not provided, the name from model_specs (see above) is used instead.

  • version: version of the model of a given name to load.


params#

The model hyperparameters: each key-value pair in the dictionary is a parameter and its corresponding value, e.g. n_jobs=2 for a random forest regressor.
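Assuming the scikit-learn backend these model names suggest, the params mapping is passed to the model constructor as keyword arguments, so the model_specs example above configures its estimator equivalently to:

```python
from sklearn.ensemble import RandomForestRegressor

# params from the model_specs example above
params = {"n_jobs": 2, "min_samples_split": 4}

# Keyword-expanding the dictionary sets each hyperparameter
model = RandomForestRegressor(**params)
```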

Run Specs#

run_specs:
  save_results: True
  target_label: accel_x
  grid_search:
    do_grid_search: false
    grid_definition: ./configs/model_grids/random_forest_grid.yml
    metric: 'mse'

Specifies the configuration needed for the model run.


save_results#

If set to True, the trained model will be saved to an Azure Storage Account.


target_label#

Name of the target variable. This column is removed from the input data and used as the output set for supervised learning. Not required for unsupervised learning such as clustering.
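For illustration, the split that target_label implies can be reproduced with pandas (hypothetical data; accel_x as the target, per the example config):

```python
import pandas as pd

# Hypothetical sensor readings; accel_x is the target_label
df = pd.DataFrame({
    "accel_x": [0.10, 0.12, 0.15],
    "accel_y": [1.00, 1.02, 0.98],
    "accel_z": [9.79, 9.81, 9.80],
})

target_label = "accel_x"
y = df[target_label]                   # output set (labels)
X = df.drop(columns=[target_label])    # input features
```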