Model Training Step#

To run locally: model_training

This step fits a machine learning model to training data. First, an existing model is loaded from cloud storage or a new model is built. The training data is then loaded and checked for compatibility with the model metadata. If compatible, the model is trained (or re-trained) on the data. The trained model is stored locally and, optionally, registered and uploaded to cloud storage together with its metadata. The user specifies the training data, the model type, and whether to train a new model or re-train an existing one.

The following is an example of a config file together with descriptions of its parts.

Step config example#

name: model_training

input_data_specs:
  default:
  - datastore_type: influxdb
    settings:
      query_type: dataframe
      query_template_path: ./configs/data/influx_query.txt
      query_config:
        start: "2020-05-20T13:30:00.000Z"
        stop: "2020-05-20T13:31:00.000Z"
        bucket: sensors-raw
        measurement: cat
        tags: {}

output_data_specs:
  default:
  - datastore_type: influxdb
    settings:
      bucket: test-bucket-1
      measurement: testv1

model_specs:
  type: ridge_reg
  load_existing: false
  name: pytest_RF
  model_load_specs:
    version: '000'
  params:
    alpha: 2.0

run_specs:
  save_results: True
  target_label: accel_x
  grid_search:
    do_grid_search: false
    grid_definition: ./configs/model_grids/ridge_reg_grid.yml
    metric: 'mse'

Input and Output Data Specs#

input_data_specs and output_data_specs follow a standard format shared by all pipeline steps; see OctaiPipe Steps.

Model Specs#

 model_specs:
     type: random_forest_reg
     load_existing: false
     name: pytest_RF
     model_load_specs:
         name: pytest_RF
         version: '000'
     params:
         n_jobs: 2
         min_samples_split: 4

This section specifies the model you would like to be trained.

`type`#

A unique identifier for the model type. The supported values are:

ex_lin_reg: Linear Regression
ridge_reg: Ridge Regression
enet_reg: Elastic Net Regression
random_forest_reg: Random Forest Regression
tweedie_reg: Tweedie Regression
quantile_reg: Quantile Regression
svr: Support Vector Regression
sgd_reg: Stochastic Gradient Descent Regression
bayes_ridge_reg: Bayesian Ridge Regression
ard_reg: ARD Regression
gbm_reg: Gradient Boosting Regression
xgb_reg: Extreme Gradient Boosting Regression
skopt_lgbm_reg: Skopt LightGBM Model
lgb_reg_sk: LightGBM Regression
log_reg: Logistic Regression
knn_class: k-Nearest Neighbors Classification
svc: Support Vector Classifier
random_forest_class: Random Forest Classification
xgb_class: Extreme Gradient Boosting Classification
lgb_class_sk: LightGBM Classification
kmeans: K-Means Clustering
exp_smooth: Exponential Smoothing
pca: Principal Component Analysis

More information on model types can be found under OctaiPipe Models.

`load_existing`#

Boolean. If set to True, an existing trained model is loaded and re-trained; if set to False, a new model is built.

`name`#

The trained model will be registered under this name in the storage account. We suggest that the model name relates to the use-case. For example, a name like machine_degradation will help organize models.

`model_load_specs`#

name: name of the trained model to load from the storage account. If not provided, the name of model_specs (see above) is used instead.
version: version of the model of a given name to load.
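
The name fallback described above can be sketched as follows. This is a minimal illustration, not OctaiPipe's actual implementation: `resolve_load_target` is a hypothetical helper, and `model_specs` is assumed to be the parsed model_specs section as a plain dictionary.

```python
def resolve_load_target(model_specs):
    """Return the (name, version) of the model to load, or None.

    Hypothetical helper: mirrors the documented rule that
    model_load_specs.name falls back to model_specs.name.
    """
    if not model_specs.get("load_existing", False):
        return None  # a new model is built, so there is nothing to load
    load_specs = model_specs.get("model_load_specs", {})
    # Use the explicit name if given, otherwise the top-level model name.
    name = load_specs.get("name", model_specs["name"])
    version = load_specs.get("version")
    return name, version


specs = {
    "type": "ridge_reg",
    "load_existing": True,
    "name": "pytest_RF",
    "model_load_specs": {"version": "000"},
}
print(resolve_load_target(specs))
```

Here model_load_specs has no name of its own, so the top-level name pytest_RF is used together with version '000'.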

`params`#

These are the model hyperparameters, where each key-value pair in the dictionary is a parameter and its corresponding value, e.g. n_jobs=2 for a random forest regressor.
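
As a rough illustration of how such a dictionary maps onto a model, the sketch below passes the params from the example config to a scikit-learn random forest. It assumes that random_forest_reg corresponds to scikit-learn's `RandomForestRegressor` and that each key is forwarded as a keyword argument; neither is confirmed here.

```python
from sklearn.ensemble import RandomForestRegressor

# The `params` section from the config, as a parsed dictionary.
params = {"n_jobs": 2, "min_samples_split": 4}

# Assumption: each key-value pair becomes a keyword argument of the estimator.
model = RandomForestRegressor(**params)

# The estimator reports the hyperparameters it was built with.
print(model.get_params()["n_jobs"])
print(model.get_params()["min_samples_split"])
```

A key that the estimator does not accept would raise a `TypeError` at construction time, so parameter names should match those of the chosen model type.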

Run Specs#

 run_specs:
     save_results: True
     target_label: accel_x
     grid_search:
         do_grid_search: false
         grid_definition: ./configs/model_grids/random_forest_grid.yml
         metric: 'mse'

Specifies the configuration of the training run.

`save_results`#

If set to True, the trained model will be saved to an Azure Storage Account.

`target_label`#

Name of the target variable. This column is removed from the input data and used as the target (output) set for supervised learning. Not required for unsupervised learning such as clustering.
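
The split into features and target can be pictured with pandas as below. This is a sketch under assumptions: the toy data and the column names other than accel_x are made up, and the step's internal handling may differ.

```python
import pandas as pd

target_label = "accel_x"

# Toy input data; every column except accel_x is a made-up example.
data = pd.DataFrame({
    "accel_x": [0.1, 0.4, 0.3],
    "accel_y": [1.0, 1.2, 0.9],
    "temperature": [20.5, 21.0, 20.8],
})

# Features: all columns except the target.
X = data.drop(columns=[target_label])
# Target: the single column named by target_label.
y = data[target_label]

print(list(X.columns))
print(y.name)
```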

`grid_search`#

Specifications for grid search of model hyperparameters.

do_grid_search: boolean indicating whether to run a grid search.
grid_definition: the path to the file specifying the search space of the model hyperparameters. See example YAML for LightGBM below:

 num_leaves:
   - 50
   - 100
   - 500
   - 1000
 max_depth:
   - 10
   - 15
 min_data_in_leaf:
   - 10
   - 50
   - 200
 learning_rate:
   - 0.01
   - 0.05
   - 0.1
 n_estimators:
   - 10
   - 50

metric: the evaluation metric to use for the search. Accepted values:
- mse: mean squared error
- mae: mean absolute error
- f1_score: f1 score
- roc_auc: ROC-AUC
- recall: recall
- precision: precision
- accuracy: accuracy
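
To get a feel for the size of the search space defined by the LightGBM grid above, the sketch below expands it into individual hyperparameter combinations. It assumes an exhaustive search over the Cartesian product of the listed values; `expand_grid` is a hypothetical helper, and the grid is written as a Python dictionary rather than read from the YAML file.

```python
from itertools import product

# The LightGBM grid from the example, as a parsed dictionary.
grid = {
    "num_leaves": [50, 100, 500, 1000],
    "max_depth": [10, 15],
    "min_data_in_leaf": [10, 50, 200],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [10, 50],
}


def expand_grid(grid):
    """Return one dict per combination of hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(grid)
print(len(candidates))  # 4 * 2 * 3 * 3 * 2 = 144 combinations
print(candidates[0])
```

Each combination corresponds to one model fit (more if cross-validation is used), so adding values to the grid multiplies the training cost.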

Note for multiclass classification: when the metric is f1_score, recall or precision and more than 2 classes are detected in the target variable, the scikit-learn metric function is called with average set to weighted. For 2 classes, the default average of binary is used.
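
The rule for choosing the averaging mode can be sketched in a few lines. `choose_average` is a hypothetical helper written for illustration only; how the step actually detects the number of classes is not documented here.

```python
def choose_average(y_true):
    """Pick the averaging mode from the number of distinct classes."""
    n_classes = len(set(y_true))
    # More than 2 classes: weight each class's score by its support.
    return "weighted" if n_classes > 2 else "binary"


print(choose_average([0, 1, 1, 0]))
print(choose_average(["low", "medium", "high", "low"]))
```

The returned string corresponds to the `average` argument of scikit-learn's `f1_score`, `recall_score` and `precision_score`.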
