AutoML in OctaiPipe

AutoML helps data scientists quickly set up data science pipelines to train an assortment of models. This can be useful if there is a need to test multiple models on the same dataset. The user details the input data and which models to test, and the AutoML pipeline returns which model showed the best performance.

OctaiPipe also has its own AutoML implementation centered on Remaining Useful Life (RUL) estimation. It allows the user to specify a data input and which models to compare. The data goes through a preprocessing and feature engineering pipeline developed by data scientists at T-DAB. Obtaining the best possible pipeline and model requires experimentation and careful tuning, but this setup applies reasonable rules and heuristics based on our experience with RUL estimation. It is a good alternative if the goal is to minimize the time from data collection to a trained model.

The OctaiKube documentation includes a notebook showing how to run AutoML from OctaiKube. It details how to set up your OctaiKube environment and which functions are available in octaikube.automl. This guide covers the AutoML pipeline itself and how to set up the AutoML configuration file.

AutoML configs

The following section goes through each field in the AutoML config and explains how to fill it out.

Config example for AutoML

name: automl

input_data_specs:
  datastore_type: influxdb
  query_type: dataframe
  query_template_path: ./configs/data/influx_query.txt
  query_values:
    start: "2020-05-20T13:30:00.000Z"
    stop: "2020-05-20T13:31:00.000Z"
    bucket: sensors-raw
    measurement: cat
    tags: {}
  data_converter: {}

automl_specs:
  make_target: true # whether to create RUL target for data
  initial_RUL: 50 # information on initial RUL and RUL classes can be found here: https://www.docs.octaipipe.ai/usage/steps/automl.html
  rul_classes: 3 # see above link
  cycle_field: cycle_id # data field determining cycle of operation
  solutiontype: regression # regression/classification
  model_name: automl_test # name of model
  model_types:
    - ridge_reg
    - lgb_reg_sk
    - random_forest_reg
  metric: rmse # metric to evaluate model on
  target_label: RUL # name of target if make_target is false

Input data specs

The first section in this AutoML config is the input_data_specs dictionary. This details the source of data for the pipeline. For information on how to fill out this configuration, see the section on input_data_specs in the OctaiPipe Steps page.

The input should be raw data that is ready to go into preprocessing and feature engineering. The only data fields in the incoming dataset should be ones used in the model training.

If the data is cyclic, meaning one input dataset contains multiple cycles of operation, the dataset also has to contain a cycle field. The name of this cycle field has to be specified in the automl_specs under cycle_field.
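To illustrate why the cycle field matters when an RUL target is created, here is a minimal sketch in plain Python. It is not OctaiPipe's implementation: the function name `add_clipped_rul` is hypothetical, and it assumes that rows are in time order within each cycle, that the last row of a cycle marks end of life (RUL = 0), and that RUL is counted in rows and clipped at `initial_rul`.

```python
from collections import defaultdict


def add_clipped_rul(rows, cycle_field="cycle_id", initial_rul=50, target="RUL"):
    """Return copies of `rows` with a clipped RUL value added per cycle.

    Hypothetical helper for illustration only. Assumes rows are ordered in
    time within each cycle and that each cycle ends at failure (RUL = 0).
    """
    # First pass: count how many rows belong to each cycle.
    cycle_lengths = defaultdict(int)
    for row in rows:
        if cycle_field not in row:
            raise KeyError(f"Row is missing cycle field '{cycle_field}'")
        cycle_lengths[row[cycle_field]] += 1

    # Second pass: remaining rows in the cycle, clipped at initial_rul.
    seen = defaultdict(int)
    labelled = []
    for row in rows:
        cycle = row[cycle_field]
        seen[cycle] += 1
        remaining = cycle_lengths[cycle] - seen[cycle]
        labelled.append({**row, target: min(initial_rul, remaining)})
    return labelled


# Two cycles of operation in one dataset: 5 rows, then 3 rows.
rows = [{"cycle_id": 1, "temp": 20.0 + i} for i in range(5)]
rows += [{"cycle_id": 2, "temp": 30.0 + i} for i in range(3)]

labelled = add_clipped_rul(rows, initial_rul=3)
print([r["RUL"] for r in labelled])  # [3, 3, 2, 1, 0, 2, 1, 0]
```

Without the cycle field, the two runs would be treated as a single 8-row sequence and the target would count down only once, which is why the field has to be present for cyclic data.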

The input data can contain an existing RUL target field, in which case the name of that field must be given in target_label. When target_label is used this way, do not set make_target to true; otherwise an additional RUL field will be added and target_label will be overwritten as RUL.
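The interaction between make_target and target_label can be checked before running the pipeline. The following is a minimal sketch, not part of OctaiPipe: `check_target_settings` is a hypothetical helper, and `columns` stands for the field names of the incoming dataset.

```python
def check_target_settings(columns, make_target, target_label=None):
    """Return a list of problems with the target settings (empty if none).

    Hypothetical pre-flight check based on the rules described above.
    """
    problems = []
    if make_target:
        # An existing target column would be shadowed by the generated RUL.
        if target_label and target_label in columns:
            problems.append(
                f"make_target is true but the data already contains "
                f"'{target_label}'; set make_target to false to use it"
            )
    else:
        if not target_label:
            problems.append("target_label must be set when make_target is false")
        elif target_label not in columns:
            problems.append(f"target_label '{target_label}' not found in the data")
    return problems


columns = ["cycle_id", "temp", "pressure", "my_rul"]

# Data already has a target: use it and do not create a new one.
print(check_target_settings(columns, make_target=False, target_label="my_rul"))  # []

# Conflicting settings: existing target plus make_target enabled.
print(len(check_target_settings(columns, make_target=True, target_label="my_rul")))  # 1
```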

AutoML specs

Below follows each field in the automl_specs and the expected input from the user.

  • make_target: Whether or not to create an RUL target.

  • initial_RUL: The clipping value of the RUL target. More information on how to set this can be found in Setting the RUL clipping level.

  • rul_classes: The number of classes for a categorical RUL target. Only 2 or 3 are valid inputs.

  • cycle_field: The column name of the field identifying which cycle a row belongs to.

  • solutiontype: Whether to use classification or regression.

  • model_name: What to name the model, e.g. automl_test_rul.

  • model_types: List of models to compare. Model types can be found at OctaiPipe Models.

  • metric: The metric used to evaluate the models. For regression: rmse, mse or mae. For classification: accuracy, precision, recall, f1-score or roc_auc.

  • target_label: The name of the target label if make_target is false.
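The constraints listed above can be expressed as a small validation routine. This is a sketch under stated assumptions rather than anything OctaiPipe provides: `validate_automl_specs` and `VALID_METRICS` are hypothetical names, and the config is assumed to be already loaded into a Python dict.

```python
# Allowed metrics per solution type, as listed in the field descriptions.
VALID_METRICS = {
    "regression": {"rmse", "mse", "mae"},
    "classification": {"accuracy", "precision", "recall", "f1-score", "roc_auc"},
}


def validate_automl_specs(specs):
    """Return a list of error messages for an automl_specs dict (hypothetical)."""
    errors = []

    solutiontype = specs.get("solutiontype")
    if solutiontype not in VALID_METRICS:
        errors.append("solutiontype must be 'regression' or 'classification'")
    elif specs.get("metric") not in VALID_METRICS[solutiontype]:
        allowed = ", ".join(sorted(VALID_METRICS[solutiontype]))
        errors.append(f"metric for {solutiontype} must be one of: {allowed}")

    # Only 2 or 3 classes are valid for a categorical RUL target.
    if "rul_classes" in specs and specs["rul_classes"] not in (2, 3):
        errors.append("rul_classes must be 2 or 3")

    model_types = specs.get("model_types")
    if not isinstance(model_types, list) or not model_types:
        errors.append("model_types must be a non-empty list of model names")

    return errors


# The automl_specs section of the example config, as a dict.
specs = {
    "make_target": True,
    "initial_RUL": 50,
    "rul_classes": 3,
    "cycle_field": "cycle_id",
    "solutiontype": "regression",
    "model_name": "automl_test",
    "model_types": ["ridge_reg", "lgb_reg_sk", "random_forest_reg"],
    "metric": "rmse",
    "target_label": "RUL",
}
print(validate_automl_specs(specs))  # []

# A classification metric combined with a regression solution type is flagged.
bad_specs = {**specs, "metric": "accuracy", "rul_classes": 4}
print(len(validate_automl_specs(bad_specs)))  # 2
```

The check does not verify that the entries in model_types are real model names, since the list of supported models lives in the OctaiPipe Models page.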