AutoML in OctaiPipe

AutoML helps data scientists quickly set up data science pipelines to train an assortment of models. This can be useful if there is a need to test multiple models on the same dataset. The user details the input data and which models to test, and the AutoML pipeline returns which model showed the best performance.

OctaiPipe has its own AutoML implementation centered on Remaining Useful Life (RUL) estimation. The user specifies a data input and which models to compare. The data goes through a preprocessing and feature engineering pipeline developed by data scientists at T-DAB. Obtaining the best possible pipeline and model still requires experimentation and careful tuning, but this setup applies reasonable rules and heuristics based on our experience with RUL estimation. It is a good alternative when the goal is to minimize the time from data collection to a trained model.

AutoML configs

The following section goes through each field in the automl config and details how to fill it out.

Config example for AutoML

name: automl

input_data_specs:
  datastore_type: influxdb
  query_type: dataframe
  query_template_path: ./configs/data/influx_query.txt
  query_values:
    start: "2020-05-20T13:30:00.000Z"
    stop: "2020-05-20T13:31:00.000Z"
    bucket: sensors-raw
    measurement: cat
    tags: {}
  data_converter: {}

automl_specs:
  make_target: true # whether to create RUL target for data
  initial_RUL: 50 # information on initial RUL and RUL classes can be found here: https://www.docs.octaipipe.ai/usage/steps/automl.html
  rul_classes: 3 # see above link
  cycle_field: cycle_id # data field determining cycle of operation
  solutiontype: regression # regression/classification
  model_name: automl_test # name of model
  model_types:
    - ridge_reg
    - lgb_reg_sk
    - random_forest_reg
  metric: rmse # metric to evaluate model on
  target_label: RUL # name of target if make_target is false

Input data specs

The first section in the AutoML config is the input_data_specs dictionary, which describes the source of data for the pipeline. For information on how to fill out this configuration, see the section on input_data_specs in the OctaiPipe Steps page.

The input should be raw data that is ready to go into preprocessing and feature engineering. The only data fields in the incoming dataset should be ones used in the model training.

If the data is cyclic, meaning one input dataset contains multiple cycles of operation, the dataset also has to contain a cycle field. The name of this cycle field has to be specified in the automl_specs under cycle_field.
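To make the cyclic layout concrete, here is a minimal sketch in plain Python that groups rows by the cycle field. The `temperature` column, the row values, and the `split_by_cycle` helper are invented for illustration and are not part of OctaiPipe; only the `cycle_id` field name comes from the config example above.

```python
from collections import defaultdict

# Hypothetical rows from a cyclic dataset: every row carries the cycle it
# belongs to in the field named by cycle_field ("cycle_id" in the example config).
rows = [
    {"cycle_id": 1, "temperature": 20.1},
    {"cycle_id": 1, "temperature": 20.7},
    {"cycle_id": 1, "temperature": 22.3},
    {"cycle_id": 2, "temperature": 19.8},
    {"cycle_id": 2, "temperature": 21.0},
]


def split_by_cycle(rows, cycle_field):
    """Group rows into one list per cycle, preserving row order."""
    cycles = defaultdict(list)
    for row in rows:
        if cycle_field not in row:
            raise KeyError(f"row is missing cycle field '{cycle_field}'")
        cycles[row[cycle_field]].append(row)
    return dict(cycles)


cycles = split_by_cycle(rows, cycle_field="cycle_id")
cycle_lengths = {cycle: len(members) for cycle, members in cycles.items()}
print(cycle_lengths)
```

A dataset without the named field would raise a `KeyError` here, which mirrors the requirement that cyclic input must contain the field given in cycle_field.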

The input data can already contain an RUL target field, in which case the name of that field must be given in target_label. When target_label is used this way, do not set make_target to True; otherwise an additional RUL field will be added and target_label will be overwritten as RUL.

AutoML specs

Each field in automl_specs and its expected input is described below.

make_target: Whether or not to make an RUL target
initial_RUL: The clipping value of the RUL target. For guidance on choosing this value, see Setting the RUL clipping level
rul_classes: The number of classes for a categorical RUL target. Only 2 or 3 are valid inputs
cycle_field: The column name of the field identifying which cycle a row belongs to
solutiontype: Whether to use classification or regression
model_name: What to name the model, e.g. automl_test_rul
model_types: List of models to compare. Available model types are listed in OctaiPipe Models
metric: The metric used to evaluate the models. For regression: rmse, mse or mae. For classification: accuracy, precision, recall, f1-score or roc_auc
target_label: The name of the target label if make_target is False.
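The constraints in the field list above can be checked before a pipeline is launched. The sketch below is a hypothetical helper, not an OctaiPipe function: `check_automl_specs` and its messages are made up for illustration, and it assumes that target_label is required whenever make_target is false, which the list implies but does not state outright.

```python
# Allowed metrics per solution type, taken from the field list above.
REGRESSION_METRICS = {"rmse", "mse", "mae"}
CLASSIFICATION_METRICS = {"accuracy", "precision", "recall", "f1-score", "roc_auc"}


def check_automl_specs(specs):
    """Return a list of problems found in an automl_specs dictionary."""
    problems = []

    solution = specs.get("solutiontype")
    if solution == "regression":
        allowed = REGRESSION_METRICS
    elif solution == "classification":
        allowed = CLASSIFICATION_METRICS
    else:
        allowed = set()
        problems.append("solutiontype must be 'regression' or 'classification'")

    if allowed and specs.get("metric") not in allowed:
        problems.append(f"metric must be one of {sorted(allowed)} for {solution}")

    # Only 2 or 3 classes are valid for a categorical RUL target.
    if "rul_classes" in specs and specs["rul_classes"] not in (2, 3):
        problems.append("rul_classes must be 2 or 3")

    if not specs.get("model_types"):
        problems.append("model_types must list at least one model")

    # Assumption: a target name is needed when no RUL target is generated.
    if not specs.get("make_target") and not specs.get("target_label"):
        problems.append("target_label is required when make_target is false")

    return problems


# The automl_specs section of the config example, as a Python dictionary.
example_specs = {
    "make_target": True,
    "initial_RUL": 50,
    "rul_classes": 3,
    "cycle_field": "cycle_id",
    "solutiontype": "regression",
    "model_name": "automl_test",
    "model_types": ["ridge_reg", "lgb_reg_sk", "random_forest_reg"],
    "metric": "rmse",
    "target_label": "RUL",
}

problems_ok = check_automl_specs(example_specs)
problems_bad = check_automl_specs({**example_specs, "metric": "accuracy", "rul_classes": 4})
print(problems_ok)
print(problems_bad)
```

The example config passes with no problems, while swapping in a classification metric for a regression run and setting rul_classes to 4 yields two messages.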

Setting the RUL clipping level
- Following Heimes (2008)
- Experimentation of RUL clipping
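The clipped target associated with Heimes (2008) is piecewise linear: RUL is held constant at a cap early in a unit's life and then decreases linearly to the end of the cycle. The sketch below illustrates that idea with initial_RUL as the cap. It assumes each cycle runs to failure, has one row per time step, and that the final row has an RUL of 0; the `clipped_rul` helper is hypothetical, and the exact computation inside OctaiPipe may differ.

```python
def clipped_rul(cycle_length, initial_rul):
    """Piecewise-linear RUL for one run-to-failure cycle.

    The last row gets RUL 0 and earlier rows count up by one per row,
    but no value is allowed to exceed initial_rul (the clipping level).
    """
    return [min(initial_rul, cycle_length - 1 - i) for i in range(cycle_length)]


# One cycle of 8 rows with a clipping level of 3: the target stays flat
# at 3 for the early rows, then counts down to 0 at the end of the cycle.
short_cap = clipped_rul(cycle_length=8, initial_rul=3)
print(short_cap)  # [3, 3, 3, 3, 3, 2, 1, 0]

# With a clipping level longer than the cycle, no clipping takes place.
no_clip = clipped_rul(cycle_length=4, initial_rul=50)
print(no_clip)  # [3, 2, 1, 0]
```

A lower clipping level treats more of the early, healthy operation as indistinguishable, which is the trade-off that experimentation with the clipping value explores.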