AutoML in OctaiPipe
AutoML helps data scientists quickly set up data science pipelines to train an assortment of models. This can be useful if there is a need to test multiple models on the same dataset. The user details the input data and which models to test, and the AutoML pipeline returns which model showed the best performance.
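The comparison step described above can be pictured with a minimal sketch. This is not OctaiPipe's implementation: the validation targets, the predictions, and the helper names `rmse` and `pick_best_model` are hypothetical and only illustrate how candidate models might be ranked on a shared metric.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pick_best_model(scores, lower_is_better=True):
    """Return the name of the model with the best score."""
    chooser = min if lower_is_better else max
    return chooser(scores, key=scores.get)


# Hypothetical validation targets and predictions from three candidate models.
y_val = [10.0, 8.0, 6.0, 4.0]
predictions = {
    "ridge_reg": [9.0, 8.5, 6.5, 3.0],
    "lgb_reg_sk": [10.0, 8.0, 5.5, 4.5],
    "random_forest_reg": [12.0, 9.0, 7.0, 6.0],
}

# Score every candidate on the same data, then keep the lowest RMSE.
scores = {name: rmse(y_val, y_pred) for name, y_pred in predictions.items()}
best = pick_best_model(scores)
print(best, round(scores[best], 3))
```

For error metrics such as rmse, lower is better; for a metric such as accuracy the sketch would be called with `lower_is_better=False`.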
OctaiPipe has its own AutoML implementation centered on Remaining Useful Life (RUL) estimation. The user specifies a data input and which models to compare. The data then goes through a preprocessing and feature engineering pipeline developed by data scientists at T-DAB. Obtaining the best pipeline and model still requires experimentation and careful tuning, but this setup applies reasonable rules and heuristics based on our experience with RUL estimation. It is a good alternative when the goal is to minimize the time from data collection to a trained model.
AutoML configs
The following section goes through each field in the AutoML config and details how to fill it out.
Config example for AutoML
name: automl

input_data_specs:
  datastore_type: influxdb
  query_type: dataframe
  query_template_path: ./configs/data/influx_query.txt
  query_values:
    start: "2020-05-20T13:30:00.000Z"
    stop: "2020-05-20T13:31:00.000Z"
    bucket: sensors-raw
    measurement: cat
    tags: {}
  data_converter: {}

automl_specs:
  make_target: true # whether to create RUL target for data
  initial_RUL: 50 # information on initial RUL and RUL classes can be found here: https://www.docs.octaipipe.ai/usage/steps/automl.html
  rul_classes: 3 # see above link
  cycle_field: cycle_id # data field determining cycle of operation
  solutiontype: regression # regression/classification
  model_name: automl_test # name of model
  model_types:
    - ridge_reg
    - lgb_reg_sk
    - random_forest_reg
  metric: rmse # metric to evaluate model on
  target_label: RUL # name of target if make_target is false
Input data specs
The first section in the AutoML config is the input_data_specs dictionary, which describes the source of data for the pipeline. For information on how to fill out this configuration, see the section on input_data_specs in the OctaiPipe Steps page.
The input should be raw data that is ready to go into preprocessing and feature engineering. The only data fields in the incoming dataset should be ones used in the model training.
If the data is cyclic, meaning one input dataset contains multiple cycles of operation, the dataset also has to contain a cycle field. The name of this cycle field has to be specified in the automl_specs under cycle_field.
The input data can already contain an RUL target field, in which case its name must be given in the target_label field. When target_label is set this way, do not set make_target to true; otherwise an additional RUL field is added to the data and the target_label setting is overwritten with RUL.
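To make the roles of cycle_field, initial_RUL and the target field more concrete, here is a minimal sketch of how a clipped RUL target could be built per cycle. It is an illustration under stated assumptions, not OctaiPipe's actual preprocessing: it assumes that RUL counts the rows remaining until the end of each cycle and is capped at the clipping value, and the function name `add_rul_target` and the `temp` field are hypothetical.

```python
from itertools import groupby


def add_rul_target(rows, cycle_field="cycle_id", initial_rul=50, target_label="RUL"):
    """Return copies of the rows with a clipped RUL column added per cycle.

    Assumption: RUL is the number of rows left until the cycle ends,
    capped at initial_rul. Rows must be ordered by cycle, then by time.
    """
    labelled = []
    # groupby only merges consecutive rows, hence the ordering requirement.
    for _, group in groupby(rows, key=lambda row: row[cycle_field]):
        group = list(group)
        last_index = len(group) - 1
        for i, row in enumerate(group):
            remaining = last_index - i
            labelled.append({**row, target_label: min(initial_rul, remaining)})
    return labelled


# Two hypothetical cycles of operation with a single sensor field.
rows = [{"cycle_id": 1, "temp": t} for t in (20, 21, 23, 27)]
rows += [{"cycle_id": 2, "temp": t} for t in (19, 22, 26)]

labelled = add_rul_target(rows, initial_rul=2)
rul = [row["RUL"] for row in labelled]
print(rul)
```

In this sketch the target restarts at every new value of the cycle field, which is why a dataset containing several cycles needs that field to be present.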
AutoML specs
Below follows each field in the automl_specs and the expected input from the user.
make_target: Whether or not to create an RUL target
initial_RUL: The clipping value of the RUL target. More information on how to set this can be found in Setting the RUL clipping level
rul_classes: The number of classes for a categorical RUL target. Only 2 or 3 are valid inputs
cycle_field: The column name of the field identifying which cycle a row belongs to
solutiontype: Whether to use classification or regression
model_name: What to name the model, e.g. automl_test_rul
model_types: The list of models to compare. Model types can be found at OctaiPipe Models
metric: The metric used to evaluate the models. For regression: rmse, mse or mae. For classification: accuracy, precision, recall, f1-score or roc_auc
target_label: The name of the target label if make_target is false
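The constraints listed above can be checked before a config is submitted. The following is a hypothetical pre-flight check, not part of OctaiPipe: the function name `check_automl_specs` is made up for illustration, and applying the rul_classes rule only to classification is an assumption.

```python
REGRESSION_METRICS = {"rmse", "mse", "mae"}
CLASSIFICATION_METRICS = {"accuracy", "precision", "recall", "f1-score", "roc_auc"}


def check_automl_specs(specs):
    """Return a list of problems found in an automl_specs dictionary."""
    problems = []
    solution = specs.get("solutiontype")

    if solution == "regression":
        allowed = REGRESSION_METRICS
    elif solution == "classification":
        allowed = CLASSIFICATION_METRICS
        # Assumption: rul_classes only matters for a categorical target.
        if specs.get("rul_classes") not in (2, 3):
            problems.append("rul_classes must be 2 or 3")
    else:
        allowed = set()
        problems.append("solutiontype must be 'regression' or 'classification'")

    if allowed and specs.get("metric") not in allowed:
        problems.append(f"metric must be one of {sorted(allowed)} for {solution}")
    if not specs.get("model_types"):
        problems.append("model_types must list at least one model")
    if not specs.get("make_target") and not specs.get("target_label"):
        problems.append("target_label is required when make_target is false")
    return problems


# The automl_specs from the config example, written as a Python dictionary.
good = {
    "make_target": True,
    "initial_RUL": 50,
    "rul_classes": 3,
    "cycle_field": "cycle_id",
    "solutiontype": "regression",
    "model_name": "automl_test",
    "model_types": ["ridge_reg", "lgb_reg_sk", "random_forest_reg"],
    "metric": "rmse",
    "target_label": "RUL",
}
# A broken variant: classification metric on a regression task, no target.
bad = dict(good, metric="accuracy", make_target=False, target_label=None)

print(check_automl_specs(good))
print(check_automl_specs(bad))
```

Returning a list of messages rather than raising on the first error lets a user fix every problem in one pass.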