Preprocessing Step#

To run locally: preprocessing

Real-world data is messy: it is typically created, processed, and stored by a mix of people and automated processes. As a result, datasets often contain missing fields, manual input errors, duplicate records, and other defects. Data preprocessing is an essential step in a data science pipeline; it transforms raw data into a format that a statistical or machine learning model can use more easily and effectively.

steps#

OctaiPipe’s Preprocessing step provides the following options to preprocess raw data:

| Step | Description |
| --- | --- |
| interpolate_missing_data | Uses linear interpolation to impute missing numeric values in the data. Numeric value dtypes are converted to float by default. |
| remove_constant_columns | Removes numeric columns whose variance is at or below var_threshold, as well as categorical columns with only a single label. If var_threshold is not set, it defaults to zero. To be used in a training pipeline only. |
| make_target | Creates a Remaining Useful Life (RUL) target based on a linear or piece-wise linear degradation model (details below). |
| encode_categorical_columns | Encodes a specified set of categorical columns. |
| normalise | Scales a specified set of columns to a common scale. |
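The first two steps have straightforward pandas analogues. As a rough sketch of what they do (the column names and the inclusive variance threshold are illustrative assumptions, not OctaiPipe internals):

```python
import pandas as pd

# Toy frame with a gap to interpolate and a constant column to drop
df = pd.DataFrame({
    "temp": [21.0, None, 23.0, 24.0],  # missing value
    "unit": [1, 1, 1, 1],              # zero variance
})

# interpolate_missing_data: linear interpolation of numeric gaps
df["temp"] = df["temp"].astype(float).interpolate(method="linear")

# remove_constant_columns: drop numeric columns whose variance is at or
# below var_threshold (assumed inclusive, matching the zero default)
var_threshold = 0
low_var = [c for c in df.select_dtypes("number").columns
           if df[c].var() <= var_threshold]
df = df.drop(columns=low_var)
```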

Currently, OctaiPipe natively supports the following encoders and scalers from scikit-learn (a minimal sketch of the correspondence follows the list):

  • One-hot encoder

  • Ordinal encoder

  • MinMax scaler

  • Standard scaler

  • Robust scaler
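Each of these corresponds to a scikit-learn class. The lookup below, from the config's type values to classes, is an illustrative assumption rather than OctaiPipe's actual implementation:

```python
from sklearn.preprocessing import (
    MinMaxScaler, OneHotEncoder, OrdinalEncoder, RobustScaler, StandardScaler,
)

# Hypothetical mapping from the config's `type` field to scikit-learn classes
PREPROCESSOR_TYPES = {
    "onehot_encoder": OneHotEncoder,
    "ordinal_encoder": OrdinalEncoder,
    "minmax_scaler": MinMaxScaler,
    "standard_scaler": StandardScaler,
    "robust_scaler": RobustScaler,
}
```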

Config#

The following are example config files, respectively for running the step once and for running it periodically, together with descriptions of their parts.

Config example for running the step once#

```yaml
name: preprocessing

input_data_specs:
  datastore_type: influxdb
  query_type: dataframe # influx/dataframe/stream/stream_dataframe/csv
  query_template_path: ./configs/data/influx_query.txt
  query_values:
    start: "2020-05-20T13:30:00.000Z"
    stop: "2020-05-20T13:35:00.000Z"
    bucket: sensors-raw
    measurement: cat
    tags: {}
  data_converter: {}

output_data_specs:
  - datastore_type: influxdb
    settings:
      bucket: test-bucket-1
      measurement: testv1

run_specs:
  save_results: True
  # run_interval: 2s
  target_label: RUL
  label_type: "int"
  onnx_pred: False
  train_val_test_split:
    to_split: false
    split_ratio:
      training: 0.6
      validation: 0.2
      testing: 0.2

preprocessing_specs:
  steps: # full list of steps will be in technical documentation
    - interpolate_missing_data
    - remove_constant_columns
    - make_target
    - encode_categorical_columns
    - normalise
  degradation_model: linear # linear, pw_linear
  initial_RUL: 10
  var_threshold: 0
  preprocessors_specs:
    - type: onehot_encoder
      load_existing: False
      name: onehot_encoder_test0
      # version: "2.3"
      categorical_columns:
        - "cat_col1"
        - "cat_col2"
    - type: minmax_scaler
      load_existing: False
      name: scaler_test0
      # version: "1.1"
```
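Since the config is plain YAML, it can be sanity-checked before a run. A minimal sketch with PyYAML, mirroring two constraints described later on this page (the file path is hypothetical):

```python
import yaml

with open("./configs/preprocessing.yml") as fh:  # hypothetical path
    cfg = yaml.safe_load(fh)

# Split ratios must sum to 1.0 when splitting is enabled
split = cfg["run_specs"].get("train_val_test_split", {})
if split.get("to_split"):
    assert abs(sum(split["split_ratio"].values()) - 1.0) < 1e-9

# encode_categorical_columns requires at least one encoder to be specified
specs = cfg["preprocessing_specs"]
types = {p["type"] for p in specs.get("preprocessors_specs", [])}
if "encode_categorical_columns" in specs["steps"]:
    assert types & {"onehot_encoder", "ordinal_encoder"}, "an encoder is required"
```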

Config example for running the step periodically#

```yaml
name: preprocessing

input_data_specs:
  datastore_type: influxdb
  query_type: dataframe
  query_template_path: ./configs/data/influx_query_periodic.txt
  query_values:
    start: 2m
    bucket: live-metrics
    measurement: live-raw
    tags: {}
  data_converter: {}

output_data_specs:
  - datastore_type: influxdb
    settings:
      bucket: live-metrics
      measurement: live-processed

run_specs:
  save_results: True
  run_interval: 10s
  label_type: "int"
  onnx_pred: false

preprocessing_specs:
  steps: # full list of steps will be in technical documentation
    - interpolate_missing_data
    - remove_constant_columns
    # - make_target
    - encode_categorical_columns
    - normalise
  degradation_model: linear # linear, pw_linear
  initial_RUL: 10
  var_threshold: 0
  preprocessors_specs:
    - type: onehot_encoder
      load_existing: True
      name: onehot_encoder_test0
      version: "2.3"
      categorical_columns:
        - "cat-col1"
        - "cat-col2"
    - type: minmax_scaler
      load_existing: True
      name: scaler_test0
      version: "1.1"
```

Input and Output Data Specs#

input_data_specs and output_data_specs follow a standard format across all pipeline steps; see OctaiPipe Steps.

Run Specs#

```yaml
run_specs:
  save_results: True
  run_interval: 10s
```

run_specs provides high-level control of the preprocessing step run. Descriptions and options for the config items are given in the table below.

| level 1 | level 2 | level 3 | type/options | description |
| --- | --- | --- | --- | --- |
| save_results | | | bool | If False, the step runs without saving any of its outputs; use only for testing. |
| run_interval | | | str | If this key is present, the step runs periodically at the specified interval. The value is given in minutes (e.g. 2m) or seconds (e.g. 10s). |
| train_val_test_split | to_split | | bool | If True, the whole preprocessed dataset is split into three subsets according to the ratios below and saved with corresponding tags. |
| | split_ratio | training | float | Ratios defining the split into training, validation and test subsets; the tags can then be used to select the subset for the model training step. Note: the three floats must add up to 1.0. |
| | | validation | float | |
| | | testing | float | |
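For illustration, a sequential 0.6/0.2/0.2 split that tags each row could look like the sketch below. Whether OctaiPipe splits by position or at random is not specified here, so the ordered cut is an assumption:

```python
import pandas as pd

def split_by_ratio(df: pd.DataFrame,
                   training: float = 0.6,
                   validation: float = 0.2,
                   testing: float = 0.2) -> pd.DataFrame:
    """Tag rows as training/validation/testing by position (illustrative)."""
    assert abs(training + validation + testing - 1.0) < 1e-9
    n = len(df)
    cut1, cut2 = int(n * training), int(n * (training + validation))
    tags = (["training"] * cut1
            + ["validation"] * (cut2 - cut1)
            + ["testing"] * (n - cut2))
    return df.assign(split=tags)
```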

Preprocessing Specs#

```yaml
preprocessing_specs:
  steps:
    - interpolate_missing_data
    - encode_categorical_columns
    - normalise
    - remove_constant_columns
    - make_target
  degradation_model: linear
  initial_RUL: 100
  var_threshold: 0
  preprocessors_specs:
    - type: onehot_encoder # required with the `encode_categorical_columns` step
      load_existing: False
      name: onehot_encoder_test0
      # version: "2.3"
      categorical_columns:
        - "cat_col1"
        - "cat_col2"
    - type: minmax_scaler # required with the `normalise` step
      load_existing: False
      name: scaler_test0
      # version: "1.1"
```

preprocessing_specs provides control of the preprocessing steps to be performed.

steps#

The user specifies which of the steps listed above to run.

Note

Steps are executed in the order in which they appear in the steps list.

degradation_model for make_target#

Note

This section is only applicable for run-to-failure datasets

A database containing multiple run-to-failure datasets can be used to build supervised learning models that predict remaining useful life (RUL). Supervised learning requires a target output set containing the RUL at each timestep. The true RUL, however, cannot be known without a physical model of the system, so these datasets do not come with a target RUL output set and one needs to be constructed. A few common approaches are outlined below.

linear - Linear Degradation Model#

This approach follows the definition of RUL in the strictest sense, using the time remaining before failure (i.e. the number of observations remaining before the end of the dataset).

[Figure: linear degradation demonstration]
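For a single run-to-failure series, this target is simply a countdown of the remaining observations. A minimal sketch (the RUL column name matches the target_label used in the config above):

```python
import numpy as np
import pandas as pd

def make_linear_rul(df: pd.DataFrame) -> pd.DataFrame:
    """Linear degradation: RUL = observations remaining before failure."""
    out = df.copy()
    out["RUL"] = np.arange(len(out))[::-1]  # N-1, N-2, ..., 1, 0
    return out
```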

pw_linear - Piece-Wise Linear Degradation Model#

The above approach assumes that the system degrades linearly from the start. A more reasonable assumption is that the system only begins to degrade after a period of time. This approach was first proposed in Recurrent neural networks for remaining useful life estimation for use on the C-MAPSS dataset, using a piece-wise linear degradation model. The model sets an initial, maximum RUL value that the system holds before linear degradation begins. This gives a more realistic degradation profile and helps to prevent overestimation of RUL. It has been shown empirically to outperform the linear approach on the C-MAPSS dataset (Remaining useful life estimation in prognostics using deep convolution neural networks).

[Figure: piece-wise linear degradation demonstration]

[Figure: data with piece-wise linear degradation]

The plot above shows sensor 7 from the first engine unit in the C-MAPSS dataset. As can be seen visually, the piece-wise linear degradation model is a suitable fit.

initial_RUL for pw_linear Degradation Model#

Note

This section is only applicable for run-to-failure datasets

This is an integer value that defines the healthy useful lifetime of the system. That is, in the piece-wise linear plot above, initial_RUL is the horizontal segment set at a value of 50 time units.
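Putting the two pieces together, a sketch of the piece-wise linear target: the linear countdown is clipped at initial_RUL, producing the flat healthy segment followed by linear degradation (initial_rul below stands in for the config's initial_RUL):

```python
import numpy as np
import pandas as pd

def make_pw_linear_rul(df: pd.DataFrame, initial_rul: int = 100) -> pd.DataFrame:
    """Piece-wise linear degradation: RUL is held at initial_rul while the
    system is healthy, then falls linearly to 0 at failure."""
    out = df.copy()
    countdown = np.arange(len(out))[::-1]            # strict linear countdown
    out["RUL"] = np.minimum(countdown, initial_rul)  # flat cap, then linear
    return out
```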

preprocessors_specs#

```yaml
preprocessors_specs:
  - type: onehot_encoder
    load_existing: False
    name: onehot_encoder_test0
    # version: "2.3"
    categorical_columns:
      - "cat_col1"
      - "cat_col2"
  - type: minmax_scaler
    load_existing: False
    name: scaler_test0
    # version: "1.1"
```

The user specifies a list of preprocessor objects (encoders and scalers) to be used.

The type is one of the following:

  • onehot_encoder

  • ordinal_encoder

  • minmax_scaler

  • standard_scaler

  • robust_scaler

Note

All fitted encoders and scalers are saved and version-controlled so that the same encoders or scalers can be loaded to transform data during evaluation or inference.

If the encode_categorical_columns step is used, at least one encoder must be specified; likewise, if the normalise step is used, at least one scaler must be specified, or an error will be thrown. If load_existing is true, the user must specify the version of the preprocessor to be used; if false, a new preprocessor will be fitted, used to transform the data, and saved. When the preprocessing step is run periodically (typically in an inference pipeline), all specified preprocessors must have load_existing: true, since already-fitted preprocessors are needed to transform the incoming data; otherwise an error will be thrown.
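The load_existing flag corresponds to the familiar fit/transform split in scikit-learn. The sketch below illustrates the semantics with joblib persistence (the file name is hypothetical, and OctaiPipe's own versioned storage is not shown):

```python
import joblib
from sklearn.preprocessing import MinMaxScaler

X = [[0.0], [5.0], [10.0]]
load_existing = False  # mirrors the config flag

if load_existing:
    # Periodic/inference runs must reuse a previously fitted preprocessor
    scaler = joblib.load("scaler_test0_v1.1.joblib")  # hypothetical file
    X_out = scaler.transform(X)
else:
    # Training runs fit a new preprocessor, transform, and persist it
    scaler = MinMaxScaler()
    X_out = scaler.fit_transform(X)
    joblib.dump(scaler, "scaler_test0_v1.1.joblib")
```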

Note that to scale the target label, the user needs to specify the name of the target label under run_specs as explained in the following section. The same encoder or scaler that is applied to the features will be used to fit and transform the target label.

Run Specs#

```yaml
run_specs:
  save_results: True
  run_interval: 10s
  target_label: RUL
  onnx_pred: False
```

run_specs provides high-level control of the preprocessing step run. Descriptions and options for the config items are given in the table below.

| | type/options | description |
| --- | --- | --- |
| save_results | bool | If False, the step runs without saving any of its outputs; use only for testing. |
| run_interval | str | If this key is present, the step runs periodically at the specified interval. The value is given in minutes (e.g. 2m) or seconds (e.g. 10s). |
| target_label | str | The target label in the dataset, if any. If make_target is used, it must be either null (empty) or RUL. |
| onnx_pred | bool | Whether to use the ONNX file of a fitted preprocessor to transform data. If false, the joblib preprocessor file is used. |
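The difference between the two formats, sketched with onnxruntime and joblib (file names are hypothetical):

```python
import joblib
import numpy as np
import onnxruntime as ort

X = np.array([[0.0], [5.0], [10.0]], dtype=np.float32)
onnx_pred = True  # mirrors the config flag

if onnx_pred:
    # Run the ONNX export of the fitted preprocessor
    sess = ort.InferenceSession("scaler_test0.onnx")  # hypothetical file
    input_name = sess.get_inputs()[0].name
    X_out = sess.run(None, {input_name: X})[0]
else:
    # Use the pickled (joblib) preprocessor directly
    X_out = joblib.load("scaler_test0.joblib").transform(X)
```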