Preprocessing Step#

To run locally: preprocessing

Real-world data is messy: it is often created, processed, and stored by a mix of people and automated processes. As a result, a typical dataset contains missing fields, manual input errors, duplicate records, and other defects. Data preprocessing is an essential step in a data science pipeline: it transforms the data into a format that a statistical or machine learning model can use more easily and effectively.

Steps#

OctaiPipe’s Preprocessing step provides the following options to preprocess raw data:

  • interpolate_missing_data: Uses linear interpolation to impute missing numeric values in the data. Numeric value dtypes are converted to float by default.

  • remove_constant_columns: Removes columns with variance below var_threshold, and categorical columns with only a single label. If var_threshold is not set, it defaults to zero. To be used in a training pipeline only.

  • make_target: Creates a Remaining Useful Life (RUL) target based on a linear or piece-wise linear degradation model (details below).

  • encode_categorical_columns: Encodes a specified set of categorical columns.

  • normalise: Scales a specified set of columns to a common scale.
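To make the first two steps concrete, here is a minimal pandas sketch of what interpolate_missing_data and remove_constant_columns amount to. This is illustrative only, not OctaiPipe's implementation, and the column names are hypothetical.

# Illustrative sketch (not OctaiPipe's code) of the first two steps.
import pandas as pd

df = pd.DataFrame({
    "temp": [20.1, None, 20.5, 20.7],   # missing value to impute
    "const": [1.0, 1.0, 1.0, 1.0],      # zero-variance column
})

# interpolate_missing_data: linear interpolation of numeric columns,
# with numeric dtypes converted to float.
numeric = df.select_dtypes("number").astype(float)
df[numeric.columns] = numeric.interpolate(method="linear")

# remove_constant_columns: drop numeric columns whose variance does not
# exceed the threshold (var_threshold defaults to zero).
var_threshold = 0.0
low_var = [c for c in numeric.columns if df[c].var() <= var_threshold]
df = df.drop(columns=low_var)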

Currently, OctaiPipe natively supports the following encoders and scalers from scikit-learn:

  • One-hot encoder

  • Ordinal encoder

  • MinMax scaler

  • Standard scaler

  • Robust scaler
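For reference, these map onto the following scikit-learn classes. The snippet below is a standalone sketch of using them directly; OctaiPipe wraps them for you, and the example data is made up.

# The scikit-learn classes behind each supported preprocessor type.
from sklearn.preprocessing import (
    OneHotEncoder,    # type: onehot_encoder
    OrdinalEncoder,   # type: ordinal_encoder
    MinMaxScaler,     # type: minmax_scaler
    StandardScaler,   # type: standard_scaler
    RobustScaler,     # type: robust_scaler
)

# Fit-and-transform examples on toy data.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform([["cat"], ["dog"], ["cat"]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform([[1.0], [5.0], [10.0]])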

Config#

The following are example config files for running the step once and for running it periodically, together with descriptions of their parts.

Config example for running the step once#

name: preprocessing

input_data_specs:
  default:
  - datastore_type: influxdb
    settings:
      query_type: dataframe # influx/dataframe/stream/stream_dataframe/csv
      query_template_path: ./configs/data/influx_query.txt
      query_config:
        start: "2020-05-20T13:30:00.000Z"
        stop: "2020-05-20T13:35:00.000Z"
        bucket: sensors-raw
        measurement: cat
        tags: {}

output_data_specs:
  default:
  - datastore_type: influxdb
    settings:
      bucket: test-bucket-1
      measurement: testv1

run_specs:
  save_results: True
  # run_interval: 2s
  target_label: RUL
  label_type: "int"
  onnx_pred: False
  train_val_test_split:
    to_split: false
    split_ratio:
      training: 0.6
      validation: 0.2
      testing: 0.2

preprocessing_specs:
  steps: # full list of steps will be in technical documentation
    - interpolate_missing_data
    - remove_constant_columns
    - make_target
    - encode_categorical_columns
    - normalise
  degradation_model: linear # linear, pw_linear
  initial_RUL: 10
  var_threshold: 0
  preprocessors_specs:
    - type: onehot_encoder
      load_existing: False
      name: onehot_encoder_test0
      # version: "2.3"
      categorical_columns:
        - "cat_col1"
        - "cat_col2"
    - type: minmax_scaler
      load_existing: False
      name: scaler_test0
      # version: "1.1"

Config example for running the step periodically#

name: preprocessing

input_data_specs:
  default:
  - datastore_type: influxdb
    settings:
      query_type: dataframe
      query_template_path: ./configs/data/influx_query_periodic.txt
      query_config:
        start: 2m
        bucket: live-metrics
        measurement: live-raw
        tags: {}

output_data_specs:
  default:
  - datastore_type: influxdb
    settings:
      bucket: live-metrics
      measurement: live-processed

run_specs:
  save_results: True
  run_interval: 10s
  label_type: "int"
  onnx_pred: false

preprocessing_specs:
  steps: # full list of steps will be in technical documentation
    - interpolate_missing_data
    - remove_constant_columns
    # - make_target
    - encode_categorical_columns
    - normalise
  degradation_model: linear # linear, pw_linear
  initial_RUL: 10
  var_threshold: 0
  preprocessors_specs:
    - type: onehot_encoder
      load_existing: True
      name: onehot_encoder_test0
      version: "2.3"
      categorical_columns:
        - "cat-col1"
        - "cat-col2"
    - type: minmax_scaler
      load_existing: True
      name: scaler_test0
      version: "1.1"

Input and Output Data Specs#

input_data_specs and output_data_specs follow a standard format for all the pipeline steps; see Octaipipe Steps.

Run Specs#

run_specs:
  save_results: True
  run_interval: 10s

run_specs provides high-level control of the preprocessing step run. Descriptions and options for the config items are given below.

  • save_results (bool): If False, the step runs without saving any of its outputs; use only for testing.

  • run_interval (str): If this key is present, the step is run periodically at the specified interval. The value is given in minutes (e.g. 2m) or seconds (e.g. 10s).

  • train_val_test_split:

      • to_split (bool): If True, the whole preprocessed dataset is split into three sections according to the ratios below and saved with corresponding tags.

      • split_ratio:

          • training (float), validation (float), testing (float): Ratios defining the data split into training, validation and test sets; the tags can then be used to select subsets in the model training step (see the sketch after this list). Note! The three floats must add up to 1.0.
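The following is a minimal sketch of how such a ratio-based split with tags can be done on a time-ordered pandas DataFrame. The helper name and the tag values are illustrative assumptions, not OctaiPipe's internals.

# Hypothetical helper: tag each row of a time-ordered DataFrame as
# training, validation, or testing according to the split ratios.
import pandas as pd

def split_with_tags(df: pd.DataFrame, training: float = 0.6,
                    validation: float = 0.2, testing: float = 0.2) -> pd.DataFrame:
    assert abs(training + validation + testing - 1.0) < 1e-9, "ratios must sum to 1.0"
    n = len(df)
    n_train = int(n * training)
    n_val = int(n * validation)
    tags = ["training"] * n_train + ["validation"] * n_val
    tags += ["testing"] * (n - len(tags))  # remainder goes to testing
    return df.assign(split=tags)           # tag column used to select subsets later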

Preprocessing Specs#

preprocessing_specs:
  steps:
    - interpolate_missing_data
    - encode_categorical_columns
    - normalise
    - remove_constant_columns
    - make_target
  degradation_model: linear
  initial_RUL: 100
  var_threshold: 0
  preprocessors_specs:
    - type: onehot_encoder # Required with `encode_categorical_columns` step
      load_existing: False
      name: onehot_encoder_test0
      # version: "2.3"
      categorical_columns:
        - "cat_col1"
        - "cat_col2"
    - type: minmax_scaler  # Required with `normalise` step
      load_existing: False
      name: scaler_test0
      # version: "1.1"

preprocessing_specs provides control of the preprocessing steps to be performed.

steps#

The user specifies the steps to be used, out of those listed above.

Note

Steps are executed in the order in which they appear in the steps list.

degradation_model for make_target#

Note

This section is only applicable for run-to-failure datasets

A database containing multiple run-to-failure datasets can be used to build supervised learning models that predict remaining useful life (RUL). Supervised learning requires a target output set containing the RUL at each timestep in the data. The true RUL, however, cannot be known without a physical model of the system, so these datasets do not come with a target RUL output set and one needs to be built. A few common approaches are outlined below.

linear - Linear Degradation Model#

This approach follows the definition of RUL in the strictest sense, using the time remaining before failure (i.e. the remaining number of observations before the end of the dataset).

[Figure: Linear degradation demonstration]

pw_linear - Piece-Wise Linear Degradation Model#

The above approach assumes that the system degrades linearly from the beginning. A more reasonable assumption is that the system only starts to degrade after a period of time. This approach was first proposed in Recurrent neural networks for remaining useful life estimation for use on the C-MAPSS dataset, using a piece-wise linear degradation model. The model sets an initial, maximum RUL value that the system holds before linear degradation begins. This gives a more realistic degradation profile and helps prevent overestimation of RUL. It has been shown empirically to outperform the linear approach on the C-MAPSS dataset (Remaining useful life estimation in prognostics using deep convolution neural networks).

[Figures: Piece-wise linear degradation demonstration; Data with piece-wise linear degradation]

The plot above shows sensor 7 from the first engine unit in the C-MAPSS dataset. As can be seen visually, the piece-wise linear degradation model is a suitable fit.

initial_RUL for pw_linear Degradation Model#

Note

This section is only applicable for run-to-failure datasets

This is an integer value that defines the healthy useful lifetime of a system. That is, in the piece-wise linear plot above, the initial_RUL is the horizontal line set at a value of 50 time units.
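As a concrete illustration, the sketch below builds an RUL target for a single run-to-failure series under both degradation models. The function is hypothetical: it mirrors the behaviour described above but is not OctaiPipe's make_target implementation.

# Sketch of building an RUL target for one run-to-failure series.
import numpy as np

def make_rul_target(n_steps: int, degradation_model: str = "linear",
                    initial_RUL: int = 100) -> np.ndarray:
    # linear: RUL counts down from n_steps - 1 to 0, i.e. the number of
    # observations remaining before the end of the dataset.
    rul = np.arange(n_steps - 1, -1, -1)
    if degradation_model == "pw_linear":
        # pw_linear: cap RUL at initial_RUL during the healthy period,
        # then degrade linearly to zero.
        rul = np.minimum(rul, initial_RUL)
    return rul

make_rul_target(8, "pw_linear", initial_RUL=5)
# -> array([5, 5, 5, 4, 3, 2, 1, 0])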

preprocessors_specs#

preprocessors_specs:
  - type: onehot_encoder
    load_existing: False
    name: onehot_encoder_test0
    # version: "2.3"
    categorical_columns:
      - "cat_col1"
      - "cat_col2"
  - type: minmax_scaler
    load_existing: False
    name: scaler_test0
    # version: "1.1"

The user specifies a list of preprocessor objects (encoders and scalers) to be used.

The type is one of the following:

  • onehot_encoder

  • ordinal_encoder

  • minmax_scaler

  • standard_scaler

  • robust_scaler

Note

All fitted encoders and scalers are saved and version-controlled so that the same encoders or scalers can be loaded to transform data during evaluation or inference.

If the encode_categorical_columns step is used, at least one encoder must be specified; likewise, if the normalise step is used, at least one scaler must be specified, or an error is thrown. If load_existing is true, the user must specify the version of the preprocessor to be used; if false, a new preprocessor is fitted, used to transform the data, and saved. When the preprocessing step is run periodically (typically in an inference pipeline), all the specified preprocessors must have load_existing: true, since already fitted preprocessors are needed to transform the data; otherwise an error is thrown.
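The sketch below illustrates this fit-or-load logic for a single scaler, using joblib for persistence. The file path and helper are hypothetical; OctaiPipe stores and versions preprocessors in its own registry.

# Hypothetical sketch of the behaviour controlled by load_existing.
import joblib
from sklearn.preprocessing import MinMaxScaler

def get_scaler(load_existing: bool, path: str, X=None):
    if load_existing:
        # periodic/inference runs must reuse an already fitted preprocessor
        return joblib.load(path)
    scaler = MinMaxScaler().fit(X)  # fit a new preprocessor on the data
    joblib.dump(scaler, path)       # save it so later runs can load it
    return scaler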

Note that to scale the target label, the user needs to specify the name of the target label under run_specs, as explained in the following section. The same encoder or scaler that is applied to the features is used to fit and transform the target label.

Run Specs#

run_specs:
  save_results: True
  run_interval: 10s
  target_label: RUL
  onnx_pred: False

run_specs provides high-level control of the preprocessing step run. Descriptions and options for the config items are given below.

  • save_results (bool): If False, the step runs without saving any of its outputs; use only for testing.

  • run_interval (str): If this key is present, the step is run periodically at the specified interval. The value is given in minutes (e.g. 2m) or seconds (e.g. 10s).

  • target_label (str): The target label in the dataset, if any. If make_target is used, it must be either null (empty) or RUL.

  • onnx_pred (bool): Whether to use the ONNX file of a fitted preprocessor to transform data; if False, the joblib preprocessor file is used (see the sketch after this list).
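As an illustration of what onnx_pred: True amounts to conceptually, the following sketch transforms data with the ONNX file of a fitted preprocessor via onnxruntime. The file name is hypothetical, and the details of how OctaiPipe locates and loads the file may differ.

# Hypothetical sketch: transform data with an exported ONNX preprocessor.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("scaler_test0.onnx",
                            providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
X = np.array([[1.0], [5.0], [10.0]], dtype=np.float32)
(transformed,) = sess.run(None, {input_name: X})  # scaled output array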