Preprocessing Step#

To run locally: preprocessing

Real-world data is messy: it is often created, processed, and stored by a mix of people and automated processes. As a result, a typical dataset contains missing fields, manual input errors, duplicate records, and other defects. Data preprocessing is an essential step in a data science pipeline: it transforms the data into a format that a statistical or machine learning model can use more easily and effectively.

Steps#

OctaiPipe’s Preprocessing step provides the following options to preprocess raw data:

  • interpolate_missing_data: Uses linear interpolation to impute missing numeric values in the data. Numeric value dtypes are converted to float by default.

  • remove_constant_columns: Removes columns with variance below var_threshold, and categorical columns with only a single label. If var_threshold is not set, it defaults to zero. To be used in a training pipeline only.

  • make_target: Creates a Remaining Useful Life (RUL) target based on a linear or piece-wise linear degradation model (details below).

  • encode_categorical_columns: Encodes a specified set of categorical columns.

  • normalise: Scales a specified set of columns to a common scale.
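To make the first two steps concrete, here is a minimal pandas sketch of what interpolate_missing_data and remove_constant_columns amount to. This is illustrative only, not OctaiPipe's implementation, and the column names are hypothetical.

# Illustrative sketch (not OctaiPipe's code) of the first two steps.
import pandas as pd

df = pd.DataFrame({
    "temp": [20.1, None, 20.5, 20.7],   # missing value to impute
    "const": [1.0, 1.0, 1.0, 1.0],      # zero-variance column
})

# interpolate_missing_data: linear interpolation of numeric columns,
# with numeric dtypes converted to float.
numeric = df.select_dtypes("number").astype(float)
df[numeric.columns] = numeric.interpolate(method="linear")

# remove_constant_columns: drop numeric columns whose variance does not
# exceed the threshold (var_threshold defaults to zero).
var_threshold = 0.0
low_var = [c for c in numeric.columns if df[c].var() <= var_threshold]
df = df.drop(columns=low_var)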

Currently, OctaiPipe natively supports the following encoders and scalers from scikit-learn:

  • One-hot encoder

  • Ordinal encoder

  • MinMax scaler

  • Standard scaler

  • Robust scaler
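For reference, these map onto the following scikit-learn classes. The snippet below is a standalone sketch of using them directly; OctaiPipe wraps them for you, and the example data is made up.

# The scikit-learn classes behind each supported preprocessor type.
from sklearn.preprocessing import (
    OneHotEncoder,    # type: onehot_encoder
    OrdinalEncoder,   # type: ordinal_encoder
    MinMaxScaler,     # type: minmax_scaler
    StandardScaler,   # type: standard_scaler
    RobustScaler,     # type: robust_scaler
)

# Fit-and-transform examples on toy data.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform([["cat"], ["dog"], ["cat"]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform([[1.0], [5.0], [10.0]])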

Config#

The following are example config files for running the step once and for running it periodically, together with descriptions of their parts.

Config example for running the step once#

name: preprocessing

input_data_specs:
  default:
  - datastore_type: influxdb
    settings:
      query_type: dataframe # influx/dataframe/stream/stream_dataframe/csv
      query_template_path: ./configs/data/influx_query.txt
      query_config:
        start: "2020-05-20T13:30:00.000Z"
        stop: "2020-05-20T13:35:00.000Z"
        bucket: sensors-raw
        measurement: cat
        tags: {}

output_data_specs:
  default:
  - datastore_type: influxdb
    settings:
      bucket: test-bucket-1
      measurement: testv1

run_specs:
  save_results: True
  # run_interval: 2s
  target_label: RUL
  label_type: "int"
  onnx_pred: False
  train_val_test_split:
    to_split: false
    split_ratio:
      training: 0.6
      validation: 0.2
      testing: 0.2

preprocessing_specs:
  steps: # full list of steps will be in technical documentation
    - interpolate_missing_data
    - remove_constant_columns
    - make_target
    - encode_categorical_columns
    - normalise
  degradation_model: linear # linear, pw_linear
  initial_RUL: 10
  var_threshold: 0
  preprocessors_specs:
    - type: onehot_encoder
      load_existing: False
      name: onehot_encoder_test0
      # version: "2.3"
      categorical_columns:
        - "cat_col1"
        - "cat_col2"
    - type: minmax_scaler
      load_existing: False
      name: scaler_test0
      # version: "1.1"

Config example for running the step periodically#

name: preprocessing

input_data_specs:
  default:
  - datastore_type: influxdb
    settings:
      query_type: dataframe
      query_template_path: ./configs/data/influx_query_periodic.txt
      query_config:
        start: 2m
        bucket: live-metrics
        measurement: live-raw
        tags: {}

output_data_specs:
  default:
  - datastore_type: influxdb
    settings:
      bucket: live-metrics
      measurement: live-processed

run_specs:
  save_results: True
  run_interval: 10s
  label_type: "int"
  onnx_pred: false

preprocessing_specs:
  steps: # full list of steps will be in technical documentation
    - interpolate_missing_data
    - remove_constant_columns
    # - make_target
    - encode_categorical_columns
    - normalise
  degradation_model: linear # linear, pw_linear
  initial_RUL: 10
  var_threshold: 0
  preprocessors_specs:
    - type: onehot_encoder
      load_existing: True
      name: onehot_encoder_test0
      version: "2.3"
      categorical_columns:
        - "cat-col1"
        - "cat-col2"
    - type: minmax_scaler
      load_existing: True
      name: scaler_test0
      version: "1.1"

Input and Output Data Specs#

input_data_specs and output_data_specs follow a standard format for all the pipeline steps; see Octaipipe Steps.

Run Specs#

run_specs:
  save_results: True
  run_interval: 10s

run_specs provides high-level control of the preprocessing step run. Descriptions and options for the config items are given below.

  • save_results (bool): If False, the step runs without saving any of its outputs; use only for testing.

  • run_interval (str): If this key is present, the step is run periodically at the specified interval. The value is given in minutes (e.g. 2m) or seconds (e.g. 10s).

  • train_val_test_split:

      • to_split (bool): If True, the whole preprocessed dataset is split into three sections according to the ratios below and saved with corresponding tags.

      • split_ratio:

          • training (float), validation (float), testing (float): Ratios defining the data split into training, validation and test sets; the tags can then be used to select subsets in the model training step (see the sketch after this list). Note! The three floats must add up to 1.0.
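The following is a minimal sketch of how such a ratio-based split with tags can be done on a time-ordered pandas DataFrame. The helper name and the tag values are illustrative assumptions, not OctaiPipe's internals.

# Hypothetical helper: tag each row of a time-ordered DataFrame as
# training, validation, or testing according to the split ratios.
import pandas as pd

def split_with_tags(df: pd.DataFrame, training: float = 0.6,
                    validation: float = 0.2, testing: float = 0.2) -> pd.DataFrame:
    assert abs(training + validation + testing - 1.0) < 1e-9, "ratios must sum to 1.0"
    n = len(df)
    n_train = int(n * training)
    n_val = int(n * validation)
    tags = ["training"] * n_train + ["validation"] * n_val
    tags += ["testing"] * (n - len(tags))  # remainder goes to testing
    return df.assign(split=tags)           # tag column used to select subsets later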

Preprocessing Specs#

preprocessing_specs:
  steps:
    - interpolate_missing_data
    - encode_categorical_columns
    - normalise
    - remove_constant_columns
    - make_target
  degradation_model: linear
  initial_RUL: 100
  var_threshold: 0
  preprocessors_specs:
    - type: onehot_encoder # Required with `encode_categorical_columns` step
      load_existing: False
      name: onehot_encoder_test0
      # version: "2.3"
      categorical_columns:
        - "cat_col1"
        - "cat_col2"
    - type: minmax_scaler  # Required with `normalise` step
      load_existing: False
      name: scaler_test0
      # version: "1.1"

preprocessing_specs provides control of the preprocessing steps to be performed.

steps#

The user specifies the steps to be used, out of those listed above.

Note

Steps are executed in the order in which they appear in the steps list.

degradation_model for make_target#

Note

This section is only applicable for run-to-failure datasets

A database containing multiple run-to-failure datasets can be used to build supervised learning models that predict remaining useful life (RUL). Supervised learning requires a target output set containing the RUL at each timestep in the data. The true RUL, however, cannot be known without a physical model of the system, so these datasets do not come with a target RUL output set and one needs to be built. A few common approaches are outlined below.

linear - Linear Degradation Model#

This approach follows the definition of RUL in the strictest sense, using the time remaining before failure (i.e. the remaining number of observations before the end of the dataset).

[Figure: Linear degradation demonstration]

pw_linear - Piece-Wise Linear Degradation Model#

The above approach assumes that the system degrades linearly from the beginning. A more reasonable assumption is that the system only starts to degrade after a period of time. This approach was first proposed in Recurrent neural networks for remaining useful life estimation for use on the C-MAPSS dataset, using a piece-wise linear degradation model. The model sets an initial, maximum RUL value that the system holds before linear degradation begins. This gives a more realistic degradation profile and helps prevent overestimation of RUL. It has been shown empirically to outperform the linear approach on the C-MAPSS dataset (Remaining useful life estimation in prognostics using deep convolution neural networks).

[Figures: Piece-wise linear degradation demonstration; Data with piece-wise linear degradation]

The plot above shows sensor 7 from the first engine unit in the C-MAPSS dataset. As can be seen visually, the piece-wise linear degradation model is a suitable fit.

initial_RUL for pw_linear Degradation Model#

Note

This section is only applicable for run-to-failure datasets

This is an integer value that defines the healthy useful lifetime of a system. That is, in the piece-wise linear plot above, the initial_RUL is the horizontal line set at a value of 50 time units.
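As a concrete illustration, the sketch below builds an RUL target for a single run-to-failure series under both degradation models. The function is hypothetical: it mirrors the behaviour described above but is not OctaiPipe's make_target implementation.

# Sketch of building an RUL target for one run-to-failure series.
import numpy as np

def make_rul_target(n_steps: int, degradation_model: str = "linear",
                    initial_RUL: int = 100) -> np.ndarray:
    # linear: RUL counts down from n_steps - 1 to 0, i.e. the number of
    # observations remaining before the end of the dataset.
    rul = np.arange(n_steps - 1, -1, -1)
    if degradation_model == "pw_linear":
        # pw_linear: cap RUL at initial_RUL during the healthy period,
        # then degrade linearly to zero.
        rul = np.minimum(rul, initial_RUL)
    return rul

make_rul_target(8, "pw_linear", initial_RUL=5)
# -> array([5, 5, 5, 4, 3, 2, 1, 0])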

preprocessors_specs#

preprocessors_specs:
  - type: onehot_encoder
    load_existing: False
    name: onehot_encoder_test0
    # version: "2.3"
    categorical_columns:
      - "cat_col1"
      - "cat_col2"
  - type: minmax_scaler
    load_existing: False
    name: scaler_test0
    # version: "1.1"

The user specifies a list of preprocessor objects (encoders and scalers) to be used.

The type is one of the following:

  • onehot_encoder

  • ordinal_encoder

  • minmax_scaler

  • standard_scaler

  • robust_scaler

Note

All fitted encoders and scalers are saved and version-controlled so that the same encoders or scalers can be loaded to transform data during evaluation or inference.

If the encode_categorical_columns step is used, at least one encoder must be specified; likewise, if the normalise step is used, at least one scaler must be specified, or an error is thrown. If load_existing is true, the user must specify the version of the preprocessor to be used; if false, a new preprocessor is fitted, used to transform the data, and saved. When the preprocessing step is run periodically (typically in an inference pipeline), all the specified preprocessors must have load_existing: true, since already fitted preprocessors are needed to transform the data; otherwise an error is thrown.
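The sketch below illustrates this fit-or-load logic for a single scaler, using joblib for persistence. The file path and helper are hypothetical; OctaiPipe stores and versions preprocessors in its own registry.

# Hypothetical sketch of the behaviour controlled by load_existing.
import joblib
from sklearn.preprocessing import MinMaxScaler

def get_scaler(load_existing: bool, path: str, X=None):
    if load_existing:
        # periodic/inference runs must reuse an already fitted preprocessor
        return joblib.load(path)
    scaler = MinMaxScaler().fit(X)  # fit a new preprocessor on the data
    joblib.dump(scaler, path)       # save it so later runs can load it
    return scaler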

Note that to scale the target label, the user needs to specify the name of the target label under run_specs, as explained in the following section. The same encoder or scaler that is applied to the features is used to fit and transform the target label.

Run Specs#

run_specs:
  save_results: True
  run_interval: 10s
  target_label: RUL
  onnx_pred: False

run_specs provides high-level control of the preprocessing step run. Descriptions and options for the config items are given below.

  • save_results (bool): If False, the step runs without saving any of its outputs; use only for testing.

  • run_interval (str): If this key is present, the step is run periodically at the specified interval. The value is given in minutes (e.g. 2m) or seconds (e.g. 10s).

  • target_label (str): The target label in the dataset, if any. If make_target is used, it must be either null (empty) or RUL.

  • onnx_pred (bool): Whether to use the ONNX file of a fitted preprocessor to transform data; if False, the joblib preprocessor file is used (see the sketch after this list).
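As an illustration of what onnx_pred: True amounts to conceptually, the following sketch transforms data with the ONNX file of a fitted preprocessor via onnxruntime. The file name is hypothetical, and the details of how OctaiPipe locates and loads the file may differ.

# Hypothetical sketch: transform data with an exported ONNX preprocessor.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("scaler_test0.onnx",
                            providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
X = np.array([[1.0], [5.0], [10.0]], dtype=np.float32)
(transformed,) = sess.run(None, {input_name: X})  # scaled output array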