Data Drift Monitoring Step

To run locally: data_drift

A machine learning model is trained on data gathered over a certain period of time. Its predictive performance relies on the assumption that the data fed to the model at prediction time has the same statistical properties as the training set. If those properties change over time in unforeseen ways, model predictions can become less accurate.

OctaiPipe’s DataDriftMonitoring step calculates metrics related to data drift. Given the model id of a trained ML model, the step first looks up the model's training configuration and takes the original training data as the benchmark. It then periodically reads the same data, but from different time ranges, from the database, and performs the two-sample Kolmogorov-Smirnov (KS) test to compare the distribution of the original training data with that of the new data. The outputs of the step are the KS statistic and the p-value.

An example config file is given as follows.

Step config example

name: data_drift

model_id: aaa56737

output_data_specs:
  default:
  - datastore_type: influxdb
    settings:
      bucket: sensors-out
      measurement: drift
  
run_specs:
  query_step: 2m
  run_interval: 5s

model_id: the id of the trained model whose original training data is used as the drift benchmark.

Input and Output Data Specs

input_data_specs and output_data_specs follow a standard format for all the pipeline steps; see Octaipipe Steps.

Run Specs

run_specs provides high-level control of the data drift monitoring step run.

query_step: time interval in minutes, e.g. 2m. At each run, data from the preceding query_step interval is queried and compared against the original training data for the data drift calculation.
run_interval: time interval at which the data drift calculation is performed, given either in minutes, e.g. 2m, or in seconds, e.g. 5s.
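The relationship between the two settings can be sketched as follows. This is an illustration under stated assumptions, not OctaiPipe's own parser or scheduler: `parse_interval` and `query_window` are hypothetical helpers, and the sketch assumes values are a whole number followed by `m` or `s`.

```python
from datetime import datetime, timedelta

# Assumed unit suffixes, based on the examples "2m" and "5s" above.
_UNITS = {"s": "seconds", "m": "minutes"}


def parse_interval(text):
    """Convert a string such as '2m' or '5s' to a timedelta (hypothetical helper)."""
    value, unit = text[:-1], text[-1:]
    if unit not in _UNITS or not value.isdigit():
        raise ValueError(f"unsupported interval: {text!r}")
    return timedelta(**{_UNITS[unit]: int(value)})


def query_window(now, query_step):
    """Return the (start, end) time range covered by one drift calculation."""
    return now - parse_interval(query_step), now


# With query_step=2m and run_interval=5s, each run looks back two minutes
# and the next run starts five seconds later, so consecutive windows overlap.
run_interval = parse_interval("5s")
first_run = datetime(2024, 1, 1, 12, 0, 0)
for i in range(3):
    run_time = first_run + i * run_interval
    start, end = query_window(run_time, "2m")
    print(f"run at {run_time.time()}: query {start.time()} to {end.time()}")
```

With the values from the example config, each calculation reuses most of the previous window, so successive results are highly correlated. Setting run_interval equal to query_step would give non-overlapping windows.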