OctaiPipe Models#

OctaiPipe provides a range of machine learning algorithms for classification, regression, and clustering problems. It implements the most common and popular models from scikit-learn, as well as models from XGBoost, statsmodels, and LightGBM. Beyond these, OctaiPipe supports Remaining Useful Life prediction for devices in an IoT environment and Anomaly Detection in time series sensor data. You can also build your own custom model, which we explain later in this article.

Below we describe the models available in OctaiPipe:

Supported Models#

Linear Regression#

In OctaiPipe configs: ‘ex_lin_reg’ This class implements ordinary least squares linear regression. It is built on top of sklearn’s LinearRegression class. More details about the implementation can be found at this link.

Ridge Regression#

In OctaiPipe configs: ‘ridge_reg’ Ridge regression class. In ridge regression, a penalty term is added to the training cost function to penalise large weight values, which cause overfitting. More details about the implementation can be found at this link.

Elastic Net Regressor#

In OctaiPipe configs: ‘enet_reg’ The Elastic Net Regression class builds on the sklearn implementation found at this link. Elastic Net is a linear regression with combined l1 and l2 regularizers.
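
To make the regularizer mix concrete, here is a minimal sketch using sklearn’s ElasticNet directly, the estimator the OctaiPipe class builds on; the alpha and l1_ratio values and data are illustrative only:

import numpy as np
from sklearn.linear_model import ElasticNet

X = np.random.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * np.random.rand(100)

# l1_ratio mixes the penalties: 1.0 is pure l1 (lasso), 0.0 is pure l2 (ridge)
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_)  # the l1 component can shrink some coefficients to exactly zero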

Random Forest Regression#

In OctaiPipe configs: ‘random_forest_reg’ Random forest regression class. A random forest is an ensemble of decision trees, whereby the prediction is taken as the average of the individual trees’ predictions. More details about the implementation can be found at this link.

Tweedie Regressor#

In OctaiPipe configs: ‘tweedie_reg’ The Tweedie Regression class builds on the sklearn implementation found at this link. It is a Generalized Linear Model in which the assumed distribution of the target is set by the power argument.
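
As a hedged illustration of the power argument, the sketch below uses sklearn’s TweedieRegressor directly; the parameter values and data are illustrative:

import numpy as np
from sklearn.linear_model import TweedieRegressor

X = np.random.rand(100, 2)
y = np.exp(X @ np.array([0.5, 1.0]))  # strictly positive targets

# power selects the assumed target distribution:
# 0 = Normal, 1 = Poisson, (1, 2) = compound Poisson-Gamma, 2 = Gamma
model = TweedieRegressor(power=1.5, alpha=0.5)
model.fit(X, y)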

Quantile Regressor#

In OctaiPipe configs: ‘quantile_reg’ The Quantile Regression class builds on the sklearn implementation found at this link. Quantile regression minimises a mean-absolute-deviation (pinball) loss to predict a quantile (commonly the median) of the output, as opposed to the mean output of linear regression. This can be useful for obtaining confidence intervals for a continuous prediction, for example by predicting the 5th, 50th, and 95th quantiles.
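
For example, the following sketch fits one sklearn QuantileRegressor per quantile so that the 5th and 95th predictions bracket the median; the data and settings are illustrative:

import numpy as np
from sklearn.linear_model import QuantileRegressor

X = np.random.rand(200, 1)
y = 2.0 * X.ravel() + np.random.normal(scale=0.3, size=200)

# One model per quantile: the 5th and 95th together give a ~90% interval
models = {q: QuantileRegressor(quantile=q, alpha=0).fit(X, y)
          for q in (0.05, 0.5, 0.95)}
lower, median, upper = (models[q].predict(X[:5]) for q in (0.05, 0.5, 0.95))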

Support Vector Regressor#

In OctaiPipe configs: ‘svr’ The Support Vector Regression class builds on the sklearn implementation found at this link. The support vector regressor is based on Support Vector Machines. This implementation works best for datasets with fewer than 10,000 samples; for larger datasets, consider SGD regression.

Stochastic Gradient Descent Regressor#

In OctaiPipe configs: ‘sgd_reg’ The SGD Regression class builds on the sklearn implementation found at this link. The model is fitted by stochastic gradient descent: the weights are updated stepwise to find a (local) minimum of the loss.

Bayesian Ridge Regressor#

In OctaiPipe configs: ‘bayes_ridge_reg’ The Bayesian Ridge Regression class builds on the sklearn implementation found at this link. Bayesian Ridge regression is a conditional model in which the mean of the output variable is described by the inputs, and the model aims to find the posterior distribution of the regression coefficients.

ARD Regressor#

In OctaiPipe configs: ‘ard_reg’ The Automatic Relevance Determination Regression class builds on the sklearn implementation found at this link. The ARD regressor is a Bayesian Ridge regression with an ARD prior, and it is often stronger than Bayesian Ridge Regression out of the box.

Gradient Boosting Regressor#

In OctaiPipe configs: ‘gbm_reg’ The Gradient Boosting Regression class builds on the sklearn implementation found at this link. Gradient boosting regression is a tree-based ensemble that improves the fit iteratively: at each stage, a new tree is fitted to the residual errors of the current ensemble and added to it with a shrinkage (learning-rate) penalty.

Extreme Gradient Boosting Regressor#

In OctaiPipe configs: ‘xgb_reg’ The XGB Regression class builds on the sklearn implementation found at this link. XGBoost is an optimized gradient-boosting library designed to be efficient and flexible.

Skopt for LightGBM#

In OctaiPipe configs: ‘skopt_lgbm_reg’ Tuning the hyper-parameters of a machine learning model is often done by exhaustively exploring the space of hyper-parameter configurations for an estimator, for example with GridSearch, which can be very time-consuming. scikit-optimize instead performs sequential model-based optimization using a Bayesian approach, which we use to tune our LightGBM model. Refer to this link for more details.
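
As a hedged sketch of this approach, the example below runs scikit-optimize’s BayesSearchCV over an LGBMRegressor; the search space, iteration count, and data are illustrative, not OctaiPipe’s internal defaults:

import numpy as np
from skopt import BayesSearchCV
from lightgbm import LGBMRegressor

X_train = np.random.rand(200, 5)
y_train = np.random.rand(200)

search = BayesSearchCV(
    estimator=LGBMRegressor(),
    search_spaces={
        "num_leaves": (15, 127),                      # integer range
        "learning_rate": (1e-3, 0.3, "log-uniform"),  # continuous, log scale
        "n_estimators": (50, 500),
    },
    n_iter=25,  # sequential, model-guided evaluations instead of a full grid
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)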

LightGBM Regressor#

In OctaiPipe configs: ‘lgb_reg_sk’ Refer to this link for more details about the implemented model.

Logistic Regression#

In OctaiPipe configs: ‘log_reg’ Logistic regression is a linear model whose output is passed through a logistic (sigmoid) function, bounding it between 0 and 1. For the sklearn implementation, follow this link.

k-Nearest Neighbors Classification#

In OctaiPipe configs: ‘knn_class’ k-Nearest Neighbors, or kNN, is an algorithm that classifies a data point based on the majority class among its k nearest neighbors. More details on the sklearn implementation can be found here.

Support Vector Classification#

In OctaiPipe configs: ‘svc’ The Support Vector Classification class builds on the sklearn implementation found at this link. Support vector classification is based on Support Vector Machines. This implementation works best for datasets with fewer than 10,000 samples; above that, other models are favoured.

Random Forest Classification#

In OctaiPipe configs: ‘random_forest_class’ Random forest classification class. A random forest is an ensemble of decision trees, whereby the predicted class is taken as the majority vote (averaged class probabilities) of the individual trees. More details about the implementation can be found at this link.

Extreme Gradient Boosting Classifier#

In OctaiPipe configs: ‘xgb_class’ The XGB Classification class builds on the sklearn implementation found at this link. XGBoost is an optimized gradient-boosting library designed to be efficient and flexible.

LightGBM Classification#

In OctaiPipe configs: ‘lgb_class_sk’ Refer to this link for more details about the implemented model.

KMeans Clustering#

In OctaiPipe configs: ‘kmeans’ Check out the sklearn documentation for more details.

Exponential Smoothing#

In OctaiPipe configs: ‘exp_smooth’ Exponential smoothing is a time series model for univariate data that allows for handling trends and seasonality. More information can be found here.
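
A minimal sketch with statsmodels’ ExponentialSmoothing, assuming an additive trend and additive seasonality; the synthetic series and period are illustrative:

import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic univariate series with an upward trend and period-12 seasonality
data = np.sin(np.linspace(0, 16 * np.pi, 96)) + np.linspace(0, 3, 96)

model = ExponentialSmoothing(data, trend="add", seasonal="add", seasonal_periods=12)
fitted = model.fit()
forecast = fitted.forecast(12)  # predict the next 12 steps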

Principal Component Analysis#

In OctaiPipe configs: ‘pca’ Principal Component Analysis is a dimensionality reduction technique used to reduce a high-dimensional dataset (one with a large number of input features) to a smaller number of orthogonal features explaining the highest proportion of variance. More information can be found here.
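
A minimal sklearn sketch; the number of components and the data are illustrative:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)  # 100 samples, 10 input features

pca = PCA(n_components=3)  # keep the 3 strongest orthogonal components
X_reduced = pca.fit_transform(X)  # shape (100, 3)
print(pca.explained_variance_ratio_)  # variance explained per component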

Custom Models#

With OctaiPipe, you can define your own ML model. To create a custom model class, inherit from the base class Model. Currently, custom models have to be defined within custom pipeline steps. Each custom model has to implement a _build_new method, which assigns the model an estimator, e.g. this implementation using LinearRegression from sklearn:

self.estimator = LinearRegression(**kwargs)

It also needs to implement a _to_onnx method, which converts self.estimator to ONNX, e.g. this implementation converting an sklearn model using convert_sklearn from skl2onnx and the data_schema attribute:

self.onnx_estimator = convert_sklearn(self.estimator, initial_types=self.data_schema)
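
Putting both methods together, a minimal custom model might look like the sketch below. The exact import path of the Model base class (and any constructor arguments it expects) depends on your OctaiPipe version, so treat those details as assumptions:

# The import of the OctaiPipe Model base class is omitted here because its
# exact path depends on your OctaiPipe version; consult the OctaiPipe docs.
from sklearn.linear_model import LinearRegression
from skl2onnx import convert_sklearn

class MyLinearModel(Model):

    def _build_new(self, **kwargs):
        # Assign the underlying estimator
        self.estimator = LinearRegression(**kwargs)

    def _to_onnx(self):
        # Convert self.estimator to ONNX using the data_schema attribute
        self.onnx_estimator = convert_sklearn(
            self.estimator, initial_types=self.data_schema
        )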

Model Management#

After training a model, it is desirable to have a way to persist the model for future use without having to retrain. The following sections give you some hints on how to persist an OctaiPipe model.

Each time we run the Model Training Step, the model file is saved both locally and to the cloud. The current implementation of OctaiPipe provides two formats for saving the trained model.

ONNX (Open Neural Network Exchange) is a format designed to represent any type of machine learning or deep learning model. Examples of supported frameworks include PyTorch, TensorFlow, Keras, and many more. In this way, ONNX makes it easier to convert models from one framework to another.
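
As a hedged sketch of the format’s framework independence, the example below converts an sklearn model to ONNX with skl2onnx and runs it with onnxruntime; the input name and shapes are illustrative:

import numpy as np
import onnxruntime as rt
from sklearn.linear_model import LinearRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X = np.random.rand(50, 2).astype(np.float32)
y = X @ np.array([1.0, 2.0], dtype=np.float32)
model = LinearRegression().fit(X, y)

# Convert to ONNX; "input" and its shape form the illustrative data schema
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, 2]))]
)

# Inference no longer depends on sklearn, only on the ONNX runtime
sess = rt.InferenceSession(
    onnx_model.SerializeToString(), providers=["CPUExecutionProvider"]
)
pred = sess.run(None, {"input": X[:5]})[0]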

Serialized Python Objects The model is saved using joblib.dump() and joblib.load(), which provide a replacement for pickle that works efficiently on arbitrary Python objects such as model files.
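
A minimal sketch of this round trip; the file name is illustrative:

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(np.random.rand(20, 2), np.random.rand(20))

joblib.dump(model, "model.joblib")      # persist the trained model to disk
restored = joblib.load("model.joblib")  # reload later without retraining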

Model Versioning#

The ML model versioning convention is aimed at helping the user identify, among models of the same name, which are compatible with the data fed to them, be it data for re-training or for inference. The model version is a string of the form “X.Y”, where X is the major version and Y is the minor version, e.g. “1.2”. The major version characterises the following model metadata: type, solutionType, nClasses, targetLabel, columnNames. Models with the same name and the same major version share the same metadata for these fields; they are differentiated by the minor version, which increments every time a new model is trained.
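
To illustrate the rule, the hypothetical helper below (not part of the OctaiPipe API) checks whether two versions of a same-named model are data-compatible:

# Hypothetical helper, not part of the OctaiPipe API
def is_compatible(version_a: str, version_b: str) -> bool:
    # Same major version means the same type, solutionType, nClasses,
    # targetLabel, and columnNames metadata
    return version_a.split(".")[0] == version_b.split(".")[0]

print(is_compatible("1.2", "1.3"))  # True: only a newer training run
print(is_compatible("1.3", "2.0"))  # False: the metadata fields changed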

See Model Training Step for more detail on model specifications.