Active Learning#

Manually labelling a dataset for training a supervised model can be costly, as it often means labelling a large volume of data. In industrial IoT use cases, data is often localised at the edge and cannot be transferred to a central server, making manual labelling even more challenging, if not impossible. This localisation of data often calls for a federated approach.

Several approaches can address the challenge of data labelling in a federated way, either by automating labelling or by assisting manual labelling to reduce its cost.

OctaiPipe comes with a federated clustering algorithm, called k-FED Clustering, which enables the user to assign consistent cluster labels to unlabelled edge datasets (see Edge Labelling). The cluster labels can then be given physical meanings and used as target labels for a supervised model. Being unsupervised, this method can be used whether or not the dataset contains any labelled data.
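To illustrate the idea behind one-shot federated clustering in the spirit of k-FED (this is a framework-agnostic sketch, not OctaiPipe's actual API; all function and variable names here are hypothetical), each client clusters its local data and shares only its centroids; the server then clusters those centroids once to produce global centroids, which every client uses to assign cluster labels that agree across devices:

```python
import math
from statistics import mean

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with a deterministic init (first k points),
    kept minimal for illustration."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster empties out
                centroids[j] = [mean(coord) for coord in zip(*g)]
    return centroids

def one_shot_federated_clustering(client_datasets, k):
    """Each client clusters locally and shares only its centroids; the
    server clusters those centroids once (a single round) to get k global
    centroids; each client then assigns every local point to its nearest
    global centroid, so cluster ids are consistent across clients."""
    local_centroids = []
    for data in client_datasets:          # runs independently on each edge client
        local_centroids.extend(kmeans(data, k))
    global_centroids = kmeans(local_centroids, k)   # one server-side round
    labels = [
        [min(range(k), key=lambda c: math.dist(p, global_centroids[c]))
         for p in data]
        for data in client_datasets       # assignment again happens on-device
    ]
    return global_centroids, labels

# Two clients, each holding points from the same two well-separated clusters
clients = [
    [(0, 0), (0.2, 0), (10, 10), (10.2, 10)],
    [(0.1, 0.1), (10.1, 10.1), (9.9, 9.9), (0, 0.2)],
]
centroids, labels = one_shot_federated_clustering(clients, k=2)
```

Because only centroids leave each device, the raw data stays local, and the single server round is what makes the scheme one-shot.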

In the case where the dataset is partially labelled, one can adopt active learning, which aims to augment an existing labelled dataset to be used to train a supervised model down the pipeline. In active learning, only a small subset of the unlabelled instances is selected for manual labelling by an ‘oracle’ (e.g. a human expert). The aim is for this data subset to include the most diverse, representative unlabelled samples with minimal redundancy in the information they contain, so that when added to the labelled set, they maximise the improvement in supervised model performance. Therefore, active learning can greatly reduce the cost of manual labelling while maximising the benefit of each newly labelled sample.

The selection of a small unlabelled subset to label in active learning can be automated using a clustering algorithm, as follows. Upon clustering unlabelled data, diverse, representative samples are selected from each cluster: typically a few samples from the cluster centre, and the most uncertain samples within the cluster (e.g. those near the cluster boundaries). The samples are then labelled by an oracle and added to the training set for a supervised model.
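As a toy sketch of this selection step (the function name and heuristics here are illustrative, not part of OctaiPipe), one could pick, per cluster, the sample closest to the centroid as a representative and the sample with the smallest margin between its two nearest centroids as an uncertain, near-boundary candidate:

```python
import math

def select_for_labelling(points, labels, centroids):
    """Illustrative per-cluster selection: one representative sample
    (closest to its cluster centroid) plus one uncertain sample (the
    smallest margin between its two nearest centroids, i.e. close to a
    cluster boundary). Returns the indices to send to the oracle."""
    selected = set()
    for c, centre in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue
        # Representative: the member nearest its own centroid
        rep = min(members, key=lambda i: math.dist(points[i], centre))
        # Uncertain: the member whose two nearest centroids are almost equidistant
        def margin(i):
            d = sorted(math.dist(points[i], m) for m in centroids)
            return d[1] - d[0]
        unc = min(members, key=margin)
        selected.update({rep, unc})
    return sorted(selected)

# Toy data: two clusters, with point 5 sitting near the boundary between them
points = [(0, 0), (1, 0), (0.9, 1.1), (5, 5), (4.8, 5.2), (2.9, 2.6)]
labels = [0, 0, 0, 1, 1, 1]
centroids = [(0.5, 0.4), (5, 5)]
picked = select_for_labelling(points, labels, centroids)
```

Combining a centre sample with a boundary sample per cluster is one simple way to balance representativeness against uncertainty in the labelling budget.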

OctaiPipe offers the functionality needed to implement federated active learning, thanks to our supervised federated learning and unsupervised k-FED federated clustering implementations. A suggested iterative workflow is as follows:

  1. Each edge client has a local dataset that stays on the edge, which consists of a labelled subset and an unlabelled subset.

  2. The user runs OctaiPipe’s k-FED Clustering over the local unlabelled data subsets, after which each local unlabelled instance receives a global cluster assignment.

  3. The user writes and runs an OctaiPipe custom step (see Guide - Custom Pipeline Steps) which, on each edge client and for each global cluster, selects a specified number or fraction of the local samples closest to, and farthest from, the cluster centroid. These diverse, representative data subsets are then flagged for manual labelling.

  4. The newly labelled instances are appended to the original labelled instances. The user then runs OctaiFL (see Supervised Federated Learning) on the resulting enlarged labelled datasets to train a supervised model with federated learning.
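The selection in step 3 might be sketched as follows. This is an illustrative, self-contained function, not OctaiPipe's custom-step API; in a real custom step the global k-FED centroids would come from the pipeline rather than as a plain argument:

```python
import math

def flag_for_labelling(local_points, cluster_ids, global_centroids, n_per_cluster=1):
    """Runs on one edge client: for each global cluster, flag the
    n samples closest to the centroid (most representative) and the
    n farthest from it (most atypical) for manual labelling.
    All names here are hypothetical stand-ins for the custom-step code
    the user would write."""
    flagged = set()
    for c, centre in enumerate(global_centroids):
        members = [i for i, cid in enumerate(cluster_ids) if cid == c]
        ranked = sorted(members, key=lambda i: math.dist(local_points[i], centre))
        flagged.update(ranked[:n_per_cluster])    # closest to the centroid
        flagged.update(ranked[-n_per_cluster:])   # farthest from the centroid
    return sorted(flagged)

# One client's unlabelled data, already carrying global cluster assignments
local_points = [(0, 0), (1, 1), (2, 2), (10, 10), (11, 11), (12, 12)]
cluster_ids = [0, 0, 0, 1, 1, 1]
global_centroids = [(0, 0), (11, 11)]
to_label = flag_for_labelling(local_points, cluster_ids, global_centroids)
```

Because each client only ranks its own points against the shared global centroids, this step needs no data movement between devices.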

By using federated clustering, we are able to select groups of instances that differ from each other - they come from different clusters - for manual labelling; this adds diversity to the selected instances.

References#

Albayati et al. Semi-Supervised Machine Learning for Fault Detection and Diagnosis of a Rooftop Unit. https://ieeexplore.ieee.org/document/10026516

Dennis, Li and Smith. Heterogeneity for the Win: One-Shot Federated Clustering. https://arxiv.org/abs/2103.00697

Abraham and Dreyfus-Schmidt. Sample Noise Impact on Active Learning. https://arxiv.org/abs/2109.01372