k-FED Clustering#

Introduction#

A major obstacle for machine learning in the industrial sector is the lack of labelled data for training supervised models. Data labelling may be impossible due to restricted data access, or impractical due to the labour and cost involved.

Therefore, a trustworthy, automated, unsupervised data labelling framework is crucial to unlocking the potential of industrial IoT. Such frameworks often come in the form of clustering, which can be used directly for anomaly detection or to generate labels for training supervised models.

To this end, we have identified a state-of-the-art unsupervised federated clustering algorithm, the k-FED algorithm.

The k-FED algorithm#

The k-FED algorithm is a one-shot algorithm that builds on the k-means algorithm. K-means separates data points into clusters, assigning each data point to the cluster whose centroid (the mean value in feature space) is closest. Below is an image showing the iterative process: centroids are set, data points are assigned, and centroids and cluster assignments are updated until the model converges.

K-means iterative approach
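The assign-then-update loop above can be sketched in plain NumPy. This is a minimal illustration of the generic k-means algorithm, not OctaiPipe's implementation; the function name and data are made up for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: repeatedly assign points to the nearest
    centroid, then move each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs around (0, 0) and (5, 5)
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5, 0.1, (20, 2))])
centroids, labels = kmeans(X, k=2)
```

On this toy data the two blobs end up in separate clusters, with one centroid near each blob's mean.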

The k-FED algorithm trains a k-means model at each device, using the k-means++ initialization. The cluster centroids of each local model are then transferred to the FL server, where a global k-means model is trained from the local centroids.

The resulting model has cluster separations based on the information from all participating devices.

In order to get local cluster assignments that are consistent across devices, the global model is sent back down to each device and the local cluster assignments are updated based on the global model. Each data point is assigned to the global cluster whose centroid is closest to the point's local centroid.

k-FED process
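The two-stage process can be sketched with scikit-learn. This is an illustrative sketch of the scheme described above, not OctaiPipe's implementation; the simulated devices, machine states, and parameter values are invented for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Four hypothetical global machine states, as points in a 2-D feature space
states = np.array([[0, 0], [5, 0], [0, 5], [5, 5]], dtype=float)

# Three simulated devices, each observing only two of the four states
device_state_ids = [[0, 1], [1, 2], [2, 3]]
device_data = [
    np.vstack([rng.normal(states[s], 0.2, size=(30, 2)) for s in ids])
    for ids in device_state_ids
]

# Stage 1 (on each device): a local k-means model with k-means++ initialisation
local_models = [
    KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
    for X in device_data
]

# Stage 2 (on the server): global k-means over the pooled local centroids
all_centroids = np.vstack([m.cluster_centers_ for m in local_models])
global_model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(all_centroids)

# Back on device 0: map each local cluster to the global cluster whose
# centroid is closest to the local centroid, giving consistent labels
device0 = local_models[0]
mapping = global_model.predict(device0.cluster_centers_)
global_labels = mapping[device0.labels_]
```

Even though no device sees all four states, the global model recovers all of them, and each device's points receive globally consistent cluster identities.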

Setting the k parameter#

The key parameter to set in k-means clustering is the number of clusters, k. This assumes that we have some a priori knowledge of the data, being able to estimate how many clusters we expect to see.

The same parameter is present in k-FED, but it is somewhat more involved. In k-FED, the user sets two parameters, k_local and k_global, which specify the number of clusters each local device fits and the number of clusters the global model will have, respectively.

To anchor this to the real world, imagine we measure normal and failure states of machines in 5 separate locations. There might be a total of 10 failure states across the 5 locations. However, each individual location might only display 4 of these, with different states being present at different locations.

The k-FED algorithm allows local models to identify states at each location, and the global model then creates clusters representing the full distribution of states across locations.

It is worth noting that k_global has to be equal to or greater than k_local.

To further evaluate the k-FED model as well as the clusters it creates, you can use the OctaiPipe documentation on Evaluating k-FED.

Running k-FED in OctaiPipe#

Training a k-FED model#

Training a k-FED model is done through OctaiPipe’s FL framework, OctaiFL. The same code and configuration file as for a supervised FL run can be used. There are two main differences from supervised FL.

  1. Set the model specs for k-FED. This means setting the model type to kFED and the model parameters for the local and global models:

model_specs:
  name: my_kfed_model
  type: kFED
  model_params:
    local_params:
      n_clusters: 4
    global_params:
      n_clusters: 8
  2. No model evaluation will occur

This means that the evaluation data specs will not be used. We suggest setting the evaluation data specs to the same values as the input data specs to avoid errors.

It also means that the target_label in the run_specs should be empty.

Below is a sample run_specs for k-FED. Note that the backend is kmeans.

run_specs:
  backend: kmeans
  cycle_id:
  target_label:

Inference with k-FED#

There are two ways to run inference with the k-FED algorithm: continuously, making predictions at a set interval, or once, getting cluster identities for a specific dataset and then stopping.

Running once#

To run inference once and get the global cluster identities for a local dataset, set the model type in your inference config to kFED, and provide the name and version number.

You will also want to disable batch prediction in the run_specs, so that all data is loaded at once, and set max_iterations to 1, so that inference stops after the predictions have run once.

Example model_specs and run_specs are below:

model_specs:
  name: my_kfed_model
  type: kFED
  version: '5'

run_specs:
  disable_batch_prediction: true
  max_iterations: 1
  only_last_batch: false
  onnx_pred: false
  prediction_period: 30s
  save_results: true

Running continuously#

If you want to run inference continuously and get cluster predictions as data comes in, you will need to configure the run_specs slightly differently.

It is also worth noting that the initial local model will be created from the data available for that iteration. This means that if you have 5 local clusters, you will need at least 5 data points at each iteration.
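OctaiPipe's exact behaviour when an iteration has too little data is not shown here, but the underlying constraint can be demonstrated with a generic k-means implementation (scikit-learn, purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Only 3 data points arrived in this iteration's batch
X = np.random.default_rng(0).random((3, 2))

try:
    # A local model with 5 clusters cannot be fitted from 3 samples
    KMeans(n_clusters=5, n_init=10).fit(X)
except ValueError as err:
    print(f"Cannot fit local model: {err}")
```

Fitting k clusters always requires at least k samples, so the prediction period should be long enough for each batch to contain at least k_local data points.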

An example config for continuous unsupervised inference is shown below:

model_specs:
  name: my_kfed_model
  type: kFED
  version: '5'

run_specs:
  only_last_batch: false
  onnx_pred: false
  prediction_period: 1m
  save_results: true