Federated Learning#

Federated learning (FL) is a machine-learning paradigm introduced by Google in 2016 that helps avoid the privacy issues that come with sharing raw data. In the past, training a machine-learning model meant collecting all available data and training on it in a centralised manner. With federated learning, edge devices upload their locally trained models to a server instead of uploading their data to a centralised location. The FL server aggregates these local models to create a global model. The high-level comparison between these two methods is illustrated in the figure below.

Typical FL training involves the following steps.

A machine-learning process triggers deployment of the FL server.
The FL server randomly selects available devices (FL clients) and asks them to train on their own local data to create their own local models. These local models are initialised from the global model sent by the FL server.
Once training is completed, the selected FL clients push their local models to the FL server.
The FL server aggregates the local models into a new version of the global model once it has received the required number of them. The aggregation rule is based on the strategy defined by the user.
The FL server starts another round of training by randomly selecting available devices and sending them the latest global model.
The FL training process terminates after a set number of training rounds.
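
The steps above can be sketched in a few lines of plain Python. This is a toy simulation under stated assumptions, not OctaiPipe's or Flower's implementation: the "model" is a single weight, local training nudges it towards the mean of the client's data, and aggregation is an average weighted by sample count. All function names and client IDs are hypothetical.

```python
import random


def train_locally(global_model, local_data, lr=0.5):
    """Toy local training: nudge each weight towards the client's data mean."""
    target = sum(local_data) / len(local_data)
    return [w + lr * (target - w) for w in global_model]


def aggregate(updates):
    """Average client models, weighting each by its local sample count."""
    total = sum(n for _, n in updates)
    size = len(updates[0][0])
    return [sum(model[i] * n for model, n in updates) / total for i in range(size)]


def run_federated_training(clients, num_rounds=10, clients_per_round=2, seed=0):
    """clients maps a (hypothetical) client ID to its private list of samples."""
    rng = random.Random(seed)
    global_model = [0.0]  # the server starts from an initial global model
    for _ in range(num_rounds):  # stop after a fixed number of rounds
        # Randomly select clients for this round.
        selected = rng.sample(sorted(clients), clients_per_round)
        # Each client trains locally; only (model, sample count) leaves the device.
        updates = [
            (train_locally(global_model, clients[c]), len(clients[c]))
            for c in selected
        ]
        # The server aggregates the local models into a new global model.
        global_model = aggregate(updates)
    return global_model


clients = {
    "client-a": [1.0, 1.0],
    "client-b": [3.0, 3.0],
    "client-c": [2.0, 2.0, 2.0, 2.0],
}
model = run_federated_training(clients)
```

Note that the raw samples never appear in `updates`; only the locally trained weights and the sample counts reach the aggregation step.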

From the process above, we can identify the key components of FL. These are:

FL infrastructure, including devices and FL server.
The device selection algorithm.
The model aggregation algorithm.
The neural network model for this experiment.

Note that FL can treat machine learning as one part of a larger pipeline, which makes it a good fit for OctaiPipe.

To perform FL using OctaiPipe, as with other machine-learning features such as AutoML, we use YML to define the FL infrastructure and its processes. Device selection and model aggregation are together referred to as the “strategy” in OctaiPipe, which can be defined through a Python interface, normally via a Jupyter notebook in OctaiLab. As with the strategy, you can either use a predefined neural network model provided by OctaiPipe or define your own in Python code.

Since FL is one of OctaiPipe’s features, you can run FL as one of the many steps in a pipeline that OctaiPipe offers. This user manual introduces how to perform FL using OctaiPipe. We start with the prerequisites for using FL with OctaiPipe. After that, the required steps and their details are discussed. We end with the programmer’s guide to the related libraries.

Prerequisites#

Some essential steps must be completed before performing federated learning.

Collect and register device information with OctaiPipe.
Prepare the YAML configuration for FL and the related pipeline.

Steps and Results#

Performing FL with OctaiPipe involves several essential processes, which together lead to the creation of a machine-learning model. Everything begins with the registration of the devices. After that, a definition file is required before the respective container images are deployed. Once the container image is deployed to each device, you can start the federated learning task with a simple command. Information about each process is stored in the database and forms the summary report in OctaiPipe. As always, a machine-learning model is created when the learning process is completed. The following sections describe these processes in detail.

Device Registration#

This is the first step of any FL experiment using Octaipipe. Please refer to Register a Device for more details.

FL Server Deployment#

Once devices are registered in OctaiPipe, “kubectl” is used under the hood to deploy the k8s server. This server plays an important role: it deploys the container images and the FL server, which aggregates models from the selected edge devices. OctaiPipe creates this server for you. All you need to do is prepare a YAML configuration file and call octaifl.run() to deploy. You can turn log output off by setting the stream_logs argument to False: octaifl.run(stream_logs=False).

Define Pipeline Steps for FL#

After devices have been properly registered and initialised, the next step is to provide the FL specification. An FL specification is described by its infrastructure, the typical OctaiPipe configurations relating to input, output and training specs, and the FL-specific aggregation strategy. We will demonstrate how to perform federated learning in OctaiPipe and begin by discussing the YML below.

 name: federated_learning

 infrastructure:
   device_ids: [FL-dev-01, FL-dev-02, FL-dev-03, FL-dev-04]
   device_groups: [group_1]
   image_tag: 2.5.0
   env:
     ENV_VAR_0: value_0
     ENV_VAR_1: value_1

 strategy:
   fraction_fit: 1.0
   fraction_evaluate: 1.0
   min_fit_clients: 2
   min_evaluate_clients: 2
   min_available_clients: 2
   evaluate_metrics_aggregation_fn: weighted_average
   num_rounds: 10
   initial_model: None
   save_best_model: true

 input_data_specs:
   default:
     - datastore_type: influxdb
       settings:
         query_type: dataframe
         query_template_path: ./configs/data/influx_query_def.txt
         query_config:
           start: "2022-11-10T00:00:00.000Z"
           stop: "2022-11-11T00:00:00.000Z"
           bucket: cmapss-bucket
           measurement: sensors-raw
   FL-dev-01:
     - datastore_type: influxdb
       settings:
         query_type: dataframe
         query_template_path: ./configs/data/influx_query_def.txt
         query_config:
           start: "2022-11-10T00:00:00.000Z"
           stop: "2022-11-11T00:00:00.000Z"
           bucket: cmapss-bucket
           measurement: sensors-raw

 evaluation_data_specs:
   default:
     - datastore_type: influxdb
       settings:
         query_type: dataframe
         query_template_path: ./configs/data/influx_query_eval.txt
         query_config:
           start: "2022-11-10T00:00:00.000Z"
           stop: "2022-11-11T00:00:00.000Z"
           bucket: cmapss-bucket
           measurement: sensors-raw
   FL-dev-01:
     - datastore_type: influxdb
       settings:
         query_type: dataframe
         query_template_path: ./configs/data/influx_query_eval.txt
         query_config:
           start: "2022-11-10T00:00:00.000Z"
           stop: "2022-11-11T00:00:00.000Z"
           bucket: cmapss-bucket
           measurement: sensors-raw

 model_specs:
   type: base_torch
   load_existing: false
   name: test_torch
   model_load_specs:
     version: '000'
   model_params:
     loss_fn: mse
     scaling: standard
     metric: rmse
     epochs: 10
     batch_size: 32

 run_specs:
   target_label: RUL
   cycle_id: "Machine number"
   backend: pytorch

As the YAML above shows, an FL YAML consists of these blocks:

infrastructure
strategy
input_data_specs
evaluation_data_specs
model_specs
run_specs

Note

The output_data_specs block does not need to be set for Federated Learning Pipelines and will not do anything if it is included in the YML configuration.

Use input_data_specs and evaluation_data_specs to describe the data for the FL experiment. OctaiPipe lets you specify the query config for each device individually via query_config['devices']. The following sections introduce the infrastructure, model_specs, run_specs and strategy blocks.
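
In the example YAML, input_data_specs and evaluation_data_specs are keyed by `default` and by a device ID. A minimal sketch of how such a mapping could be resolved is given below. It assumes that a device-specific entry takes precedence and that `default` is the fallback; this precedence is an assumption, and `spec_for_device` is a hypothetical helper, not part of OctaiPipe. The `sensors-dev-01` measurement is likewise made up for illustration.

```python
def spec_for_device(data_specs, device_id):
    """Return the device-specific entry if present, else the 'default' entry."""
    return data_specs.get(device_id, data_specs["default"])


# Parsed form of a trimmed-down input_data_specs block.
input_data_specs = {
    "default": [
        {"datastore_type": "influxdb",
         "settings": {"query_config": {"measurement": "sensors-raw"}}}
    ],
    "FL-dev-01": [
        {"datastore_type": "influxdb",
         "settings": {"query_config": {"measurement": "sensors-dev-01"}}}
    ],
}

dev_01_spec = spec_for_device(input_data_specs, "FL-dev-01")  # device-specific
dev_02_spec = spec_for_device(input_data_specs, "FL-dev-02")  # falls back to default
```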

You can validate your config using the validate_config_file function from the develop module of OctaiPipe’s Python Interface.
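
Before running a full validation, a quick structural check can catch a missing block early. The sketch below is a hypothetical helper working on an already-parsed config dictionary. It is not OctaiPipe's validate_config_file, whose signature is not shown on this page. The expected block names are the ones listed above, and the ignored block is the output_data_specs block mentioned in the note.

```python
EXPECTED_BLOCKS = (
    "infrastructure",
    "input_data_specs",
    "evaluation_data_specs",
    "model_specs",
    "run_specs",
)
IGNORED_BLOCKS = ("output_data_specs",)  # has no effect in FL pipelines


def check_blocks(config):
    """Return (missing, ignored) top-level block names for a parsed FL config."""
    missing = [name for name in EXPECTED_BLOCKS if name not in config]
    ignored = [name for name in IGNORED_BLOCKS if name in config]
    return missing, ignored


config = {
    "name": "federated_learning",
    "infrastructure": {"device_ids": ["FL-dev-01", "FL-dev-02"]},
    "input_data_specs": {},
    "evaluation_data_specs": {},
    "model_specs": {"type": "base_torch"},
    "output_data_specs": {},
}
missing, ignored = check_blocks(config)
```

Here `missing` would contain `run_specs`, and `ignored` would flag `output_data_specs` as present but unused.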

infrastructure#

From the point of view of the high-level machine-learning life cycle, there is no difference between centralised ML and FL, so the YML configuration is largely the same. The only difference in YML between FL and typical ML is an additional block describing the infrastructure, which is not required for typical machine learning. An example of this infrastructure block is presented below.

infrastructure:
  device_ids: [FL-dev-01, FL-dev-02, FL-dev-03]
  device_groups: [group_1]
  image_tag: "3.0.3"
  env:
    ENV_VAR_0: value_0
    ENV_VAR_1: value_1

Key	Value Description	Note	Layer
`infrastructure`	Marks the block that follows as the infrastructure definition	required field	1
`device_ids`	The FL clients for this experiment, given as a list of edge device IDs	optional field	2
`device_groups`	List of device groups to run FL on (will be added to the device ID list)	optional field	2
`image_tag`	Tag of the image to use, if a specific one is needed (e.g. 3.0.3). Defaults to the version of OctaiPipe in the Jupyter Notebooks Server	optional field	2
`env`	Dictionary of environment variable names and values to set on the device for the experiment	optional field	2
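
Since devices from device_groups are added to the device ID list, the final set of FL clients is the combination of both fields. A minimal sketch of that combination follows, assuming a hypothetical `group_members` lookup from group name to device IDs and assuming duplicates are dropped. How OctaiPipe resolves groups internally is not described here.

```python
def resolve_devices(device_ids, device_groups, group_members):
    """Combine explicit device IDs with group members, keeping order, no duplicates."""
    resolved = list(dict.fromkeys(device_ids))  # de-duplicate, preserve order
    for group in device_groups:
        for device in group_members.get(group, []):
            if device not in resolved:
                resolved.append(device)
    return resolved


# Hypothetical group registry, for illustration only.
group_members = {"group_1": ["FL-dev-02", "FL-dev-04"]}

fl_clients = resolve_devices(
    device_ids=["FL-dev-01", "FL-dev-02", "FL-dev-03"],
    device_groups=["group_1"],
    group_members=group_members,
)
```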

model_specs#

A typical model_specs block for an FL experiment is shown below:

model_specs:
  type: base_pytorch
  load_existing: false
  name: test_torch
  model_load_specs:
    version: 3
  model_params:
    loss_fn: mse
    scaling: standard
    metric: rmse
    epochs: 10
    batch_size: 32
  custom_model:
    file_path: ./path/to/model.py

The difference from other types of ML experiment is that one has to assign a custom_model file_path within this block.

run_specs#

This defines the target label, cycle id and FL Framework of the experiment. An example is shown below:

run_specs:
  target_label: RUL
  cycle_id: "Machine number"
  backend: pytorch

Key	Value Description	Note	Layer
`target_label`	The name of the label to use as the target for training the model	required	1
`backend`	The framework to use. See documentation on frameworks. If not provided and an FL model built into OctaiPipe is used, the backend can be inferred. Currently supported: ‘pytorch’, ‘sklearn’, ‘xgboost’	optional	1
`cycle_id`	The name of the column that identifies an operating cycle. If provided, the validation split is grouped on this column.	optional	1
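
Grouping the validation split on cycle_id means that all rows from one operating cycle end up on the same side of the split, so a cycle never leaks between training and validation data. The sketch below illustrates the idea in plain Python. The function name, the validation fraction and the shuffling are assumptions for illustration, not OctaiPipe's actual splitting code.

```python
import random


def grouped_split(rows, cycle_id, validation_fraction=0.2, seed=0):
    """Split rows so that every cycle is entirely in train or entirely in validation."""
    cycles = sorted({row[cycle_id] for row in rows})
    random.Random(seed).shuffle(cycles)
    n_val = max(1, round(len(cycles) * validation_fraction))
    val_cycles = set(cycles[:n_val])
    train = [row for row in rows if row[cycle_id] not in val_cycles]
    val = [row for row in rows if row[cycle_id] in val_cycles]
    return train, val


# Five machines with three readings each (made-up values).
rows = [
    {"Machine number": machine, "RUL": 100 - step}
    for machine in range(1, 6)
    for step in range(3)
]
train, val = grouped_split(rows, cycle_id="Machine number")
```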

strategy#

Octaipipe supports two federated learning (FL) strategies tailored to different types of machine learning models: one for PyTorch and scikit-learn models, and another for XGBoost models. Both of these are extensions of Flower’s built in flwr.server.strategy.FedAvg strategy. See Aggregation Strategies for more details.

These strategies incorporate specific defaults to ensure compatibility and optimal performance with their respective model types.

Flower documents the built-in strategy options of FedAvg in [their documentation](https://flower.ai/docs/framework/ref-api/flwr.server.strategy.FedAvg.html#fedavg). Of these, only the following are supported in OctaiPipe:

For PyTorch and scikit-learn Models:#

fraction_fit (float, optional): The fraction of clients to use during training. If the specified minimum number of clients (min_fit_clients) is greater than the available fraction (fraction_fit * number of clients), then the minimum number of clients will still be sampled. The default value is 1.0.
fraction_evaluate (float, optional): The fraction of clients to use during validation. If the specified minimum number of clients (min_evaluate_clients) is greater than the available fraction (fraction_evaluate * number of clients), then the minimum number of clients will still be sampled. The default value is 1.0.
min_fit_clients (int, optional): The minimum number of clients to use during training. The default value is 2.
min_evaluate_clients (int, optional): The minimum number of clients to use during validation. The default value is 2.
min_available_clients (int, optional): The minimum total number of clients that must be available in the system. The default value is 2.
evaluate_metrics_aggregation_fn (str): Maps to a function for aggregating metrics during validation. Defaults to weighted average. Additional evaluation aggregation functions can be provided in the custom StrategyConfig class.
num_rounds (int, optional): The number of rounds to train for. The default value is 10.
round_timeout (float, optional): Time to wait until aggregation goes ahead regardless of number of clients that have sent model updates. Defaults to None, which means no timeout.

Additional Options for PyTorch and scikit-learn:

initial_model (str, optional): To initiate global model with an existing model, specify the model_id of the model here. Defaults to None.
save_best_model (bool, optional): Whether to save the model from the last or best (based on metric specified in model_specs) communication round. Defaults to False.

Default Options:
fraction_fit and fraction_evaluate set to 1.0: Fraction of Clients Participating in Training and Evaluation (100%)
min_fit_clients and min_evaluate_clients set to 2: Minimum Number of Clients required for both training and evaluation, with a total minimum of 2 available clients (min_available_clients)
evaluate_metrics_aggregation_fn: ‘weighted_average’: Evaluation Metrics Aggregation Function (Weighted average)
num_rounds: 10: Number of Training Rounds
round_timeout: None: No timeout, always wait for enough clients to aggregate
initial_model: None: Initial Model Weights (Can be specified; otherwise, defaults to initializing with random client model weights)
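
The interaction between fraction_fit and min_fit_clients described above (and likewise fraction_evaluate and min_evaluate_clients) comes down to taking the larger of two numbers. The helper below is a small illustrative sketch of that rule with a hypothetical name; it is not a function exposed by OctaiPipe.

```python
def num_clients_to_sample(num_available, fraction, min_clients):
    """Clients sampled in a round: the fraction of available clients, but never
    fewer than the configured minimum."""
    return max(int(num_available * fraction), min_clients)


# 10 available clients, fraction_fit=0.1 gives 1 client, so min_fit_clients=2 wins.
small_fraction = num_clients_to_sample(10, fraction=0.1, min_clients=2)

# 10 available clients, fraction_fit=0.5 gives 5 clients, above the minimum of 2.
half = num_clients_to_sample(10, fraction=0.5, min_clients=2)

# Defaults (fraction 1.0, minimum 2) with 4 devices: all 4 take part.
defaults = num_clients_to_sample(4, fraction=1.0, min_clients=2)
```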

For XGBoost Models:#

fraction_fit (float, optional): The fraction of clients to use during training. If the specified minimum number of clients (min_fit_clients) is greater than the available fraction (fraction_fit * number of clients), then the minimum number of clients will still be sampled. The default value is 1.0.
fraction_evaluate (float, optional): The fraction of clients to use during validation. If the specified minimum number of clients (min_evaluate_clients) is greater than the available fraction (fraction_evaluate * number of clients), then the minimum number of clients will still be sampled. The default value is 1.0.
min_fit_clients (int, optional): The minimum number of clients to use during training. The default value is 2.
min_evaluate_clients (int, optional): The minimum number of clients to use during validation. The default value is 2.
min_available_clients (int, optional): The minimum total number of clients that must be available in the system. The default value is 2.
evaluate_metrics_aggregation_fn (str): Maps to a function for aggregating metrics during validation. Defaults to weighted average. Additional evaluation aggregation functions can be provided in the custom StrategyConfig class.
num_rounds (int, optional): The number of rounds to train for. The default value is 10.

Additional Options for XGBoost:

num_local_rounds (int, optional): The number of local rounds of training to perform on the device. The default value is 1.
normalized_learning_rate (bool, optional): Whether to normalize the learning rate based on the number of samples each client contributes. The default value is False.

Default Strategy Options:
fraction_fit and fraction_evaluate set to 1.0: Fraction of Clients Participating in Training and Evaluation (100%)
min_fit_clients and min_evaluate_clients set to 2: Minimum Number of Clients required for both training and evaluation, with a total minimum of 2 available clients (min_available_clients)
evaluate_metrics_aggregation_fn: ‘weighted_average’: Evaluation Metrics Aggregation Function (Weighted average)
num_rounds: 10: Number of Training Rounds
num_local_rounds: 1: Number of Local Rounds (Indicating that each client will perform one round of training locally before aggregation)
normalized_learning_rate: False: Normalized Learning Rate (Disabled by default, but can be enabled to adjust the learning rate based on the number of samples each client contributes)
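
Both option lists name weighted_average as the default evaluate_metrics_aggregation_fn. In Flower, such a function receives one (number of examples, metrics dictionary) pair per client and returns a single metrics dictionary. The sketch below shows a typical weighted average over that input, assuming each client's metric is weighted by its number of examples. Whether OctaiPipe's built-in function weights in exactly this way is an assumption.

```python
def weighted_average(metrics):
    """Aggregate per-client metrics, weighting each client by its example count.

    metrics: list of (num_examples, {metric_name: value}) pairs, one per client.
    """
    total = sum(num_examples for num_examples, _ in metrics)
    names = metrics[0][1].keys()
    return {
        name: sum(num_examples * m[name] for num_examples, m in metrics) / total
        for name in names
    }


# Two clients: the one with 30 examples counts three times as much.
aggregated = weighted_average([(10, {"rmse": 1.0}), (30, {"rmse": 2.0})])
```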

[See here](Customise Strategy Parameters) for how to customise a strategy.

Federated PyTorch Model#

Once the FL infrastructure and its strategy are defined, the next step is to define the neural network model. OctaiPipe currently supports using PyTorch to define the neural network, and you can define your own PyTorch model in code. An example is shown below:

from octaipipe.model_classes.fl_aquarium.base_pytorch import BasePytorch
import torch.nn as nn
import torch.nn.functional as F

class CustomModel(BasePytorch):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def _build_model(self, input_shape, output_shape):
        '''Builds model when class is initialized.
        Args:
            input_shape: number of columns in X.
            output_shape: number of columns in y
        '''
        self.dense1 = nn.Linear(input_shape, input_shape * 2)
        self.drop1 = nn.Dropout(p=0.5)
        self.dense2 = nn.Linear(input_shape * 2, input_shape * 2)
        self.drop2 = nn.Dropout(p=0.5)
        self.dense3 = nn.Linear(input_shape * 2, output_shape)

    def forward(self, x):
        '''Defines how data is forward propagated'''
        x = F.relu(self.dense1(x))
        x = self.drop1(x)
        x = F.relu(self.dense2(x))
        x = self.drop2(x)
        x = self.dense3(x)
        return x
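
Because each FL client sends its model back to the server every round, it can be useful to know how many parameters a custom model carries. The sketch below works this out by hand for the three linear layers defined above, using the fact that a PyTorch nn.Linear layer with bias holds n_in * n_out weights plus n_out biases, and that dropout layers hold no parameters. The helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Parameters in a linear layer with bias: weights plus biases."""
    return n_in * n_out + n_out


def custom_model_params(input_shape, output_shape):
    """Parameter count of the dense1 -> dense2 -> dense3 stack shown above."""
    hidden = input_shape * 2
    return (
        linear_params(input_shape, hidden)     # dense1
        + linear_params(hidden, hidden)        # dense2
        + linear_params(hidden, output_shape)  # dense3
    )


# 10 input columns and 1 target column: 220 + 420 + 21 parameters.
n_params = custom_model_params(input_shape=10, output_shape=1)
```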

Deploy and perform FL#

To run FL, use the code below:

from octaipipe.federated_learning.run_fl import OctaiFL

FlYml = 'path to definition file'
octaiFl = OctaiFL(FlYml)
octaiFl.run()
# octaiFl.run(stream_logs=False) # to turn off log output
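
A thin wrapper around the call above can catch a wrong path before any deployment starts. This is a minimal sketch: `run_fl` is a hypothetical helper, and it only uses the OctaiFL constructor and the run(stream_logs=...) call shown above. The import is deferred so that the path check itself does not depend on OctaiPipe being installed.

```python
from pathlib import Path


def run_fl(config_path, stream_logs=True):
    """Check that the FL definition file exists, then start the FL run."""
    path = Path(config_path)
    if not path.is_file():
        raise FileNotFoundError(f"FL definition file not found: {path}")

    # Imported here so the check above works without OctaiPipe available.
    from octaipipe.federated_learning.run_fl import OctaiFL

    octai_fl = OctaiFL(str(path))
    octai_fl.run(stream_logs=stream_logs)
```

Calling `run_fl('path to definition file', stream_logs=False)` would then behave like the snippet above with log output turned off.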

The FL infrastructure should be torn down automatically at the end of the run, but if you need to tear it down manually you can use the Python statement below:

octaifl_teardown(deployment_id='<deployment-id>')

Summary Report and Detailed Experiment Info#

When an FL experiment starts, you may want to see how it progresses. OctaiPipe offers two APIs to retrieve this information: summary reports with the latest information, and detailed information via the experiments submodule.

import pandas as pd
from octaipipe import experiments

experiments.get_experiment_by_id('2ecab1ab')

With the code above, you can retrieve detailed information about a specific experiment once its experiment_id is known, by calling experiments.get_experiment_by_id().

Detailed experiment info provides complete experiment information within the experiment’s lifespan. If an experiment has ten rounds of communication, it will provide ten detailed experiment records.
Detailed experiment info provides the following information:
experimentId
experimentDescription
date
userId
communicationRound
currentStatus
createDatetime
experimentStatus
flConfigFile
flServer
flStrategy
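
Since there is one detailed record per communication round, the most recent state of an experiment is the record with the highest communicationRound. The sketch below assumes the records are available as a list of dictionaries keyed by the field names listed above. The exact return type of get_experiment_by_id is not shown here, and the records and status values are made up for illustration.

```python
def latest_record(records):
    """Return the record from the most recent communication round."""
    return max(records, key=lambda record: record["communicationRound"])


# Hypothetical records using field names from the list above.
records = [
    {"experimentId": "2ecab1ab", "communicationRound": 1, "currentStatus": "done"},
    {"experimentId": "2ecab1ab", "communicationRound": 3, "currentStatus": "running"},
    {"experimentId": "2ecab1ab", "communicationRound": 2, "currentStatus": "done"},
]
latest = latest_record(records)
```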

Machine Learning Model#

Once training and the experiment are complete, you may want to use the model for inference. OctaiPipe also provides a models submodule, through which you can retrieve the model by experimentId or modelId. Example code to retrieve the model is shown below. Once the model information is retrieved, you can use it to perform model inference. We suggest using the model name and its version number when performing inference.

from octaipipe import models

models.find_models_by_experiment_id('2ecab1ab')

Generally, each communication round creates an updated model. OctaiPipe also provides an API to retrieve evaluation information for these models through the EvaluationClient API. The example code below gets the model evaluation information by modelId.

import pandas as pd
from octaipipe_core.client.evaluation_client import EvaluationClient

ev_client = EvaluationClient()
evaluation = ev_client.get_evaluation_model_id('16826f33')
pd.DataFrame(evaluation).sort_values('communicationRound')
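
Per-round evaluation records also make the effect of the save_best_model strategy option easy to picture: the saved model comes either from the last round or from the round with the best metric. The sketch below assumes each record is a dictionary with a communicationRound field and a metric field named `rmse`, where lower is better. The metric field name and the sample values are hypothetical, and this is not OctaiPipe's own selection code.

```python
def select_round(evaluations, metric="rmse", save_best_model=False):
    """Pick the last round, or the round with the lowest metric if requested."""
    ordered = sorted(evaluations, key=lambda e: e["communicationRound"])
    if save_best_model:
        return min(ordered, key=lambda e: e[metric])
    return ordered[-1]


# Made-up evaluation records for three communication rounds.
evaluations = [
    {"communicationRound": 1, "rmse": 24.0},
    {"communicationRound": 2, "rmse": 18.5},
    {"communicationRound": 3, "rmse": 19.2},
]
last_round = select_round(evaluations)
best_round = select_round(evaluations, save_best_model=True)
```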

Federated Learning Details#

If you want to understand the FL libraries that Octaipipe provides, please refer to the following sections for more detail.