Intermittent Devices

A common problem in Federated Learning is that clients involved in model training become unavailable for parts of the training run.

For example, a temporary network outage might take a portion of the devices in the experiment offline for 10 minutes. Aggregation of the global model then cannot go ahead, as not all clients provide the server with their local model updates.

To help with situations like this, OctaiPipe provides three configurations that make experiments more robust to intermittent devices.

Below are the recommendations and how to configure them in OctaiPipe.

Fractional Participation

One way to allow the FL server to run global model aggregation without all clients being available is to set up fractional participation for the experiment. This configures the FL server to go ahead with fitting the global model (or running evaluation) with only a fraction of the available clients.

Below is an example where the FL strategy is set with fraction_fit and fraction_evaluate, which lets the server go ahead with fitting or evaluating the global model once a fraction of the clients is available.

from octaipipe.federated_learning.run_fl import OctaiFL

config_path = 'path to definition file'
octaifl = OctaiFL(config_path)

# Wait for at least 5 clients, then fit/evaluate with 80% of them
strategy = {
    'min_available_clients': 5,
    'fraction_fit': 0.8,
    'fraction_evaluate': 0.8
}

octaifl.strategy.set_config(strategy)

octaifl.run()
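In this example, the server waits until at least 5 clients are connected, but each round of training or evaluation only requires 80% of the available clients to respond, so a single round is not blocked by the remaining 20%.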

Smaller batches

The next recommendation only applies to neural network models. It involves setting a lower number of local epochs and a higher number of FL communication rounds for the experiment. If each round takes less time to complete, the server spends less time waiting for stragglers.

Fewer local epochs can be set in the model_params section of the model_specs in the federated learning config. See the example below, where epochs is set to 5 in model_params.

model_specs:
  type: base_torch
  load_existing: false
  name: test_torch
  model_load_specs:
    version: 3
  model_params:
    loss_fn: mse
    scaling: standard
    metric: rmse
    epochs: 5
    batch_size: 32

To then set the number of communication rounds to 50 instead of the default (10), set the FL strategy as below:

from octaipipe.federated_learning.run_fl import OctaiFL

config_path = 'path to definition file'
octaifl = OctaiFL(config_path)

# Run 50 communication rounds instead of the default 10
strategy = {
    'num_rounds': 50
}

octaifl.strategy.set_config(strategy)

octaifl.run()
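As a rough guide, 5 local epochs over 50 rounds amounts to the same 250 epochs of local training as 25 epochs over the default 10 rounds, but each round completes sooner, so an intermittent device that misses a round loses less progress.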

Round timeout

The final configuration for intermittent devices is to set a maximum timeout for a communication round in the experiment. If the FL server has waited more than the configured number of seconds for a client's results, the clients that have not contributed are considered failed and are excluded from the experiment instead of stalling the server.

While this does not directly help the experiment wait for or accommodate intermittent devices, it lets the experiment go ahead without devices that take too long to respond. If a timeout is set and fraction_fit is, for example, 0.7, the experiment can go ahead even if 30% of devices time out and cannot participate.

The round timeout can be set as below:

from octaipipe.federated_learning.run_fl import OctaiFL

config_path = 'path to definition file'
octaifl = OctaiFL(config_path)

strategy = {
    'round_timeout': 60  # number of seconds to wait until timeout
}

octaifl.strategy.set_config(strategy)

octaifl.run()
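The three configurations can also be combined. Below is a minimal sketch that sets fractional participation, a higher number of rounds, and a round timeout in a single strategy. The values are illustrative rather than recommendations, and fewer local epochs are still set separately in model_params as shown above.

from octaipipe.federated_learning.run_fl import OctaiFL

config_path = 'path to definition file'
octaifl = OctaiFL(config_path)

# Illustrative values: wait for at least 5 clients, proceed with 70%
# of them, run 50 shorter rounds, and fail clients that take more
# than 60 seconds to respond
strategy = {
    'min_available_clients': 5,
    'fraction_fit': 0.7,
    'fraction_evaluate': 0.7,
    'num_rounds': 50,
    'round_timeout': 60
}

octaifl.strategy.set_config(strategy)

octaifl.run()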