Intermittent Devices#

A common problem in Federated Learning is that clients involved in model training become unavailable for portions of the training. This can happen due to temporary network outages, device restarts, or other factors that cause clients to go offline during training. When clients disconnect, aggregation of the global model can be delayed or even fail, as the server may not receive the necessary updates from all clients. The Flower server, which aggregates client responses, will exclude a device from an experiment if it disconnects during a round it was participating in, even if the device later reconnects.

To help with situations like this, OctaiPipe provides configurations that make experiments more robust to intermittent client availability. By adjusting these settings, you can reduce the likelihood of experiment failure due to client disconnections.

Below are the recommendations and how to configure them in OctaiPipe.

Fractional Participation#

One way to allow the FL server to proceed with global model aggregation without all clients being available is to set up fractional participation for the experiment. This configures the FL server to proceed with fitting the global model (or running evaluation) with only a fraction of the connected clients.

The single most impactful configuration for mitigating intermittent devices is a low min_available_clients setting. This allows the server to proceed with a round even if only a small fraction of clients are available. Depending on the severity of the intermittency, it can be helpful to set this value as low as 20% of the total number of devices.

Alongside this, fraction_fit and fraction_evaluate define the fraction of connected clients (not the fraction of all devices in the experiment) required to participate in each round of training and evaluation respectively.

Here is an example of how to configure the FL strategy for fractional participation:

from octaipipe.federated_learning.run_fl import OctaiFL

config_path = 'path to definition file'
octaifl = OctaiFL(config_path)

strategy = {
    'min_available_clients': 20,  # Out of a total of 100 devices
    'fraction_fit': 0.4,
    'fraction_evaluate': 0.2
}

octaifl.strategy.set_config(strategy)

octaifl.run()

By reducing fraction_fit and fraction_evaluate, you increase the chance that the server can proceed even if some clients are unavailable. However, it’s important to note that setting these values too low may affect the quality of the global model, as fewer client updates are used for aggregation.

In the above example, the round will not commence until at least 20 devices have connected. If 25 devices connect, 10 (25*0.4) will be required to take part in fitting the model (i.e. training) and 5 (25*0.2) will be required to take part in evaluation.
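
As a quick sanity check on this arithmetic, the short sketch below computes how many clients are sampled for fitting and evaluation from the number of connected clients. It is illustrative only: the helper name is not part of the OctaiPipe API, and it assumes the product of connected clients and fraction is truncated to a whole number (which matches the figures above, where the products are exact).

# Illustrative sketch only; clients_needed is not an OctaiPipe function.
def clients_needed(connected: int, fraction: float) -> int:
    # Assumes the fraction is applied to connected clients and truncated.
    return int(connected * fraction)

connected = 25
print(clients_needed(connected, 0.4))  # 10 clients sampled for fitting
print(clients_needed(connected, 0.2))  # 5 clients sampled for evaluation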

Larger Batches and Fewer Local Epochs#

Another recommendation is to configure your model training to have shorter local training durations. This can be achieved by using larger batch sizes and fewer local epochs. Larger batch sizes allow the same amount of data to be processed in fewer iterations, reducing the training time per epoch. Shorter rounds reduce the chance that a client disconnects during training, making the experiment more resilient to intermittency.

Below is an example of how to adjust the model_params section of the model_specs in your federated learning configuration to set fewer local epochs and use larger batch sizes:

model_specs:
  type: base_torch
  load_existing: false
  name: test_torch
  model_load_specs:
    version: 3
  model_params:
    loss_fn: mse
    scaling: standard
    metric: rmse
    epochs: 2          # Fewer local epochs
    batch_size: 64     # Larger batch size (be mindful of device capacity)

Note

Be mindful of device capacity when increasing batch sizes. Devices with limited memory may not be able to handle very large batch sizes.

By reducing the number of local epochs and increasing batch sizes, each round takes less time to complete, reducing the window during which clients might disconnect.
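
To make the effect concrete, the back-of-the-envelope sketch below counts the local gradient steps a client performs per round under two hypothetical configurations. The dataset size and settings are made-up assumptions for illustration; actual round duration also depends on model size and device hardware.

import math

# Made-up example values; not taken from a real OctaiPipe deployment.
n_samples = 10_000  # samples held by one client

def steps_per_round(epochs: int, batch_size: int) -> int:
    # Total gradient updates performed locally in one FL round.
    return epochs * math.ceil(n_samples / batch_size)

print(steps_per_round(epochs=10, batch_size=16))  # 6250 steps: long local training
print(steps_per_round(epochs=2, batch_size=64))   # 314 steps: much shorter round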

Increase Number of Rounds#

Since the measures used to mitigate device intermittency can impact model performance, you might need to compensate by increasing the number of communication rounds. This allows the model to converge over more aggregated updates.

You can set the number of communication rounds in the FL strategy configuration:

from octaipipe.federated_learning.run_fl import OctaiFL

config_path = 'path to definition file'
octaifl = OctaiFL(config_path)

strategy = {
    'num_rounds': 50
}

octaifl.strategy.set_config(strategy)

octaifl.run()

Additional Considerations#

In intermittent scenarios, even with appropriate settings, the experiment may still fail if too many clients disconnect during a round. This is because the FL server is designed to proceed when the minimum threshold of participating clients is met. However, if additional clients disconnect mid-round, the number of participating clients may drop below the required minimum, causing the round to fail.

To mitigate this, consider:

  • Increasing the initial pool of connected clients: Encourage as many clients as possible to connect at the start of each round, increasing the buffer for potential disconnections.

  • Monitoring client availability: Keep track of which clients frequently disconnect and consider excluding them from the experiment if they are unreliable.

  • Adjusting min_available_clients: Setting min_available_clients to a lower value increases the chance that the server can proceed, but may affect model quality due to fewer updates.

  • Understanding the trade-offs: Speeding up rounds by reducing local epochs and increasing batch sizes can make the experiment more resilient to intermittency but may require more rounds to achieve the desired model performance.

Rule of Thumb: The faster the rounds, the more resilient the experiment will be to intermittency. However, speeding up rounds requires configuration changes that might degrade model performance. To counterbalance this, other settings such as increasing the number of rounds might be necessary.
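
Putting these recommendations together, the sketch below combines a lowered min_available_clients, reduced participation fractions and an increased number of rounds in a single strategy, using the same OctaiFL interface shown earlier. The specific numbers are illustrative assumptions for a fleet of roughly 100 frequently disconnecting devices, not recommended defaults.

from octaipipe.federated_learning.run_fl import OctaiFL

config_path = 'path to definition file'
octaifl = OctaiFL(config_path)

# Example values for a highly intermittent fleet; tune to your own deployment.
strategy = {
    'min_available_clients': 20,  # ~20% of a 100-device fleet
    'fraction_fit': 0.4,
    'fraction_evaluate': 0.2,
    'num_rounds': 50              # more rounds to compensate for fewer updates per round
}

octaifl.strategy.set_config(strategy)
octaifl.run()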

Remember that Federated Learning in environments with intermittent device availability can be challenging, and configuring your experiment to be resilient involves balancing trade-offs between model performance and robustness to disconnections.

By carefully adjusting the above configurations, you can improve the resilience of your Federated Learning experiments in the face of intermittent client availability.