FL Train Step#

When you trigger an FL train step, a server is set up which communicates with the FL clients hosted on edge devices. On the edge devices, this is handled by an OctaiPipe pipeline step (see OctaiPipe Steps) called the FL Train Step. This guide walks through the FL Train Step, showing how to configure it and how it can be extended using custom pipeline steps.

Methods in the FL Step#

The FL step inherits from the base PipelineStep in OctaiPipe. It uses PipelineStep methods as well as its own to set up and run FL training. The following methods are implemented in the FL Train Step:

  • __init__

  • _get_model

  • load_datasets

  • run

The __init__ method initializes the class by initializing the PipelineStep parent class, checking the input and evaluation data specs using the _check_data method, and initializing the model using the _get_model method.

The _get_model method checks the model_specs to see if the model type is in the model mapping of the default OctaiPipe models. If not, it attempts to retrieve the model from a local custom mapping, or to download it from blob storage.
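The lookup order can be sketched as follows. This is an illustrative stand-in, not OctaiPipe's actual internals: the mapping names and the download_from_blob helper are hypothetical.

```python
# Hypothetical sketch of the _get_model lookup order described above.
DEFAULT_MODEL_MAPPING = {"base_torch": "BaseTorchModel"}  # native OctaiPipe models
CUSTOM_MODEL_MAPPING = {}                                 # locally registered custom models


def download_from_blob(model_type: str):
    # Placeholder: the real step would fetch the model definition
    # from blob storage rather than raising immediately.
    raise FileNotFoundError(f"Model '{model_type}' not found in any mapping")


def get_model(model_specs: dict):
    model_type = model_specs["type"]
    # 1. Try the default OctaiPipe model mapping
    if model_type in DEFAULT_MODEL_MAPPING:
        return DEFAULT_MODEL_MAPPING[model_type]
    # 2. Fall back to a local custom mapping
    if model_type in CUSTOM_MODEL_MAPPING:
        return CUSTOM_MODEL_MAPPING[model_type]
    # 3. Finally, attempt to download the model from blob storage
    return download_from_blob(model_type)
```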

The load_datasets method uses the PipelineStep's _load_data method to load first the training dataset and then the test dataset. It is called from the model's setup_loaders method, so that users can define their own data generators in custom models. For more information on how to use custom models, see the documentation on custom FL models, Federated PyTorch.

The run method is the one that actually runs federated learning. It does so by calling the setup_loaders method on the model class, setting up the relevant client for the framework, and running the client. The run method takes the server_ip as an argument to hand to the client.
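The control flow described above can be sketched roughly as below. This is a simplified, hypothetical skeleton, assuming stand-ins for the model, data loading, and client; it is not the actual OctaiPipe implementation.

```python
# Minimal sketch of the FL Train Step control flow described above.
class FLTrainStep:
    def __init__(self, input_data_specs, evaluation_data_specs, model_specs, **kwargs):
        # Validate the data specs, then resolve and build the model
        self._check_data(input_data_specs, evaluation_data_specs)
        self.input_data_specs = input_data_specs
        self.evaluation_data_specs = evaluation_data_specs
        self.model = self._get_model(model_specs)

    def _check_data(self, *specs):
        if not all(specs):
            raise ValueError("input and evaluation data specs are required")

    def _get_model(self, model_specs):
        return {"type": model_specs["type"]}  # placeholder for the resolved model

    def load_datasets(self):
        # First the training dataset, then the test dataset,
        # both via the (stand-in) _load_data method
        train = self._load_data(self.input_data_specs)
        test = self._load_data(self.evaluation_data_specs)
        return train, test

    def _load_data(self, specs):
        return f"data for {sorted(specs)}"  # stand-in for PipelineStep._load_data

    def run(self, server_ip, **run_specs):
        # In the real step the model's setup_loaders calls load_datasets;
        # here it is called directly for brevity, then a client would be
        # created and run against server_ip.
        train, test = self.load_datasets()
        return f"client for {self.model['type']} trained against {server_ip}"
```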

Configuring the FL train step#

Below is an example of the config file used to set up federated learning. The infrastructure field is not included in the configuration passed to the FL Train Step itself. The run_specs are popped and given to the run method, and the rest are given to the step on initialization.

name: federated_learning

infrastructure:
  server: kubernetes
  backup_server: [deviceId]
  device_ids: [FL-01, FL-02, FL-03, FL-04]

input_data_specs:
  default:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/data/influx_query_def.txt
        query_config:
          start: "2022-11-10T00:00:00.000Z"
          stop: "2022-11-11T00:00:00.000Z"
          bucket: cmapss-bucket
          measurement: sensors-raw
          tags: {}
  FL-01:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/data/influx_query_1.txt
        query_config:
          start: "2022-11-10T00:00:00.000Z"
          stop: "2022-11-11T00:00:00.000Z"
          bucket: cmapss-bucket
          measurement: sensors-raw
          tags: {}
  FL-02:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/data/influx_query_2.txt
        query_config:
          start: "2022-11-10T00:00:00.000Z"
          stop: "2022-11-11T00:00:00.000Z"
          bucket: cmapss-bucket
          measurement: sensors-raw
          tags: {}
  FL-03:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/data/influx_query_3.txt
        query_config:
          start: "2022-11-10T00:00:00.000Z"
          stop: "2022-11-11T00:00:00.000Z"
          bucket: cmapss-bucket
          measurement: sensors-raw
          tags: {}
  FL-04:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/data/influx_query_4.txt
        query_config:
          start: "2022-11-10T00:00:00.000Z"
          stop: "2022-11-11T00:00:00.000Z"
          bucket: cmapss-bucket
          measurement: sensors-raw
          tags: {}

evaluation_data_specs:
  default:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/data/influx_query_eval_def.txt
        query_config:
          start: "2022-11-10T00:00:00.000Z"
          stop: "2022-11-11T00:00:00.000Z"
          bucket: cmapss-bucket
          measurement: sensors-raw
          tags: {}
  FL-01:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/data/influx_query_eval_1.txt
        query_config:
          start: "2022-11-10T00:00:00.000Z"
          stop: "2022-11-11T00:00:00.000Z"
          bucket: cmapss-bucket
          measurement: sensors-raw
          tags: {}
  FL-02:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/data/influx_query_eval_2.txt
        query_config:
          start: "2022-11-10T00:00:00.000Z"
          stop: "2022-11-11T00:00:00.000Z"
          bucket: cmapss-bucket
          measurement: sensors-raw
          tags: {}
  FL-03:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/data/influx_query_eval_3.txt
        query_config:
          start: "2022-11-10T00:00:00.000Z"
          stop: "2022-11-11T00:00:00.000Z"
          bucket: cmapss-bucket
          measurement: sensors-raw
          tags: {}
  FL-04:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/data/influx_query_eval_4.txt
        query_config:
          start: "2022-11-10T00:00:00.000Z"
          stop: "2022-11-11T00:00:00.000Z"
          bucket: cmapss-bucket
          measurement: sensors-raw
          tags: {}

model_specs:
  type: base_torch
  load_existing: false
  name: test_torch
  model_load_specs:
    version: '000'
  model_params:
    loss_fn: mse
    scaling: standard
    metric: rmse
    epochs: 10
    batch_size: 32

run_specs:
  target_label: RUL
  cycle_id: "Machine number"
  backend: pytorch

The input_data_specs and evaluation_data_specs define how the training and evaluation data are retrieved. The output_data_specs are not used by the current FL Train Step, but they can be used to save data in custom implementations.

The model_specs define which model to use, whether a native OctaiPipe model or a custom model. Important here are the model_params, which are handed to the model on initialization. For the default PyTorch model, these include the number of epochs, the loss function, and the batch size.
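The hand-off of model_params can be illustrated as below; BaseTorchModel here is a hypothetical stand-in with a constructor matching the parameters from the example config, not OctaiPipe's actual class.

```python
# model_params from the example config above
model_params = {
    "loss_fn": "mse",
    "scaling": "standard",
    "metric": "rmse",
    "epochs": 10,
    "batch_size": 32,
}


class BaseTorchModel:
    # Hypothetical stand-in for the default PyTorch model
    def __init__(self, loss_fn, scaling, metric, epochs, batch_size):
        self.loss_fn = loss_fn
        self.scaling = scaling
        self.metric = metric
        self.epochs = epochs
        self.batch_size = batch_size


# The step unpacks model_params into the model on initialization
model = BaseTorchModel(**model_params)
```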

The run_specs, as mentioned, are passed to the run method. This requires a target_label (the outcome variable column name) to be defined. The cycle_id is the column which defines an operating cycle; the data can be grouped on this column so that the training and validation sets contain a certain proportion of cycles rather than a proportion of rows. The backend determines which FL client to use; for example, for PyTorch this would be "pytorch".
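Putting the two hand-offs together, the way the step might consume the config can be sketched as follows, with a trimmed-down dict standing in for the parsed YAML file; the splitting logic mirrors the description above but is illustrative only.

```python
# Trimmed-down stand-in for the parsed YAML config above
config = {
    "name": "federated_learning",
    "input_data_specs": {"default": []},
    "evaluation_data_specs": {"default": []},
    "model_specs": {"type": "base_torch"},
    "run_specs": {
        "target_label": "RUL",
        "cycle_id": "Machine number",
        "backend": "pytorch",
    },
}

# run_specs are popped off and handed to the run method...
run_specs = config.pop("run_specs")

# ...while the remaining fields are given to the step on initialization
init_kwargs = config
```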

Making a custom FL train step#

To make a completely customized FL Train Step, the user can define a custom pipeline step. This guide will not go through that in detail, but it is worth noting that a custom pipeline step needs to implement a run method which initializes a client and starts it, connecting it to the server_ip from the run_specs.

To implement a custom step, it is also important to understand any model class being used, whether it is a native OctaiPipe model or a custom model.
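A minimal skeleton satisfying the constraint above might look like the following; FLClient is a hypothetical placeholder for the framework client, not a real OctaiPipe class.

```python
class FLClient:
    # Hypothetical placeholder for a framework-specific FL client
    def __init__(self, model, server_ip):
        self.model = model
        self.server_ip = server_ip

    def start(self):
        return f"training {self.model} against {self.server_ip}"


class CustomFLTrainStep:
    def __init__(self, model_specs, **kwargs):
        # A custom step should understand the model class it wraps,
        # whether native OctaiPipe or custom
        self.model = model_specs["type"]

    def run(self, server_ip, **run_specs):
        # The one hard requirement: initialize a client and start it,
        # linked to the server_ip from the run_specs
        client = FLClient(self.model, server_ip)
        return client.start()
```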

To get more information on custom OctaiPipe pipeline steps, see this guide: Custom Pipeline Steps

To further understand the base PyTorch model and to understand how to implement a custom PyTorch model, see this guide: Federated PyTorch