FL Train Step#
When you trigger an FL train step, a server is set up that communicates with the FL clients hosted on edge devices. On the edge devices, this is handled by an OctaiPipe pipeline step (see OctaiPipe Steps) called the FL Train Step. This guide walks through the FL Train Step, showing how to configure it and how to extend it using custom pipeline steps.
Methods in the FL Step#
The FL step inherits from the base PipelineStep in OctaiPipe. It uses PipelineStep methods as well as its own to set up and run FL training. The following methods are implemented in the FL Train Step:
__init__
_check_data
_get_model
load_datasets
run
The __init__ method initializes the class by initializing the PipelineStep parent class, checking the input and evaluation data specs using the _check_data method, and initializing the model using the _get_model method.
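The structure described so far can be sketched as a class skeleton. Only the five documented method names come from this guide; the base-class shape, attribute names, and placeholder bodies below are illustrative assumptions, not OctaiPipe's actual implementation.

```python
# Illustrative skeleton of the FL Train Step. Only the five documented
# method names are taken from the docs; everything else is an assumption.
class PipelineStep:
    """Stand-in for OctaiPipe's base pipeline step."""
    def __init__(self, **specs):
        self._specs = specs

class FLTrainStep(PipelineStep):
    def __init__(self, input_data_specs, evaluation_data_specs, model_specs):
        super().__init__()                                # init parent class
        # resolve the spec entry for this device (with default fallback)
        self._input_data_specs = self._check_data(input_data_specs)
        self._evaluation_data_specs = self._check_data(evaluation_data_specs)
        self._model = self._get_model(model_specs)        # resolve the model

    def _check_data(self, specs):
        return specs                                      # placeholder

    def _get_model(self, model_specs):
        return model_specs.get("type")                    # placeholder

    def load_datasets(self):
        """Load the train data, then reuse _load_data for the test data."""

    def run(self, server_ip):
        """Set up the framework's FL client and run it against server_ip."""
```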
The _check_data method goes through the input_data_specs and evaluation_data_specs to find the entry relevant to the device the FL step is running on. It returns the input_data_specs fields for the device ID that the step is running on, or falls back to the values found under default if no device ID match is found.
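The lookup behaviour can be sketched as follows. The helper name and the spec layout mirror the example config later in this guide; both are assumptions, not OctaiPipe's actual code.

```python
# Hypothetical sketch of the lookup _check_data performs: find the entry
# for this device ID, otherwise fall back to the "default" entry.
def resolve_device_specs(data_specs, device_id):
    """Return the entry matching device_id, falling back to 'default'."""
    by_device = {entry["device"]: entry for entry in data_specs["devices"]}
    return by_device.get(device_id, by_device["default"])

specs = {"devices": [
    {"device": "default", "query_template_path": "./configs/data/influx_query_def.txt"},
    {"device": "FL-01", "query_template_path": "./configs/data/influx_query_1.txt"},
]}
resolve_device_specs(specs, "FL-01")   # matched entry for FL-01
resolve_device_specs(specs, "FL-99")   # no match -> "default" entry
```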
The _get_model method checks the model_specs to see whether the model type is in the model mapping of default OctaiPipe models. If not, it attempts to retrieve the model from a local custom mapping or download it from blob storage.
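The resolution order can be sketched as a fallback chain. All names below are illustrative assumptions; only the three sources (default mapping, local custom mapping, blob storage) come from the description above.

```python
# Hypothetical sketch of the resolution order _get_model follows:
# default OctaiPipe mapping -> local custom mapping -> blob storage.
DEFAULT_MODELS = {"base_torch": "default PyTorch model"}

def get_model(model_type, custom_models, download_from_blob):
    if model_type in DEFAULT_MODELS:
        return DEFAULT_MODELS[model_type]         # native OctaiPipe model
    if model_type in custom_models:
        return custom_models[model_type]          # local custom mapping
    return download_from_blob(model_type)         # last resort: blob storage
```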
The load_datasets method uses the PipelineStep's _load_data method to first load the training data, then assigns self._evaluation_data_specs to self._input_data_specs so that the _load_data method can also be used to get the test dataset. This method gets called in the model's setup_loaders method, so that users can define their own generators using custom models. For more information on how to use custom models, see the documentation on custom FL models, Custom PyTorch Model.
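The hand-off between the model's setup_loaders and the step's load_datasets might look like the following sketch. Besides those two documented method names, everything here (the batching helper, the stub step, the loader attributes) is an assumption for illustration.

```python
# Illustrative only: a model's setup_loaders calls back into the step's
# load_datasets and wraps the results in whatever generators it wants.
def batched(data, n):
    """Split data into batches of size n."""
    return [data[i:i + n] for i in range(0, len(data), n)]

class Model:
    def setup_loaders(self, step, batch_size=2):
        train, test = step.load_datasets()   # step loads train + test data
        self.train_loader = batched(train, batch_size)
        self.test_loader = batched(test, batch_size)

class StubStep:
    """Stand-in step returning toy train/test datasets."""
    def load_datasets(self):
        return [1, 2, 3, 4], [5, 6]
```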
The run method is the method that actually runs federated learning. It does so by calling the setup_loaders method in the model class, setting up the relevant client for the framework, and running the client. The run method takes the server_ip as an argument to hand to the client.
Configuring the FL train step#
Below is an example of the config file used to set up federated learning. The infrastructure field is not passed to the FL Train Step itself. The run specs are popped and given to the run method, and the remaining specs are given to the step on initialization.
name: federated_learning

infrastructure:
  server: kubernetes
  backup_server: [deviceId]
  device_ids: [FL-01, FL-02, FL-03, FL-04]

input_data_specs:
  devices:
  - device: default
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/data/influx_query_def.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: cmapss-bucket
      measurement: sensors-raw
      tags: {}
  - device: FL-01
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/data/influx_query_1.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: cmapss-bucket
      measurement: sensors-raw
      tags: {}
  - device: FL-02
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/data/influx_query_2.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: cmapss-bucket
      measurement: sensors-raw
      tags: {}
  - device: FL-03
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/data/influx_query_3.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: cmapss-bucket
      measurement: sensors-raw
      tags: {}
  - device: FL-04
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/data/influx_query_4.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: cmapss-bucket
      measurement: sensors-raw
      tags: {}
  data_converter: {}

evaluation_data_specs:
  devices:
  - device: default
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/data/influx_query_eval_def.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: cmapss-bucket
      measurement: sensors-raw
      tags: {}
  - device: FL-01
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/data/influx_query_eval_1.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: cmapss-bucket
      measurement: sensors-raw
      tags: {}
  - device: FL-02
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/data/influx_query_eval_2.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: cmapss-bucket
      measurement: sensors-raw
      tags: {}
  - device: FL-03
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/data/influx_query_eval_3.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: cmapss-bucket
      measurement: sensors-raw
      tags: {}
  - device: FL-04
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/data/influx_query_eval_4.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: cmapss-bucket
      measurement: sensors-raw
      tags: {}
  data_converter: {}

model_specs:
  type: base_torch
  load_existing: false
  name: test_torch
  model_load_specs:
    version: '000'
  model_params:
    loss_fn: mse
    scaling: standard
    metric: rmse
    epochs: 10
    batch_size: 32

run_specs:
  target_label: RUL
  cycle_id: "Machine number"
  backend: pytorch
The input_data_specs and evaluation_data_specs define the configuration for how to get the training and evaluation data. The output_data_specs are not used in the current FL Train Step but can be used for saving any data in custom implementations.
The model_specs define which model to use, whether a native OctaiPipe model or a custom model. Important here are the model_params, which get handed to the model on initialization. For the default PyTorch model, these include things such as the number of epochs, which loss function to use, and the batch size.
The run_specs, as mentioned, are passed to the run method. This requires a target_label (outcome variable column name) to be defined. The cycle_id is the column which defines an operating cycle; the data can be grouped on this so that the training and validation sets contain data for a certain proportion of cycles rather than a proportion of rows of data. The backend defines which FL client to use. For example, for PyTorch this would be "pytorch".
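The spec handling described earlier (run_specs popped off and handed to run, the rest passed at initialization) might look like the following sketch. The exact step class and server address are hypothetical.

```python
# Hypothetical sketch of how the config is consumed: run_specs are popped
# and given to run(); the remaining specs go to the step's __init__.
config = {
    "model_specs": {"type": "base_torch"},
    "run_specs": {"target_label": "RUL", "backend": "pytorch"},
}
run_specs = config.pop("run_specs")      # removed before initialization
# step = FLTrainStep(**config)           # remaining specs -> __init__
# step.run(server_ip="...", **run_specs) # run specs -> run()
```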
Making a custom FL train step#
In order to make a completely customized FL Train Step, the user can define a custom pipeline step. This guide will not go through in detail how that is done, but it is worth noting that a custom pipeline step needs to implement a run method which initializes a client and starts it, linking it to the server_ip from the run_specs.
To implement a custom step, it is also important to understand any model class being used, whether it is a native OctaiPipe model or a custom model.
To get more information on custom OctaiPipe pipeline steps, see this guide: Custom Pipeline Steps
To further understand the base PyTorch model and to understand how to implement a custom PyTorch model, see this guide: Custom PyTorch Model
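Under the contract described in this section, a minimal custom FL train step might look like the following sketch. Only the run(server_ip) requirement comes from this guide; the client class and every other name are illustrative assumptions, so consult the guides above for the real interfaces.

```python
# Hypothetical custom FL train step; only the run/server_ip contract
# comes from the docs. FLClient stands in for a framework client.
class FLClient:
    def __init__(self, model, server):
        self.model, self.server = model, server
        self.started = False

    def start(self):
        self.started = True      # a real client would connect and train

class CustomFLTrainStep:
    def __init__(self, model_specs, **_other_specs):
        self._model = model_specs["type"]

    def run(self, server_ip, **run_specs):
        # a custom run must initialize a client and start it,
        # linking it to the FL server at server_ip
        client = FLClient(model=self._model, server=server_ip)
        client.start()
        return client
```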