Tutorial - Running FL XGBoost with OctaiPipe#

In this tutorial we will take you through a standard Federated Learning deployment using XGBoost tree models. The data used is a small subset of the HIGGS dataset, which is downloaded in Step 2 below. The data in this tutorial is static (i.e. it does not update throughout training).

We will go through the following steps:

  1. Add your devices to the config file

  2. Download and split data

  3. Write data to devices

  4. Setting up the FL experiment

  5. Running the FL experiment

Step 1 - Add your devices to the config file#

Open the config file xgboost_federated_learning.yaml in configs/.

In the device section, enter the names of the devices you want to take part in the experiment. For example:

device_ids: ['device_0', 'device_1']
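As the cell below reads this value via fl_config['infrastructure']['device_ids'], the list sits under the infrastructure section of the YAML (other keys omitted; shown only to make the nesting clear):

infrastructure:
  device_ids: ['device_0', 'device_1']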

Take a minute to familiarise yourself with how the config is structured. Feel free to play with this file to try out different configurations of input/output specs, model params, and strategies.

There is also an xgboost_federated_learning_template_do_not_edit.yaml file containing the original config, in case you need to get back to it. Please do not edit this file, so that the original config is never lost.
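If you do need to return to a clean state, a simple way is to copy the untouched template back over the working config (a small convenience snippet, not part of the tutorial steps themselves):

[ ]:
import shutil

# Restore the working config from the read-only template
shutil.copy('configs/xgboost_federated_learning_template_do_not_edit.yaml',
            'configs/xgboost_federated_learning.yaml')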

[ ]:
import yaml
# Load the federated learning config, check the devices are set correctly,
# and point each device's data specs at its local train/test CSVs
with open("configs/xgboost_federated_learning.yaml", 'r') as file:
    fl_config = yaml.safe_load(file)

devices = fl_config['infrastructure']['device_ids']

# Build train (input) and test (evaluation) data specs for each device
for spec, dataset in [('input_data_specs', 'train'), ('evaluation_data_specs', 'test')]:
    fl_config[spec] = {
        device_id: [
            {
                'datastore_type': 'local',
                'settings': {
                    'query_type': 'csv',
                    'query_config': {
                        'filepath_or_buffer': f'/tmp/higgs_{dataset}_data_{device_id}.csv',
                        'header': 0
                    }
                }
            }
        ]
        for device_id in devices
    }

with open("configs/xgboost_federated_learning.yaml", 'w') as file:
    yaml.safe_dump(fl_config, file, sort_keys=False)

print(f'You have set devices: {devices}')
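As a quick optional check, you can print the data specs the cell above just wrote, using the same fl_config object it loaded:

[ ]:
import json

# Inspect the per-device input data specs now stored in the config
print(json.dumps(fl_config['input_data_specs'], indent=2))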

Step 2 - Download and split data#

Here the HIGGS dataset is split into n chunks, where n is the number of target devices.

[ ]:
# Download the dataset, split it and send to the devices
import pandas as pd
import numpy as np

assert len(devices), "Please update the config file with the list of devices you'll be running on, then re-run the cell above to load the device list."
! mkdir -p configs/datasets/devices/
! mkdir -p /tmp/datasets/
! wget -c https://octaipipe.blob.core.windows.net/higgs-dataset/higgs_data.tar.gz -P /tmp/datasets
! tar -xvf /tmp/datasets/higgs_data.tar.gz -C /tmp/datasets

# Load the data and randomly sample 30% of the rows
train_data = pd.read_csv('/tmp/datasets/higgs_train_data.csv').sample(frac=0.3, random_state=42)
test_data = pd.read_csv('/tmp/datasets/higgs_test_data.csv').sample(frac=0.3, random_state=42)

print('\nSplitting data into chunks for each device...')
train_chunks = np.array_split(train_data, len(devices))
test_chunks = np.array_split(test_data, len(devices))

# Write all chunks to configs/datasets/devices/
for i in range(len(devices)):
    train_chunks[i].to_csv(f'configs/datasets/devices/train_data_{i}.csv', index=False)
    test_chunks[i].to_csv(f'configs/datasets/devices/test_data_{i}.csv', index=False)

print('Data has been downloaded and split into chunks for each device.\nContents of configs/datasets/devices/:')
! ls -ltr configs/datasets/devices/
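Optionally, you can confirm how many rows each device will receive, reusing the chunks created above:

[ ]:
# Optional: report the number of train/test rows assigned to each device
for i, device_id in enumerate(devices):
    print(f'{device_id}: {len(train_chunks[i])} train rows, '
          f'{len(test_chunks[i])} test rows')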

Step 3 - Write data to devices#

The following uses the OctaiPipe data_writing step to send the data split above to the registered devices.

The Python code can be found in ./write_data.py. It ensures the split data is present, completes the config templates with details of the relevant devices and file paths, and starts two data_writing steps on each device: one for writing the train data and one for the test data.

The train and test data will be sent to each of the devices at the paths /tmp/higgs_train_data.csv and /tmp/higgs_test_data.csv respectively. Check the output of the cells to ensure there are no issues downloading and sending the datasets. If possible, check the devices themselves for the presence of the train and test CSVs.

[ ]:
from write_data import write_data_to_devices
write_data_to_devices(devices)

Step 4 - Setting up the FL experiment#

Now the devices are ready to take part in an FL experiment.

Log setup#

Here we set the level of detail we wish to see in the logs.

[ ]:
import logging
import os

os.environ['OCTAIPIPE_DEBUG'] = 'true'
logging.basicConfig(level=logging.INFO, format='%(message)s')

# For more verbose logs, use the following line instead of the one above
# logging.basicConfig(level=logging.DEBUG, format='%(message)s')

OctaiFL context setup#

We then set up the OctaiFL context by passing the config file, a deployment name, and a description to OctaiFL.

[ ]:
from octaipipe.federated_learning.run_fl import OctaiFL

federated_learning_config = 'configs/xgboost_federated_learning.yaml'

octaifl = OctaiFL(
    federated_learning_config,
    deployment_name='FL XGBoost tutorial deployment',
    deployment_description='Deployment part of FL XGBoost tutorial'
    )

Strategy set up#

The strategy describes how models trained on the devices are aggregated by the server. You can check the current (default) FL strategy by running the cell below; the cells that follow show how to update it to suit your requirements (see the docs for more on strategy settings).

[ ]:
octaifl.strategy.get_strategy()

FL XGBoost in OctaiPipe can perform multiple training rounds on each device before model aggregation. To enable this, we will set the num_local_rounds option in the strategy to 2.

If we wanted to reduce the impact of imbalanced dataset sizes across devices, we would set normalized_learning_rate to True in the same way. In this tutorial, the datasets sent to the devices are all the same size, so we leave it unset.
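For illustration only (we do not run this in the tutorial, since the chunks written in Step 2 are equal in size), that update would be:

strategy = {'normalized_learning_rate': True}
octaifl.strategy.set_strategy_values(strategy)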

Let’s now update the strategy, and print it again so we can see how it has changed.

[ ]:
strategy = {
    'num_rounds': 5,
    'num_local_rounds': 2
}

octaifl.strategy.set_strategy_values(strategy)
octaifl.strategy.get_strategy()

Step 5 - Running the FL experiment#

Finally, now that we are all set, we can run the experiment.

[ ]:
octaifl.run()

Checking the processes#

There are now two sets of processes running, the server (in Kubernetes) and the clients (on your devices).

The Jupyter Notebook will stream the logs of the server and display them below the running cell.

You can also log into a device and get the client deployment logs by running docker logs -f <id of container>.

You can also visualise the progress of the experiment on the OctaiPipe portal; the Jupyter Notebook logs will provide a link to it.
