Tutorial - Adversarial Fortification for FL XGBoost#
This tutorial goes through the process of running FL XGBoost in a setting where malicious clients intend to sabotage model performance.
We intentionally set up devices to impede model performance, and show how to use OctaiPipe’s Adversarial Fortification toolkit to protect against such attacks.
We will go through the following steps:

- Getting the dataset
- Preprocessing and partitioning the dataset
- Introducing problematic data
- Saving the data locally and sending it to the devices
- Setting up and running FL XGBoost without adversarial fortification
- Re-running FL XGBoost with adversarial fortification
Getting the dataset#
First, we need to get the dataset we’re using. We will use the Garment dataset from UC Irvine’s Machine Learning Repository.
This is a tabular, industrial time-series dataset for predicting employee productivity from features of the garment production process.
[ ]:
!pip install ucimlrepo
[ ]:
from ucimlrepo import fetch_ucirepo
# fetch dataset
data = fetch_ucirepo(id=597)
X = data['data']['features']
y = data['data']['targets']
Preprocess and partition dataset#
Next, we preprocess the dataset and partition it into 4 parts. The resulting object datasets is a list of dictionaries, with the train and test splits accessible like datasets[0]['train'].
[ ]:
from garment_helper_funcs import preprocess_garment_data, partition_data, send_df_as_csv_file, save_data_locally
data = preprocess_garment_data(X, y)
datasets = partition_data(data)
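As a quick sanity check, you can inspect the shape of each partition. This is just a sketch, assuming the helper functions return pandas DataFrames under the 'train' and 'test' keys:
[ ]:
# Sanity check: each partition should contain a train and a test split
# (assumes the helpers return pandas DataFrames)
for idx, part in enumerate(datasets):
    print(f"partition {idx}: train={part['train'].shape}, test={part['test'].shape}")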
Introduce problematic data#
To degrade model performance, we will now intentionally corrupt the data in one of our dataframes. This simulates a malicious client that submits wrong and duplicated data in order to throw off the global model.
In this case, the reported data for some teams will have a lower target productivity and higher overtime hours worked. The data will also be replicated several times to make these teams carry more weight in the global model.
[ ]:
import pandas as pd
data_0 = datasets[0]['train'].copy()

# Halve the reported target productivity and quadruple the overtime hours
data_0['targeted_productivity'] = data_0['targeted_productivity']/2
data_0['targeted_productivity_1'] = data_0['targeted_productivity_1']/2
data_0['over_time'] = data_0['over_time']*4
data_0['over_time_1'] = data_0['over_time_1']*4

# Duplicate the corrupted rows 5 times to inflate their weight in training
data_0 = pd.concat([data_0]*5, axis=0)
data_0 = data_0.reset_index(drop=True)

datasets[0]['train'] = data_0
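To confirm the corruption took effect, you can compare the poisoned partition against a clean one. A minimal check, assuming partition 1 was left untouched:
[ ]:
# Compare poisoned partition 0 against untouched partition 1
print('poisoned mean targeted_productivity:', datasets[0]['train']['targeted_productivity'].mean())
print('clean mean targeted_productivity:', datasets[1]['train']['targeted_productivity'].mean())
print('poisoned rows:', len(datasets[0]['train']), ' clean rows:', len(datasets[1]['train']))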
Save data locally#
This will write the partitioned data to ./datasets/devices.
[ ]:
# Update this with the number of target devices you'll be running on
number_of_target_devices: int = 2
[ ]:
assert number_of_target_devices, "Please update the number_of_target_devices variable with the number of devices you'll be running on"

for idx in range(number_of_target_devices):
    save_data_locally(datasets[idx], idx)
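To check the files were written, you can walk the output folder. A minimal sketch; the exact layout under ./datasets depends on save_data_locally:
[ ]:
import os

# List everything written under ./datasets to confirm the save worked
for root, _, files in os.walk('./datasets'):
    for name in files:
        print(os.path.join(root, name))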
Sending datasets to devices#
This is only possible if you have SSH access to the target devices.
The next bit of code sends the data to your devices.
This assumes that you can connect to the devices by SSH from the Jupyter Notebook.
However, the code below also saves each dataset to the datasets folder in the current working directory. From there, you can download them and then manually upload them to the folder on your devices called /home/{user}/datasets. Just remember to remove the _{idx} suffix when saving onto the device.
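For example, a file saved locally for partition 0 would map to its device-side name like this. The file names here are hypothetical and assume the helpers use an _{idx} suffix:
[ ]:
from pathlib import Path

# Hypothetical example: garment_train_0.csv locally becomes
# garment_train.csv in /home/{user}/datasets on the device
local_file = Path('datasets/garment_train_0.csv')
device_name = local_file.name.replace('_0', '')
print(f'upload {local_file} as {device_name}')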
[ ]:
# replace with real values (can define any number of devices)
devices = {
    'test-device-0': {
        'ip': 'XXX.XXX.XXX.XXX',
        'user': 'octaipipe',
        'password': 'password'
    },
    'test-device-1': {
        'ip': 'XXX.XXX.XXX.XXX',
        'user': 'octaipipe',
        'password': 'password'
    },
    'test-device-2': {
        'ip': 'XXX.XXX.XXX.XXX',
        'user': 'octaipipe',
        'password': 'password'
    },
    'test-device-3': {
        'ip': 'XXX.XXX.XXX.XXX',
        'user': 'octaipipe',
        'password': 'password'
    }
}
online = 'online'

for device, creds in devices.items():
    # Build reusable ssh/scp command prefixes for each device
    devices[device]['ssh_command'] = f"sshpass -p {creds['password']} ssh -T -o StrictHostKeyChecking=no {creds['user']}@{creds['ip']}"
    devices[device]['scp_command'] = f"sshpass -p {creds['password']} scp -o StrictHostKeyChecking=no"

    # Check connectivity and create the datasets folder on the device
    result = ! {devices[device]['ssh_command']} echo {online}
    ! {devices[device]['ssh_command']} mkdir -p /home/{creds['user']}/datasets
    print(device, '\t', result)

print(f'\nIf any devices are not "{online}" - troubleshoot.')
Send the files over to the edge devices#
This uses scp to send the files over to the devices and stores them in the device user’s folder.
If this command does not work for you, for example because you do not have SSH access to the devices, you can replicate the steps send_df_as_csv_file goes through manually: for each dataset in the datasets list, save the train and test splits to CSV files called garment_train.csv and garment_test.csv, then send these to each device and save them in the filepath /home/{device_user}/datasets.
[ ]:
for idx, (device, creds) in enumerate(devices.items()):
    send_df_as_csv_file(datasets[idx], idx, device, creds)
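If the helper does not work in your environment, a minimal manual equivalent of those steps, reusing the scp_command strings built earlier (and assuming the same SSH access), might look like this:
[ ]:
# Manual alternative: write each split to CSV and scp it to the device
for idx, (device, creds) in enumerate(devices.items()):
    for split in ['train', 'test']:
        fname = f'garment_{split}.csv'
        datasets[idx][split].to_csv(fname, index=False)
        ! {creds['scp_command']} {fname} {creds['user']}@{creds['ip']}:/home/{creds['user']}/datasets/{fname}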
Setting up FL XGBoost#
Next we run the FL XGBoost experiment.
We will use the configuration file printed below for all runs.
NOTE: You will need to edit the following in this file:

- Device IDs in the device_ids list
- If you wish to use a specific version of OctaiPipe, change the latest tag in the image names to the one you would like to use
- Change the file_path in the input and evaluation data specs to that of your device user, e.g. if your user is linus, your path would be /home/linus/datasets/garment_train.csv. If you have different users for each device, you can add a new device to the devices list and specify the input data specs for that device separately, see the data documentation for FL
[ ]:
import yaml
# Load and display the federated learning config
with open("configs/xgboost_adversarial_config.yml", 'r') as file:
    fl_config = yaml.safe_load(file)
print(yaml.dump(fl_config, sort_keys=False))
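Rather than editing the YAML by hand, you can also patch it programmatically. A sketch only: it assumes device_ids sits at the top level of the config, so adjust the keys to match the file printed above:
[ ]:
# Hypothetical in-place edit; key locations are assumptions based on
# the config printed above - adjust them to match your file
fl_config['device_ids'] = ['test-device-0', 'test-device-1']

with open('configs/xgboost_adversarial_config.yml', 'w') as file:
    yaml.dump(fl_config, file, sort_keys=False)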
Running FL XGBoost without adversarial fortification#
First, we will run FL without any adversarial fortification, to see how the model performs when it faces problematic clients.
As adversarial fortification is enabled by default, we need to disable it by updating the FL strategy. This is done by setting the adv_fort section of the strategy as shown below.
[ ]:
import logging
import os
os.environ['OCTAIPIPE_DEBUG'] = 'true'
logging.basicConfig(level=logging.INFO, format='%(message)s')
# For more verbose logs, uncomment the following line
# logging.basicConfig(level=logging.DEBUG, format='%(message)s')
[ ]:
from octaipipe.federated_learning.run_fl import OctaiFL
federated_learning_config = 'configs/xgboost_adversarial_config.yml'
octaifl = OctaiFL(
    federated_learning_config,
    deployment_name='FL XGBoost Adv Fort',
    deployment_description='FL XGBoost tutorial on Garment dataset without adversarial fortification implementation'
)

strategy = {
    'min_available_clients': 4,
    'min_fit_clients': 3,
    'min_evaluate_clients': 3,
    'num_rounds': 5,
    'num_local_rounds': 5,
    'adv_fort': {
        'gain_factor': False,
        'eta': False,
        'config_check': False
    }
}

octaifl.strategy.set_config(strategy)
octaifl.strategy.get_config()
We can now run OctaiFL for XGBoost with no adversarial fortification.
[ ]:
octaifl.run()
Adding Adversarial Fortification#
This wouldn’t be a very good tutorial on adversarial fortification if we didn’t implement any adversarial fortification, though.
Therefore, we will re-run the FL experiment from the previous run, but add the adversarial fortification back in.
[ ]:
from octaipipe.federated_learning.run_fl import OctaiFL
federated_learning_config = 'configs/xgboost_adversarial_config.yml'
octaifl = OctaiFL(
    federated_learning_config,
    deployment_name='FL XGBoost Adv Fort',
    deployment_description='FL XGBoost tutorial on Garment dataset with adversarial fortification implementation'
)

strategy = {
    'min_available_clients': 4,
    'min_fit_clients': 3,
    'min_evaluate_clients': 3,
    'num_rounds': 5,
    'num_local_rounds': 5,
    'adv_fort': {
        'gain_factor': 1,
        'eta': True,
        'config_check': True
    }
}

octaifl.strategy.set_config(strategy)
octaifl.strategy.get_config()
And again, we run OctaiFL for XGBoost, but this time with the adversarial fortification.
[ ]:
octaifl.run()