Federated XGBoost Implementation with OctaiPipe#

Note

Before proceeding, note that OctaiPipe does not currently support running FL-XGBoost on arm32 devices.

Introduction#

OctaiPipe has implemented XGBoost within a federated learning (FL) framework, enabling clients to train multiple trees using XGBoost on local datasets before aggregating these trees on a server. This method allows for multiple trees to be sent to the server for aggregation in each iteration, differing from other implementations where only a single tree is trained and sent per iteration.

Additionally, we have implemented a normalized learning rate option to account for varying dataset sizes across clients, aiming to improve model performance and fairness.
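To make the idea concrete, the sketch below shows one plausible way a normalized learning rate can work: each client's effective eta is scaled by its share of the overall training data, so that clients contribute in proportion to how much data they hold. This is an illustrative assumption about the general approach, not a description of OctaiPipe's exact internal formula, and the helper function is hypothetical.

def normalized_eta(base_eta: float, client_samples: int, total_samples: int) -> float:
    """Hypothetical illustration: scale a client's learning rate by its share
    of the total training data. OctaiPipe applies its own normalization
    internally when normalized_learning_rate is set to true."""
    return base_eta * client_samples / total_samples


# Three clients holding 1,000, 3,000 and 6,000 rows of a 10,000-row dataset
for n_rows in (1_000, 3_000, 6_000):
    print(normalized_eta(0.15, n_rows, 10_000))  # 0.015, 0.045, 0.09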

Advantages of XGBoost#

XGBoost is preferred for certain scenarios due to its:

  • Ease of Use: Less complex setup compared to neural networks, making it accessible for a wide range of applications.

  • Efficiency and Cost Savings: Demonstrates faster training times and reduced computational costs.

  • Explainability: Facilitates easier extraction of feature importance, enhancing model interpretability.

  • Performance: Often outperforms neural network-based models on tabular datasets, especially when data size is medium and features are sparse or non-IID (not independently and identically distributed).

  • Robustness: Effectively handles missing values, a challenge for many neural network models.

Federated XGBoost Process#

In the federated setting, the training process starts with each client training local XGBoost models. These local models, represented by a set number of trees, are then aggregated by the server into a global model. This model is distributed to clients for further training in subsequent rounds, progressively enhancing the model’s accuracy. The cycle of local training, aggregation, and distribution continues until a predefined number of iterations is reached, optimizing the global model while maintaining data privacy.
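The pseudocode below sketches this cycle. The function names (train_local_trees, aggregate_trees, send_to_clients) are illustrative placeholders for the steps described above, not OctaiPipe APIs.

# Illustrative pseudocode of the federated XGBoost training cycle.
# The helper functions are hypothetical placeholders, not OctaiPipe functions.
global_model = None

for round_idx in range(num_rounds):
    client_updates = []
    for client in clients:
        # Each client continues boosting from the current global model,
        # adding num_local_rounds new trees trained on its local data.
        local_trees = train_local_trees(client.data, global_model, num_local_rounds)
        client_updates.append(local_trees)

    # The server aggregates the clients' trees into a single global ensemble.
    global_model = aggregate_trees(global_model, client_updates)

    # The updated global model is sent back to the clients for the next round.
    send_to_clients(clients, global_model)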

The best way to understand how to use Federated XGBoost is to run the notebook tutorial in our Jupyter image: Tutorials/FLXGBoost-tutorial/flxgboost-tutorial.

Configurations and Methodologies#

Example config:

name: federated_learning

infrastructure:
  device_ids: [demo_device_0, demo_device_1]  # Add desired devices
  image_name: "octaipipe.azurecr.io/octaipipe-all_data_loaders:2.1.0"  # Set OctaiPipe image here
  server_image: "octaipipe.azurecr.io/fl_server:2.1.0"  # Set FL server image here

strategy:
  num_rounds: 20  # Number of federated training rounds
  num_local_rounds: 2  # Local boosting rounds each client trains per federated round
  normalized_learning_rate: true  # Normalize the learning rate across clients with different dataset sizes

input_data_specs:
  devices:
  - device: default
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./data/influx_query_def.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: test-bucket
      measurement: sensors-raw
      tags: {}
    data_converter: {}

evaluation_data_specs:
  devices:
  - device: default
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./data/influx_query_eval_def.txt
    query_values:
      start: "2022-11-10T00:00:00.000Z"
      stop: "2022-11-11T00:00:00.000Z"
      bucket: test-bucket
      measurement: sensors-raw
      tags: {}
    data_converter: {}

model_specs:
  type: base_xgboost
  load_existing: false
  name: test_xgboost
  model_params:
    objective: reg:squarederror
    eta: 0.15
    max_depth: 8
    eval_metric: auc
    nthread: 16
    num_parallel_tree: 1
    subsample: 1
    tree_method: hist

run_specs:
  target_label: RUL
  cycle_id: "Machine number"
  backend: xgboost  # pytorch, SGD, xgboost

There are two configurable features of the FL-XGBoost implementation which will affect the training process:

  1. The XGBoost model parameters used by the clients to train their local models. These parameters are set in the model_specs section of the federated_learning config file and detailed below.

  2. The strategy used by the server to handle and aggregate models from clients. This can also be set in the federated_learning config file, or configured after the config has been loaded, following the example shown here: Customise Strategy Parameters

XGBoost model parameters#

https://xgboost.readthedocs.io/en/release_1.7.0/parameter.html#learning-task-parameters

Options set here will be passed to the xgboost.Booster object when the model is initialised. This gives you greater control over the model training process, catering to your specific learning task.

  • objective: reg:squarederror - Specifies the learning task and the corresponding learning objective. Here, it is set for regression tasks with squared error as the loss function. - Ideal for regression problems where the goal is to minimize the squared differences between the predicted and actual values.

  • eta: 0.15 (alias: learning_rate) - Controls the step size shrinkage used in each update, making the boosting process more conservative to prevent overfitting. - Set here lower than the default (0.3) to make training more conservative and potentially achieve better generalization on unseen data.

  • max_depth: 8 - Determines the maximum depth of a tree. Deeper trees can model more complex patterns but might overfit. - Increased from the default (6) to allow the model to capture more complex relationships in the data without excessively risking overfitting.

  • eval_metric: auc - Evaluation metric for validation data. - AUC (Area Under the Curve) is effective for binary classification problems, particularly on imbalanced datasets; for a regression objective such as reg:squarederror, a metric like rmse would be the more typical choice.

  • nthread: 16 - Number of parallel threads used to run XGBoost. - Increased to speed up computation. The exact value can be adjusted based on the machine’s CPU cores available for parallel processing.

  • num_parallel_tree: 1 - Number of trees to grow per iteration. Used for boosted random forests. - Set to 1 for standard boosting. Increasing this creates a forest of trees for each iteration and can be used for models like random forests.

  • subsample: 1 - Fraction of training instances to be randomly sampled for building trees, to prevent overfitting. - Set to 1 to use all data, indicating confidence in the dataset’s representativeness and the model’s resilience against overfitting.

  • tree_method: hist - Specifies the tree construction algorithm used in XGBoost. Options include exact, approx, and hist, among others. - Hist is chosen for faster computation time compared to the exact method, especially suitable for datasets with a large number of observations or features.
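For reference, the model_params block corresponds to a standard XGBoost parameter dictionary. The standalone sketch below, using the plain xgboost Python API on synthetic data (purely illustrative, outside OctaiPipe), shows how such parameters drive local boosting and how training can be continued from an existing model, which is the building block of the federated rounds described above.

import numpy as np
import xgboost as xgb

# Parameter dictionary mirroring the model_params block above
# (eval_metric omitted because no evaluation set is passed here).
params = {
    "objective": "reg:squarederror",
    "eta": 0.15,
    "max_depth": 8,
    "nthread": 16,
    "num_parallel_tree": 1,
    "subsample": 1,
    "tree_method": "hist",
}

# Synthetic data standing in for a client's local dataset.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.normal(size=500)
dtrain = xgb.DMatrix(X, label=y)

# Grow two boosting rounds, then continue from the existing booster,
# analogous to a client adding trees on top of the global model.
booster = xgb.train(params, dtrain, num_boost_round=2)
booster = xgb.train(params, dtrain, num_boost_round=2, xgb_model=booster)
print(booster.num_boosted_rounds())  # 4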

Strategy options#

https://flower.ai/docs/framework/how-to-use-strategies.html

This defines the learning method for the Federated Learning process.

“It is basically the federated learning algorithm that runs on the server. Strategies decide how to sample clients, how to configure clients for training, how to aggregate updates, and how to evaluate models.” - https://flower.ai/docs/framework/how-to-use-strategies.html

Some of the options refer to OctaiPipe-specific functionality; for FL-XGBoost training these are num_local_rounds and normalized_learning_rate. Details on all strategy options can be found here: strategy
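As a point of reference only, the snippet below shows roughly what instantiating a strategy looks like in plain Flower. Within OctaiPipe you do not write this code yourself; the strategy is driven by the strategy section of the config, and FedAvg here is a generic example rather than the specific aggregation strategy OctaiPipe uses for XGBoost.

import flwr as fl

# A generic Flower strategy decides how clients are sampled each round,
# how they are configured, and how their updates are aggregated.
strategy = fl.server.strategy.FedAvg(
    fraction_fit=1.0,         # sample every available client for training
    min_available_clients=2,  # wait until at least two clients are connected
)

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=20),
    strategy=strategy,
)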