RL Environments

As with other RL frameworks, you need to set up an environment before you start modeling in OctaiPipe. OctaiPipe lets you build custom environments based on the Gymnasium Env class.

Custom environments must inherit from the Gymnasium Env class and implement the __init__, reset, and step methods. A render method is also expected, but it can be a dummy implementation:

__init__: initializes variables, e.g. the action and observation spaces
reset: resets the environment to its initial state and returns the initial observation and an info dict
step: runs one time step in the environment for a given action and returns the observation, reward, terminated flag, truncated flag, and info dict
render: renders the environment if desired. Can be implemented as a dummy method

An example minimal env class can be found below:

Example gym Env class

import gymnasium as gym
from gymnasium import spaces
import numpy as np


class MyCustomEnv(gym.Env):
    metadata = {"render_modes": ["human"], "render_fps": 30}

    def __init__(self, obs_low: int = 0, obs_high: int = 1):
        super().__init__()
        self.observation_space = spaces.Box(low=obs_low, high=obs_high, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)

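    def reset(self, *, seed=None, options=None):
        # Seeds self.np_random so that episodes are reproducible
        super().reset(seed=seed)
        # Sample within the declared bounds and match the declared dtype
        space = self.observation_space
        self.state = self.np_random.uniform(space.low, space.high).astype(np.float32)
        info = {}
        return self.state, info

    def step(self, action):
        # Placeholder dynamics: the next state is random and the reward is constant
        space = self.observation_space
        self.state = self.np_random.uniform(space.low, space.high).astype(np.float32)
        reward = 1.0
        terminated = False
        truncated = False
        info = {}
        return self.state, reward, terminated, truncated, info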

    def render(self, mode='human'):
        # NOTE: render is always expected, but can be implemented as a dummy method
        print(f"State: {self.state}")

    def close(self):
        pass

In the FL config, we define the env within the model_params in the model_specs field. We specify the path to the env file with the path field and any parameters to give to the __init__ method in params:

 model_specs:
   type: frl
   name: test_model
   model_params:
     policy:
       name: PPO
       params: {}
     env:
       path: ./path/to/env_file.py
       params:
         obs_low: 0
         obs_high: 1

Data Loading for FRL

For some environments, it might be useful to read data from a file or database and use it for training the model. To do this, build the data loading into your custom environment file and use the data as any other object.

Remember that any file paths present on device need to be specified in the FL config, so if you are reading from a file, include it as an initialization parameter for your environment in the FL config file.

In the example below, we use the env parameter init_states_path to specify a file that we wish to use as initial states of the environment when we run the reset() method.

The example code and config are below.

Data loading env

Example gym Env class for data loading

import gymnasium as gym
from gymnasium import spaces
import numpy as np
import pandas as pd


class MyCustomEnv(gym.Env):
    metadata = {"render_modes": ["human"], "render_fps": 30}

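    def __init__(self, init_states_path: str):
        super().__init__()
        # Each row of the file is one initial state; each column is one feature
        self.init_states = pd.read_csv(init_states_path)
        self.cur_state = 0
        self.n_features = self.init_states.shape[1]
        # Box requires explicit bounds; use unbounded values unless the data has known limits
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(self.n_features,), dtype=np.float32
        )
        self.action_space = spaces.Discrete(2)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Use the next row as the initial state, wrapping around at the end of the file
        self.state = self.init_states.iloc[self.cur_state].to_numpy(dtype=np.float32)
        self.cur_state += 1
        if self.cur_state >= self.init_states.shape[0]:
            self.cur_state = 0
        info = {}
        return self.state, info

    def step(self, action):
        # Placeholder dynamics: a random next state with one value per feature
        self.state = self.np_random.random(self.n_features).astype(np.float32)
        reward = 1.0
        terminated = False
        truncated = False
        info = {}
        return self.state, reward, terminated, truncated, info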

    def render(self, mode='human'):
        print(f"State: {self.state}")

    def close(self):
        pass

Data loading config

 model_specs:
   type: FRL
   load_existing: false
   name: test_model
   model_params:
     policy:
       name: PPO
       params: {}
     env:
       path: ./path/to/env_file.py
       params:
         init_states_path: ./path/to/file.csv

Using environments in training

Each FL client trains the RL model by running the learn() method of the policy class. The RL environment is set when the policy class is initialized, and no arguments are handed to the learn() method at runtime, so any logic pertaining to training should be defined with this in mind.

Evaluation environment setup

RL agents are evaluated using the method below. Note that the environment defined in the env section of the FL config is handed to the test method at runtime. If the environment has attributes named n_test_episodes, render, or max_steps_per_test_episode, their values are used in evaluation; otherwise the defaults are 5 episodes, no rendering, and 10000 steps per episode.

RL eval method

def test_model(self, test_env: gym.Env) -> dict:
    if hasattr(test_env, "n_test_episodes"):
        n_episodes: int = test_env.n_test_episodes
    else:
        n_episodes: int = 5

    if hasattr(test_env, "render"):
        render: bool = test_env.render
    else:
        render: bool = False

    if hasattr(test_env, "max_steps_per_test_episode"):
        max_steps_per_test_episode: int = test_env.max_steps_per_test_episode
    else:
        max_steps_per_test_episode: int = 10000

    rewards, lengths = [], []
    for _ in range(n_episodes):
        reset_out = test_env.reset()
        obs = reset_out[0] if isinstance(reset_out, tuple) else reset_out
        done = False
        total_r: float = 0.0
        steps: int = 0
        while not done and steps < max_steps_per_test_episode:
            raw_action, _ = self.predict(obs, deterministic=True)
            action = self.format_action(test_env, raw_action)
            obs, r, done, info = self.step_and_unpack(test_env, action)
            total_r += r
            steps += 1
            if render:
                test_env.render()
        rewards.append(total_r)
        lengths.append(steps)

    if not rewards or not lengths:
        raise RuntimeError('No rewards or lengths retrieved, FRL '
                           'evaluation did not complete successfully. Got\n'
                           f'rewards: {rewards}, lengths: {lengths}')
    return {
        "mean_reward": float(np.mean(rewards)),
        "std_reward": float(np.std(rewards)) if len(rewards) > 1 else np.nan,
        "min_reward": float(np.min(rewards)),
        "max_reward": float(np.max(rewards)),
        "mean_length": float(np.mean(lengths)),
    }