RL Policies#

In Reinforcement Learning, the policy is the set of rules an agent uses to determine its actions in an environment: it maps the agent's current state to an action.

As an example, let’s imagine an RL agent that suggests whether a person should bring an umbrella on a walk. A simple policy might look at the current weather and apply if-else logic: “if it is raining, suggest bringing an umbrella, otherwise do not”.
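The umbrella policy above can be sketched as a plain function from state to action (the state keys and return values here are illustrative, not part of OctaiPipe):

```python
def umbrella_policy(state):
    """Rule-based policy: map the observed state to an action.

    `state` is a dict of observed variables; this simple policy
    only inspects one of them.
    """
    if state["weather"] == "raining":
        return "bring umbrella"
    return "no umbrella"

print(umbrella_policy({"weather": "raining"}))  # bring umbrella
print(umbrella_policy({"weather": "sunny"}))    # no umbrella
```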

The example above is a simple rule-based policy with a single input variable. In practice, policies typically use a range of variables, and the policy itself is either a more complex set of rules or a machine learning model that maps observations to suggested actions.

Policies in OctaiPipe#

To configure policies in OctaiPipe, use the policy field under model_params in the FL config. In the example below, we select the PPO policy and use its default parameters by setting params to an empty dictionary.

model_specs:
  type: frl
  name: test_model
  model_params:
    policy:
      name: PPO
      params: {}
    env:
      path: ./path/to/env_file.py
      params: {}
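The env path points to a file defining the environment the policy is trained against. Since OctaiPipe's RL policies build on stable-baselines3, which expects Gymnasium-style environments, a file like env_file.py would typically define a class following the reset/step contract. The sketch below mimics that interface in plain Python so it stays dependency-free; a real environment would subclass gymnasium.Env and declare observation_space and action_space, and all names here are illustrative assumptions, not OctaiPipe APIs:

```python
import random


class UmbrellaEnv:
    """Toy weather environment following the Gymnasium-style contract:
    reset() returns (observation, info), and step() returns
    (observation, reward, terminated, truncated, info)."""

    RAINING, SUNNY = 0, 1   # observations
    BRING, LEAVE = 0, 1     # actions

    def __init__(self, rain_probability=0.3):
        self.rain_probability = rain_probability
        self.weather = self.SUNNY

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.weather = self._sample_weather()
        return self.weather, {}

    def step(self, action):
        # Reward the agent for matching the umbrella to the weather.
        correct = (action == self.BRING) == (self.weather == self.RAINING)
        reward = 1.0 if correct else -1.0
        self.weather = self._sample_weather()
        # One-step episodes: always terminate, never truncate.
        return self.weather, reward, True, False, {}

    def _sample_weather(self):
        return self.RAINING if self.rng.random() < self.rain_probability else self.SUNNY


env = UmbrellaEnv(rain_probability=0.0)
obs, _ = env.reset(seed=42)                       # always SUNNY at probability 0
_, reward, terminated, _, _ = env.step(env.LEAVE)
# reward == 1.0: leaving the umbrella at home was the right call
```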

The following RL policies are implemented in OctaiPipe:

Proximal Policy Optimization#

Proximal Policy Optimization (PPO) is a popular reinforcement learning algorithm introduced by OpenAI in 2017. It’s known for its balance between performance and simplicity, and is widely used in many modern RL applications.
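For reference, the central idea of PPO is its clipped surrogate objective, which bounds how far a single update can move the policy away from the one that collected the data (the clip_range parameter in the config below corresponds to the clipping threshold \(\epsilon\)):

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
      \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here \(\hat{A}_t\) is the advantage estimate at timestep \(t\) and \(r_t(\theta)\) is the probability ratio between the new and old policies.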

PPO is selected by setting the policy name to PPO in your FL config. Behind the scenes, OctaiPipe uses a version of stable-baselines3’s (SB3) implementation of PPO. The example config below includes all arguments that PPO can take; for further information on each argument, see the SB3 documentation.

model_specs:
  type: FRL
  load_existing: false
  name: test_model
  model_params:
    policy:
      name: PPO
      params:
        clip_range: 0.5
        n_steps: 128
        batch_size: 128
        n_epochs: 10
        gamma: 0.99
        gae_lambda: 0.95
        ent_coef: 0.05
        vf_coef: 0.5
        max_grad_norm: 0.5
        verbose: 1
        learning_rate: 0.5
        total_timesteps: 10
    env:
      path: ./path/to/env_file.py
      params: {}
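Two of these parameters work together in Generalized Advantage Estimation: gamma discounts future rewards, and gae_lambda trades bias against variance in the advantage estimates. A minimal pure-Python sketch of how GAE combines them (the function name is illustrative, not an OctaiPipe or SB3 API):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode.

    rewards: rewards r_0 .. r_{T-1}
    values:  value estimates V(s_0) .. V(s_T), length T + 1;
             the last entry bootstraps the value of the final state.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error at timestep t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


# With gamma = 1 and lam = 1, GAE reduces to the Monte Carlo advantage:
# the sum of remaining rewards minus the current value estimate.
adv = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
# adv == [3.0, 2.0, 1.0]
```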