DDPG (Deep Deterministic Policy Gradient) with TianShou
DDPG (Deep Deterministic Policy Gradient) [LHP+16] is a popular RL algorithm for continuous control. In this tutorial, we show, step by step, how to write neural networks and use DDPG to train them with TianShou.
TianShou is built following a very simple idea: Deep RL still trains deep neural nets with some loss functions or optimizers on minibatches of data. The only differences between Deep RL and supervised learning are the RL-specific loss functions/optimizers and the acquisition of training data. Therefore, we wrap up the RL-specific parts in TianShou while still exposing the TensorFlow-level interfaces for you to train your neural policies/value functions. As a result, doing Deep RL with TianShou is almost as simple as doing supervised learning with TensorFlow.
We now demonstrate a typical routine of doing Deep RL with TianShou.
Make an Environment
First of all, you have to make an environment for your agent to act in. For the environment interface we follow the convention of OpenAI Gym. Just do
pip install gym
in your terminal if you haven’t installed it yet, and you will be able to run the simple example scripts provided by us.
Then, in your Python code, simply import gym and TianShou, and make the environment:

import gym
import tianshou as ts

env = gym.make('Pendulum-v0')
Pendulum-v0 is a simple environment with a continuous action space, to which DDPG applies. You have to identify whether the action space is continuous or discrete and apply an eligible algorithm. DQN [MKS+15], for example, can only be applied to discrete action spaces, while almost all other policy gradient methods can be applied to both, depending on the probability distribution over actions.
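For example, you can check the action space directly with gym's spaces module (a minimal sketch, not part of the tutorial script):

import gym
from gym import spaces

env = gym.make('Pendulum-v0')
if isinstance(env.action_space, spaces.Box):
    print('continuous action space, shape:', env.action_space.shape)  # DDPG applies
elif isinstance(env.action_space, spaces.Discrete):
    print('discrete action space, n =', env.action_space.n)  # e.g. DQN applies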
Build the Networks
As in supervised learning, we proceed to build the neural networks with TensorFlow. Contrary to existing Deep RL libraries (keras-rl, rllab, TensorForce), which only accept a configuration specification of network layers and neurons, TianShou naturally supports all TensorFlow APIs when building the neural networks. In fact, the networks in TianShou are still built with direct TensorFlow APIs without any encapsulation, making it fairly easy to use dropout, batch norm, skip connections and other advanced neural architectures.
As usual, we start with placeholders that define the network input:
import tensorflow as tf

observation_dim = env.observation_space.shape
action_dim = env.action_space.shape

observation_ph = tf.placeholder(tf.float32, shape=(None,) + observation_dim)
action_ph = tf.placeholder(tf.float32, shape=(None,) + action_dim)
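For Pendulum-v0 the observation space is 3-dimensional and the action space 1-dimensional, so the placeholders above have shapes (None, 3) and (None, 1). A quick sanity check:

print(observation_dim)  # (3,): [cos(theta), sin(theta), theta_dot]
print(action_dim)       # (1,): the torque applied to the pendulum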
Next, we build MLPs for this simple environment. DDPG requires both an actor (the deterministic policy) and a critic (\(Q(s, a)\)):
# actor: a deterministic policy mapping observation to action
net = tf.layers.dense(observation_ph, 32, activation=tf.nn.relu)
net = tf.layers.dense(net, 32, activation=tf.nn.relu)
action = tf.layers.dense(net, action_dim[0], activation=None)

# critic: Q(s, a), taking the concatenated observation and action as input
action_value_input = tf.concat([observation_ph, action_ph], axis=1)
net = tf.layers.dense(action_value_input, 64, activation=tf.nn.relu)
net = tf.layers.dense(net, 64, activation=tf.nn.relu)
action_value = tf.layers.dense(net, 1, activation=None)
However, DDPG also requires slowly-tracking copies of these networks as the "target networks". Target networks are common in RL algorithms: many off-policy algorithms explicitly require them to stabilize training [MKS+15][LHP+16], and they also simplify the construction of other objectives such as the probability ratio or the KL divergence between the new and old action distributions [SLA+15][SWD+17].
Due to the universality of such old copies of the neural networks (we term them "old nets", considering that not all such networks are used to compute targets), we introduce the first paradigm of TianShou:
All parts of the TensorFlow graph construction, except placeholder instantiation,
have to be wrapped in a single parameter-less Python function by you.
The function must return a pair, (policy head, value head),
with the unnecessary head (if any) set to None.
Note
This paradigm also prescribes the return value of the network function. Such a two-"head" architecture reflects the indispensable role of policies and value functions in RL, and is supported by the use of both networks in, for example, [MBM+16][LHP+16][SSS+17]. This paradigm also allows arbitrary layer sharing between the policy and value networks.
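For instance, a network function for an algorithm that needs only a policy (a hypothetical sketch, not used in this tutorial) would simply return None as the value head:

def my_policy_only_network():
    # hypothetical example of the (policy head, value head) convention
    net = tf.layers.dense(observation_ph, 32, activation=tf.nn.relu)
    action = tf.layers.dense(net, action_dim[0], activation=None)
    return action, None  # no value head needed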
For DDPG, TianShou will then call your network function to create the network graphs and, optionally, the old nets, according to a single parameter set by you, as in:
def my_network():
    net = tf.layers.dense(observation_ph, 32, activation=tf.nn.relu)
    net = tf.layers.dense(net, 32, activation=tf.nn.relu)
    action = tf.layers.dense(net, action_dim[0], activation=None)

    action_value_input = tf.concat([observation_ph, action_ph], axis=1)
    net = tf.layers.dense(action_value_input, 64, activation=tf.nn.relu)
    net = tf.layers.dense(net, 64, activation=tf.nn.relu)
    action_value = tf.layers.dense(net, 1, activation=None)

    return action, action_value

actor = ts.policy.Deterministic(my_network, observation_placeholder=observation_ph,
                                has_old_net=True)
critic = ts.value_function.ActionValue(my_network, observation_placeholder=observation_ph,
                                       action_placeholder=action_ph, has_old_net=True)
You pass the function handle my_network to TianShou's policy and value network wrappers, along with the corresponding placeholders. The has_old_net argument controls the construction of the old net and is False by default. When set to True, as in this tutorial, the actor and critic will each automatically create two sets of networks, the current network and the old net, and manage them together.
The only behavior the network wrappers provide on old nets is sync_weights(), which copies the weights of the current network to the old net. Although this is sufficient for other scenarios with old nets [MKS+15][SLA+15][SWD+17], DDPG proposes soft updates on the old nets. Therefore, TianShou provides an additional utility for such soft updates:
soft_update_op = ts.get_soft_update_op(1e-2, [actor, critic])
For detailed usage please refer to the API doc of tianshou.core.utils.get_soft_update_op(). This utility function gives you runnable TensorFlow ops that perform the soft update, i.e., you can simply do

sess.run(soft_update_op)

whenever you want a soft update.
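Conceptually, the soft update of [LHP+16] sets the old-net weights to \(\theta' \leftarrow \tau\theta + (1 - \tau)\theta'\). As a rough sketch of what the returned ops amount to (assuming current_vars and old_vars are matching lists of current-network and old-net variables; these names are hypothetical), one could build them by hand in plain TensorFlow:

tau = 1e-2
manual_soft_update_op = tf.group(*[
    tf.assign(old, tau * new + (1. - tau) * old)  # slowly track the current weights
    for old, new in zip(old_vars, current_vars)
])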
Construct Optimization Methods
One of the two key differences between Deep RL and supervised learning is the optimization algorithm. Contrary to existing Deep RL projects (OpenAI Baselines, Coach, keras-rl, rllab, TensorForce), which wrap all the optimization operations in one class, we provide optimization techniques only at the least necessary level of encapsulation, allowing natural combinations of, for example, native TensorFlow optimizers and gradient clipping operations. We identify three levels of optimization encapsulation, namely loss, gradient and optimizer, and implement each RL technique at one of these levels.
TianShou's loss resembles tf.losses, and to apply an L2 loss on the critic in DDPG you could simply do:
critic_loss = ts.losses.value_mse(critic)
critic_optimizer = tf.train.AdamOptimizer(1e-3)
critic_train_op = critic_optimizer.minimize(critic_loss, var_list=list(critic.trainable_variables))
Note
The trainable_variables property of network wrappers returns a Python set rather than a Python list, to handle cases where the actor and critic share layers. You therefore have to explicitly convert it to a list, as in the var_list argument above.
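Concretely, this is the standard DDPG critic objective [LHP+16]: for a minibatch of \(N\) transitions,

\[
L = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - Q(s_i, a_i)\big)^2,
\]

where the target \(y_i\) is produced by the data-processing step described in the next section.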
The deterministic policy gradient [LHP+16] is difficult to conceptualize as the gradient of a loss function under TianShou's paradigm, so we wrap it up at the gradient level, which directly computes and returns gradients just as tf.train.Optimizer.compute_gradients() does. It can then be seamlessly combined with tf.train.Optimizer.apply_gradients() to optimize the actor:
dpg_grads_vars = ts.opt.DPG(actor, critic)
actor_optimizer = tf.train.AdamOptimizer(1e-3)
actor_train_op = actor_optimizer.apply_gradients(dpg_grads_vars)
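Because ts.opt.DPG returns (gradient, variable) pairs in the same format as tf.train.Optimizer.compute_gradients(), you can also insert native TensorFlow gradient manipulation before applying them. For example, a sketch of adding gradient clipping (not required for this tutorial):

clipped_grads_vars = [(tf.clip_by_norm(grad, 1.0), var)  # clip each gradient to norm <= 1
                      for grad, var in dpg_grads_vars if grad is not None]
actor_train_op = actor_optimizer.apply_gradients(clipped_grads_vars)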
Specify Data Acquisition
The other key difference between Deep RL and supervised learning is the data acquisition process. Contrary to existing Deep RL projects (OpenAI Baselines, Coach, keras-rl, rllab, TensorForce), which mix up data acquisition and all the optimization operations in one class, we separate it from optimization, facilitating more opportunities for combination.
First, we instantiate a replay buffer to store the off-policy experience:
data_buffer = ts.data.VanillaReplayBuffer(capacity=10000, nstep=1)
All data buffers in TianShou store only the raw data of each episode, i.e., frames of data in the canonical RL tuple form (observation, action, reward, done_flag). Such raw data have to be processed before being fed to the optimization algorithms, so we specify the processing functions in a Python list:
process_functions = [ts.data.advantage_estimation.ddpg_return(actor, critic)]
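In DDPG [LHP+16], this return is the one-step bootstrapped target computed with the old nets (the target actor \(\mu'\) and target critic \(Q'\)):

\[
y_t = r_t + \gamma\, Q'\big(s_{t+1}, \mu'(s_{t+1})\big),
\]

where \(\gamma\) is the discount factor and the bootstrap term is dropped at terminal states. This is the \(y_i\) that the critic loss above regresses \(Q(s_i, a_i)\) towards.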
We are now ready to fully specify the data acquisition process:
data_collector = ts.data.DataCollector(
    env=env,
    policy=actor,
    data_buffer=data_buffer,
    process_functions=process_functions,
    managed_networks=[actor, critic]
)
The process_functions should be a list of Python callables; you could also implement your own following the APIs in tianshou.data.advantage_estimation. You should also pass a Python list of network wrappers, managed_networks (in this case [actor, critic]), to DataCollector, which brings up the second paradigm of TianShou:
All canonical RL placeholders (observation, action, return/advantage)
are automatically managed by TianShou.
You only have to create at most the placeholders for observation and action.
Other placeholders, such as the dropout ratio and batch-norm phase, should still be managed by you. We provide an entry my_feed_dict in all functions that may involve such cases.
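For instance, if your network used a dropout placeholder, you would supply its value yourself, either through my_feed_dict or simply by adding it to the returned feed dict before running an op inside the training session shown in the next section. A hypothetical sketch (keep_prob_ph is not part of this tutorial's network):

keep_prob_ph = tf.placeholder(tf.float32, shape=())  # hypothetical placeholder, e.g. for tf.nn.dropout inside my_network()

feed_dict = data_collector.next_batch(batch_size)
feed_dict[keep_prob_ph] = 0.9  # managed by you, not by TianShou
sess.run(critic_train_op, feed_dict=feed_dict)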
Start Training!
Finally, we are all set. Let the training begin:
import time

batch_size = 32  # minibatch size sampled from the replay buffer at each training step

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())

    # initialize the old nets by copying the current network weights
    actor.sync_weights()
    critic.sync_weights()

    start_time = time.time()
    data_collector.collect(num_timesteps=5000)  # warm-up
    for i in range(int(1e8)):
        # collect data
        data_collector.collect(num_timesteps=1, episode_cutoff=200)

        # train critic
        feed_dict = data_collector.next_batch(batch_size)
        sess.run(critic_train_op, feed_dict=feed_dict)

        # recompute action
        data_collector.denoise_action(feed_dict)

        # train actor
        sess.run(actor_train_op, feed_dict=feed_dict)

        # update target networks
        sess.run(soft_update_op)

        # test every 1000 training steps
        if i % 1000 == 0:
            print('Step {}, elapsed time: {:.1f} min'.format(i, (time.time() - start_time) / 60))
            ts.data.test_policy_in_env(actor, env, num_episodes=5, episode_cutoff=200)
Note that, to optimize the actor in DDPG, we have to use the noiseless action computed by the current actor rather than the sampled action used during interaction with the environment, hence the call to data_collector.denoise_action(feed_dict) before running actor_train_op.
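This is because the deterministic policy gradient [LHP+16] differentiates the critic at the action produced by the current policy:

\[
\nabla_\theta J \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_a Q(s_i, a)\Big|_{a=\mu_\theta(s_i)} \nabla_\theta \mu_\theta(s_i),
\]

so the feed dict must contain \(a = \mu_\theta(s_i)\) rather than the noisy exploration action stored in the buffer.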
We've made the effort so that the training process in TianShou also resembles conventional supervised learning with TensorFlow: our DataCollector automatically builds the feed_dict for the canonical RL placeholders.
Enjoy and have fun!
References
[LHP+16] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
[MKS+15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, and others. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[SLA+15] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, 2015.
[SWD+17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[MBM+16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
[SSS+17] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, and others. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.