DDPG (Deep Deterministic Policy Gradient) with TianShou
========================================================

DDPG (Deep Deterministic Policy Gradient) :cite:`lillicrap2015continuous` is a popular RL algorithm for
continuous control. In this tutorial, we show, step by step, how to write neural networks and use DDPG
to train the networks with Tianshou.

.. The full script is at

TianShou is built following a very simple idea: Deep RL still trains deep neural nets with some loss
functions or optimizers on minibatches of data. The only differences between Deep RL and supervised
learning are the RL-specific loss functions/optimizers and the acquisition of training data. Therefore,
we wrap up the RL-specific parts in TianShou, while still exposing the TensorFlow-level interfaces for
you to train your neural policies/value functions. As a result, doing Deep RL with TianShou is almost as
simple as doing supervised learning with TensorFlow. We now demonstrate a typical routine of doing Deep
RL with TianShou.

Make an Environment
-------------------

First of all, you have to make an environment for your agent to act in. For the environment interfaces
we follow the convention of OpenAI Gym. Just do::

    pip install gym

in your terminal if you haven't installed it yet, and you will be able to run the simple example scripts
provided by us. Then, in your Python code, import the packages (Gym, TensorFlow, and TianShou) and make
the environment::

    import gym
    import tensorflow as tf

    import tianshou as ts

    env = gym.make('Pendulum-v0')

Pendulum-v0 is a simple environment with a continuous action space, for which DDPG applies. You have to
identify whether the action space is continuous or discrete and apply eligible algorithms.
DQN :cite:`mnih2015human`, for example, can only be applied to discrete action spaces, while almost all
other policy gradient methods can be applied to both, depending on the probability distribution over
actions.

Build the Networks
------------------

As in supervised learning, we proceed to build the neural networks with TensorFlow. Contrary to existing
Deep RL libraries (keras-rl, rllab, TensorForce), which accept only a config specification of network
layers and neurons, TianShou naturally supports **all** TensorFlow APIs when building the neural
networks. In fact, the networks in TianShou are still built with direct TensorFlow APIs without any
encapsulation, making it fairly easy to use dropout, batch-norm, skip-connections and other advanced
neural architectures.

As usual, we start with placeholders that define the network input::

    observation_dim = env.observation_space.shape
    action_dim = env.action_space.shape

    observation_ph = tf.placeholder(tf.float32, shape=(None,) + observation_dim)
    action_ph = tf.placeholder(tf.float32, shape=(None,) + action_dim)

And build MLPs for this simple environment. DDPG requires both an actor (the deterministic policy) and a
critic (:math:`Q(s, a)`)::

    net = tf.layers.dense(observation_ph, 32, activation=tf.nn.relu)
    net = tf.layers.dense(net, 32, activation=tf.nn.relu)
    action = tf.layers.dense(net, action_dim[0], activation=None)

    action_value_input = tf.concat([observation_ph, action_ph], axis=1)
    net = tf.layers.dense(action_value_input, 64, activation=tf.nn.relu)
    net = tf.layers.dense(net, 64, activation=tf.nn.relu)
    action_value = tf.layers.dense(net, 1, activation=None)
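Since the networks are written directly in TensorFlow, nothing stops you from going beyond plain MLPs.
Below is a minimal sketch, not used in the rest of this tutorial, of how dropout and a skip-connection
could be mixed into the actor; the placeholder ``keep_prob_ph`` is our own addition and is not one of
the placeholders TianShou manages::

    # hypothetical variant of the actor MLP, only to illustrate that arbitrary
    # TensorFlow ops can be used; the tutorial itself sticks to the plain MLPs above
    keep_prob_ph = tf.placeholder(tf.float32, shape=())  # extra, user-managed placeholder

    hidden = tf.layers.dense(observation_ph, 32, activation=tf.nn.relu)
    hidden = tf.nn.dropout(hidden, keep_prob=keep_prob_ph)                # dropout
    hidden = hidden + tf.layers.dense(hidden, 32, activation=tf.nn.relu)  # skip-connection
    action_variant = tf.layers.dense(hidden, action_dim[0], activation=None)

Such extra placeholders have to be fed by you rather than by TianShou; the ``my_feed_dict`` entry
discussed in the data acquisition section below is how you would do that.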
However, DDPG also requires a slowly-tracking copy of these networks as the "target networks".
Target networks are common in RL algorithms: many off-policy algorithms explicitly require them to
stabilize training :cite:`mnih2015human,lillicrap2015continuous`, and they also simplify the
construction of other objectives such as the probability ratio or the KL divergence between the new and
old action distributions :cite:`schulman2015trust,schulman2017proximal`. Due to such universal use of an
old copy of the neural networks (we term it "old net", considering that not all such networks are used
to compute targets), we introduce the first paradigm of TianShou::

    All parts of the TensorFlow graph construction, except placeholder instantiation, have to be wrapped
    in a single parameter-less Python function by you. The function must return a doublet,
    (policy head, value head), with the unnecessary head (if any) set to ``None``.

.. note::

    This paradigm also prescribes the return value of the network function. Such an architecture with
    two "heads" is established by the indispensable role of policies and value functions in RL, and is
    also supported by the use of both networks in, for example,
    :cite:`mnih2016asynchronous,lillicrap2015continuous,silver2017mastering`. This paradigm also allows
    arbitrary layer sharing between the policy and value networks.

TianShou will then call this function to create the network graphs and optionally the "old net"s
according to a single parameter set by you, as in::

    def my_network():
        net = tf.layers.dense(observation_ph, 32, activation=tf.nn.relu)
        net = tf.layers.dense(net, 32, activation=tf.nn.relu)
        action = tf.layers.dense(net, action_dim[0], activation=None)

        action_value_input = tf.concat([observation_ph, action_ph], axis=1)
        net = tf.layers.dense(action_value_input, 64, activation=tf.nn.relu)
        net = tf.layers.dense(net, 64, activation=tf.nn.relu)
        action_value = tf.layers.dense(net, 1, activation=None)

        return action, action_value

    actor = ts.policy.Deterministic(my_network, observation_placeholder=observation_ph,
                                    has_old_net=True)
    critic = ts.value_function.ActionValue(my_network, observation_placeholder=observation_ph,
                                           action_placeholder=action_ph, has_old_net=True)

You pass the function handle ``my_network`` to TianShou's policy and value network wrappers, along with
the corresponding placeholders. The ``has_old_net`` argument controls the construction of the old net
and is ``False`` by default. When set to ``True`` as in this tutorial, the ``actor`` and ``critic`` will
automatically create two sets of networks, the current network and the old net, and manage them
together.

The only behavior provided by the network wrappers on the old net is :func:`sync_weights`, which copies
the weights of the current network to the old net. Although this is sufficient for other scenarios with
old nets :cite:`mnih2015human,schulman2015trust,schulman2017proximal`, DDPG proposes a soft update of
the old nets. Therefore, TianShou provides an additional utility for such soft updates::

    soft_update_op = ts.get_soft_update_op(1e-2, [actor, critic])

For detailed usage please refer to the API doc of :func:`tianshou.core.utils.get_soft_update_op`. This
utility function gives you runnable TensorFlow ops that perform the soft update, i.e., you can simply do
``sess.run(soft_update_op)`` whenever you want a soft update.
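For intuition, the soft update of DDPG :cite:`lillicrap2015continuous` slowly tracks the current weights
as :math:`\theta' \leftarrow \tau \theta + (1 - \tau) \theta'` with a small :math:`\tau` (here
``1e-2``). A rough sketch of what such ops amount to in plain TensorFlow, assuming ``current_vars`` and
``old_vars`` are matching lists of variables (pairing them up is exactly the bookkeeping that
:func:`tianshou.core.utils.get_soft_update_op` spares you), would be::

    # conceptual sketch only; use ts.get_soft_update_op in practice
    tau = 1e-2
    manual_soft_update_op = [
        tf.assign(old_var, tau * current_var + (1. - tau) * old_var)
        for current_var, old_var in zip(current_vars, old_vars)
    ]
    # sess.run(manual_soft_update_op) then nudges the old net towards the current one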
Construct Optimization Methods
------------------------------

One of the two key differences between Deep RL and supervised learning is the optimization algorithms.
Contrary to existing Deep RL projects (OpenAI Baselines, Coach, keras-rl, rllab, TensorForce), which
wrap all the optimization operations in one class, we provide optimization techniques only to the least
necessary level, allowing natural combination of, for example, native TensorFlow optimizers and gradient
clipping operations. We identify three levels of optimization encapsulation, namely loss, gradient and
optimizer, and implement RL techniques at one of these levels.

TianShou's ``loss`` resembles ``tf.losses``, and to apply an L2 loss on the critic in DDPG you could
simply do::

    critic_loss = ts.losses.value_mse(critic)
    critic_optimizer = tf.train.AdamOptimizer(1e-3)
    critic_train_op = critic_optimizer.minimize(critic_loss, var_list=list(critic.trainable_variables))

.. note::

    The ``trainable_variables`` property of network wrappers returns a Python **set** rather than a
    Python list. This is for the cases where the actor and the critic have shared layers, so you have to
    explicitly convert it to a list.

For the deterministic policy gradient :cite:`lillicrap2015continuous`, which is difficult to
conceptualize as gradients over a loss function under TianShou's paradigm, we wrap it up at the
``gradient`` level, which directly computes and returns gradients as
:func:`tf.train.Optimizer.compute_gradients` does. It can then be seamlessly combined with
:func:`tf.train.Optimizer.apply_gradients` to optimize the actor::

    dpg_grads_vars = ts.opt.DPG(actor, critic)
    actor_optimizer = tf.train.AdamOptimizer(1e-3)
    actor_train_op = actor_optimizer.apply_gradients(dpg_grads_vars)

Specify Data Acquisition
------------------------

The other key difference between Deep RL and supervised learning is the data acquisition process.
Contrary to existing Deep RL projects (OpenAI Baselines, Coach, keras-rl, rllab, TensorForce), which mix
up data acquisition and all the optimization operations in one class, we separate it from optimization,
facilitating more opportunities for combination.

First, we instantiate a replay buffer to store the off-policy experiences::

    data_buffer = ts.data.VanillaReplayBuffer(capacity=10000, nstep=1)

All data buffers in TianShou store only the raw data of each episode, i.e., frames of data in the
canonical RL form of the tuple (observation, action, reward, done_flag). Such raw data have to be
processed before being fed to the optimization algorithms, so we specify the processing functions in a
Python list::

    process_functions = [ts.data.advantage_estimation.ddpg_return(actor, critic)]

We are now ready to fully specify the data acquisition process::

    data_collector = ts.data.DataCollector(
        env=env,
        policy=actor,
        data_buffer=data_buffer,
        process_functions=process_functions,
        managed_networks=[actor, critic]
    )

The ``process_functions`` should be a list of Python callables, and you could also implement your own
following the APIs in :mod:`tianshou.data.advantage_estimation`. You should also pass a Python list of
network wrappers, ``managed_networks`` (in this case ``[actor, critic]``), to ``DataCollector``, which
brings up the second paradigm of TianShou::

    All canonical RL placeholders (observation, action, return/advantage) are automatically managed by
    TianShou. You only have to create at most the placeholders for observation and action.

Other placeholders, such as the dropout ratio and batch-norm phase, should be managed by you, though. We
provide an entry ``my_feed_dict`` in all functions that may involve such cases.
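For example, if your network used the hypothetical ``keep_prob_ph`` dropout placeholder sketched in the
network section, it would not be filled in automatically; you would pass it yourself through
``my_feed_dict`` (assuming here that ``collect`` and ``next_batch`` are among the functions accepting
this entry)::

    # hypothetical usage; keep_prob_ph comes from the earlier dropout sketch and
    # is not part of this tutorial's actual networks
    data_collector.collect(num_timesteps=1, my_feed_dict={keep_prob_ph: 1.0})    # no dropout while acting
    minibatch = data_collector.next_batch(64, my_feed_dict={keep_prob_ph: 0.9})  # dropout while training

The tutorial's own networks have no such placeholders, so the training loop below simply omits
``my_feed_dict``.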
Start Training!
---------------

Finally, we are all set. Let the training begin::

    import time

    batch_size = 32  # minibatch size for DDPG updates; adjust as needed

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())

        # sync the current networks to the old nets
        actor.sync_weights()
        critic.sync_weights()

        start_time = time.time()
        data_collector.collect(num_timesteps=5000)  # warm-up
        for i in range(int(1e8)):
            # collect data
            data_collector.collect(num_timesteps=1, episode_cutoff=200)

            # train critic
            feed_dict = data_collector.next_batch(batch_size)
            sess.run(critic_train_op, feed_dict=feed_dict)

            # recompute the noiseless action for the actor update
            data_collector.denoise_action(feed_dict)

            # train actor
            sess.run(actor_train_op, feed_dict=feed_dict)

            # update target networks
            sess.run(soft_update_op)

            # test every 1000 training steps
            if i % 1000 == 0:
                print('Step {}, elapsed time: {:.1f} min'.format(i, (time.time() - start_time) / 60))
                ts.data.test_policy_in_env(actor, env, num_episodes=5, episode_cutoff=200)

Note that, to optimize the actor in DDPG, we have to use the noiseless action computed by the current
actor rather than the sampled action used during interaction with the environment, hence the call to
``data_collector.denoise_action(feed_dict)`` before running ``actor_train_op``.

We've made the effort so that the training process in TianShou also resembles conventional supervised
learning with TensorFlow. Our ``DataCollector`` automatically builds the ``feed_dict`` for the canonical
RL placeholders.

Enjoy and have fun!

.. rubric:: References

.. bibliography:: ../refs.bib
    :style: unsrtalpha