tianshou.core.policy
Base class
Deterministic policy
class tianshou.core.policy.deterministic.Deterministic(network_callable, observation_placeholder, has_old_net=False, random_process=None)

Bases: tianshou.core.policy.base.PolicyBase

Deterministic policy as used in deterministic policy gradient (DDPG) methods. It can only be used with a continuous action space. The output of the policy network is directly the action.
Parameters:

- network_callable – A Python callable returning (action head, value head). When called, it builds the tf graph and returns a Tensor of the action on the action head. (A construction sketch follows this parameter list.)
- observation_placeholder – A tf.placeholder. The observation placeholder of the network graph.
- has_old_net – A bool defaulting to False. If True, this class will create another graph with another set of tf.Variables to be the “old net”. The “old net” could be the target networks as in DQN and DDPG, or just an old net to help optimization as in PPO.
- random_process – Optional. A RandomProcess. The additional random process for exploration. Defaults to an OrnsteinUhlenbeckProcess with \(\theta=0.15\) and \(\sigma=0.3\) if not set explicitly.
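A minimal construction sketch, assuming a TensorFlow 1.x graph-mode setup; the network function my_network, the placeholder shapes, and the choice to omit the value head are illustrative assumptions, not part of the documented API:

    import tensorflow as tf
    from tianshou.core.policy.deterministic import Deterministic

    observation_dim = (3,)   # hypothetical observation shape
    action_dim = 1           # hypothetical action dimension

    observation_ph = tf.placeholder(tf.float32, shape=(None,) + observation_dim)

    def my_network():
        # builds the tf graph and returns (action head, value head)
        net = tf.layers.dense(observation_ph, 32, activation=tf.nn.relu)
        action = tf.layers.dense(net, action_dim, activation=tf.nn.tanh)
        return action, None  # value head omitted in this sketch

    # random_process is not given, so the default OrnsteinUhlenbeckProcess is used
    actor = Deterministic(my_network,
                          observation_placeholder=observation_ph,
                          has_old_net=True)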
act(observation, my_feed_dict={})

Return action given observation, adding the exploration noise sampled from self.random_process.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
act_test(observation, my_feed_dict={})

Return action given observation, without adding the exploration noise.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
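A usage sketch continuing from the construction example above, and assuming a tf.Session has been created and variables initialized: act() adds the exploration noise while act_test() does not, and both take a single un-batched observation:

    import numpy as np

    obs = np.zeros(observation_dim, dtype=np.float32)  # one observation, no batch dimension

    noisy_action = actor.act(obs)        # exploration noise from self.random_process added
    greedy_action = actor.act_test(obs)  # same forward pass, without the noise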
eval_action(observation, my_feed_dict={})

Evaluate actions in a minibatch using the current network.

Parameters:

- observation – An array-like. Contrary to act() and act_test(), it has an explicit batch_size dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array with the batch_size dimension, with the same batch_size as observation.
eval_action_old(observation, my_feed_dict={})

Evaluate actions in a minibatch using the old net.

Parameters:

- observation – An array-like. Contrary to act() and act_test(), it has an explicit batch_size dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array with the batch_size dimension, with the same batch_size as observation.
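Continuing the same sketch, eval_action() and eval_action_old() operate on a whole minibatch, so the observation array carries an explicit batch_size dimension:

    batch_obs = np.zeros((64,) + observation_dim, dtype=np.float32)  # batch_size = 64

    actions_now = actor.eval_action(batch_obs)      # actions from the current network
    actions_old = actor.eval_action_old(batch_obs)  # actions from the old (“target”) network
    # both results keep the batch_size dimension (64 here)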
trainable_variables

The trainable variables of the policy, as a Python set. It contains only the tf.Variables that affect the action.
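The trainable_variables property is typically what an optimizer minimizes over. In the sketch below the loss is only a stand-in so the snippet runs; a real DDPG actor loss would come from a critic:

    # stand-in scalar loss built from the policy's own variables (illustrative only)
    actor_loss = tf.add_n([tf.nn.l2_loss(v) for v in actor.trainable_variables])

    optimizer = tf.train.AdamOptimizer(1e-3)
    train_op = optimizer.minimize(actor_loss, var_list=list(actor.trainable_variables))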
Distributional policy
class tianshou.core.policy.distributional.Distributional(network_callable, observation_placeholder, has_old_net=False)

Bases: tianshou.core.policy.base.PolicyBase

Policy class where the action is specified by a probability distribution. Depending on the distribution, it can be applied to both continuous and discrete action spaces.
Parameters:

- network_callable – A Python callable returning (action head, value head). When called, it builds the tf graph and returns a tf.distributions.Distribution over the action space on the action head. (A construction sketch follows this parameter list.)
- observation_placeholder – A tf.placeholder. The observation placeholder of the network graph.
- has_old_net – A bool defaulting to False. If True, this class will create another graph with another set of tf.Variables to be the “old net”. The “old net” could be the target networks as in DQN and DDPG, or just an old net to help optimization as in PPO.
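A minimal construction sketch for a discrete action space, assuming TF 1.x; the network function, the observation shape, and the omitted value head are illustrative assumptions. The key point is that the action head returned by the callable is a tf.distributions.Distribution:

    import tensorflow as tf
    from tianshou.core.policy.distributional import Distributional

    num_actions = 4  # hypothetical number of discrete actions
    obs_ph = tf.placeholder(tf.float32, shape=(None, 8))

    def my_categorical_network():
        net = tf.layers.dense(obs_ph, 64, activation=tf.nn.relu)
        logits = tf.layers.dense(net, num_actions)
        action_dist = tf.distributions.Categorical(logits=logits)  # the action head
        return action_dist, None  # value head omitted in this sketch

    dist_policy = Distributional(my_categorical_network,
                                 observation_placeholder=obs_ph,
                                 has_old_net=True)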
act(observation, my_feed_dict={})

Return action given observation, directly sampling from the action distribution.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
act_test(observation, my_feed_dict={})

Return action given observation, directly sampling from the action distribution.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
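For this policy both act() and act_test() draw a sample from the action distribution; continuing the sketch above, and again assuming a session has been created and variables initialized:

    import numpy as np

    single_obs = np.zeros(8, dtype=np.float32)        # one un-batched observation
    sampled_action = dist_policy.act(single_obs)      # sampled from the Categorical over num_actions actions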
reset()

Reset the internal states of the policy. Does nothing by default.
trainable_variables

The trainable variables of the policy, as a Python set. It contains only the tf.Variables that affect the action.
DQN policy
class tianshou.core.policy.dqn.DQN(dqn, epsilon_train=0.1, epsilon_test=0.05)

Bases: tianshou.core.policy.base.PolicyBase

Policy derived from a Deep-Q Network (DQN). It should be constructed from a tianshou.core.value_function.DQN, which it uses as a member. The action is the argmax of the Q-values (usually with further \(\epsilon\)-greedy exploration). It can only be applied to discrete action spaces.

Parameters:

- dqn – A tianshou.core.value_function.DQN. The Q-value network from which to derive this policy. (A construction sketch follows this parameter list.)
- epsilon_train – A float in range \([0, 1]\). The \(\epsilon\) used in \(\epsilon\)-greedy during training while interacting with the environment.
- epsilon_test – A float in range \([0, 1]\). The \(\epsilon\) used in \(\epsilon\)-greedy during testing while interacting with the environment.
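A hedged construction sketch. Building the Q-value network itself is outside the scope of this page, so dqn_value_net below is assumed to be an already-constructed tianshou.core.value_function.DQN; only the wrapping into a policy is shown:

    from tianshou.core.policy.dqn import DQN as DQNPolicy

    # dqn_value_net: a tianshou.core.value_function.DQN built elsewhere (assumed)
    dqn_policy = DQNPolicy(dqn_value_net, epsilon_train=0.1, epsilon_test=0.05)

    # dqn_policy.act(...) explores with epsilon_train; dqn_policy.act_test(...) uses epsilon_test
    # dqn_policy.q_net recovers the underlying value network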
act(observation, my_feed_dict={})

Return action given observation, with \(\epsilon\)-greedy exploration using self.epsilon_train.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
act_test(observation, my_feed_dict={})

Return action given observation, with \(\epsilon\)-greedy exploration using self.epsilon_test.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
q_net

The DQN (tianshou.core.value_function.DQN) that this policy is based on.
reset()

Reset the internal states of the policy. Does nothing by default.
set_epsilon_test(epsilon)

Set the \(\epsilon\) used in \(\epsilon\)-greedy during testing.

Parameters:

- epsilon – A float in range \([0, 1]\).
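A usage note: the test-time exploration rate can be adjusted after construction, e.g. to anneal it over training; a one-line sketch continuing from the example above:

    dqn_policy.set_epsilon_test(0.01)  # act_test() will now use epsilon = 0.01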