tianshou.core.policy
Base class
Deterministic policy
class tianshou.core.policy.deterministic.Deterministic(network_callable, observation_placeholder, has_old_net=False, random_process=None)

Bases: tianshou.core.policy.base.PolicyBase

Deterministic policy as used in deterministic policy gradient (DDPG) methods. It can only be used with a continuous action space. The output of the policy network is directly the action.
Parameters:

- network_callable – A Python callable returning (action head, value head). When called, it builds the tf graph and returns a Tensor of the action on the action head. (A construction sketch follows this parameter list.)
- observation_placeholder – A tf.placeholder. The observation placeholder of the network graph.
- has_old_net – A bool defaulting to False. If True, this class will create another graph with another set of tf.Variables to be the “old net”. The “old net” could be the target networks as in DQN and DDPG, or just an old net to help optimization as in PPO.
- random_process – Optional. A RandomProcess. The additional random process for exploration. Defaults to an OrnsteinUhlenbeckProcess with \(\theta=0.15\) and \(\sigma=0.3\) if not set explicitly.
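A minimal construction sketch, assuming a TensorFlow 1.x graph-mode setup; the network function my_network, the placeholder shapes, and the choice to omit the value head are illustrative assumptions, not part of the documented API:

    import tensorflow as tf
    from tianshou.core.policy.deterministic import Deterministic

    observation_dim = (3,)   # hypothetical observation shape
    action_dim = 1           # hypothetical action dimension

    observation_ph = tf.placeholder(tf.float32, shape=(None,) + observation_dim)

    def my_network():
        # builds the tf graph and returns (action head, value head)
        net = tf.layers.dense(observation_ph, 32, activation=tf.nn.relu)
        action = tf.layers.dense(net, action_dim, activation=tf.nn.tanh)
        return action, None  # value head omitted in this sketch

    # random_process is not given, so the default OrnsteinUhlenbeckProcess is used
    actor = Deterministic(my_network,
                          observation_placeholder=observation_ph,
                          has_old_net=True)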
act(observation, my_feed_dict={})

Return action given observation, adding the exploration noise sampled from self.random_process.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
act_test(observation, my_feed_dict={})

Return action given observation, without adding the exploration noise.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
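A usage sketch continuing from the construction example above, and assuming a tf.Session has been created and variables initialized: act() adds the exploration noise while act_test() does not, and both take a single un-batched observation:

    import numpy as np

    obs = np.zeros(observation_dim, dtype=np.float32)  # one observation, no batch dimension

    noisy_action = actor.act(obs)        # exploration noise from self.random_process added
    greedy_action = actor.act_test(obs)  # same forward pass, without the noise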
eval_action(observation, my_feed_dict={})

Evaluate actions in a minibatch using the current network.

Parameters:

- observation – An array-like. Contrary to act() and act_test(), it has an explicit batch_size dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array with the batch_size dimension, with the same batch_size as observation.
eval_action_old(observation, my_feed_dict={})

Evaluate actions in a minibatch using the old net.

Parameters:

- observation – An array-like. Contrary to act() and act_test(), it has an explicit batch_size dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array with the batch_size dimension, with the same batch_size as observation.
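Continuing the same sketch, eval_action() and eval_action_old() operate on a whole minibatch, so the observation array carries an explicit batch_size dimension:

    batch_obs = np.zeros((64,) + observation_dim, dtype=np.float32)  # batch_size = 64

    actions_now = actor.eval_action(batch_obs)      # actions from the current network
    actions_old = actor.eval_action_old(batch_obs)  # actions from the old (“target”) network
    # both results keep the batch_size dimension (64 here)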
trainable_variables

The trainable variables of the policy, as a Python set. It contains only the tf.Variables that affect the action.
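The trainable_variables property is typically what an optimizer minimizes over. In the sketch below the loss is only a stand-in so the snippet runs; a real DDPG actor loss would come from a critic:

    # stand-in scalar loss built from the policy's own variables (illustrative only)
    actor_loss = tf.add_n([tf.nn.l2_loss(v) for v in actor.trainable_variables])

    optimizer = tf.train.AdamOptimizer(1e-3)
    train_op = optimizer.minimize(actor_loss, var_list=list(actor.trainable_variables))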
Distributional policy
class tianshou.core.policy.distributional.Distributional(network_callable, observation_placeholder, has_old_net=False)

Bases: tianshou.core.policy.base.PolicyBase

Policy class where the action is specified by a probability distribution. Depending on the distribution, it can be applied to both continuous and discrete action spaces.
Parameters:

- network_callable – A Python callable returning (action head, value head). When called, it builds the tf graph and returns a tf.distributions.Distribution over the action space on the action head. (A construction sketch follows this parameter list.)
- observation_placeholder – A tf.placeholder. The observation placeholder of the network graph.
- has_old_net – A bool defaulting to False. If True, this class will create another graph with another set of tf.Variables to be the “old net”. The “old net” could be the target networks as in DQN and DDPG, or just an old net to help optimization as in PPO.
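A minimal construction sketch for a discrete action space, assuming TF 1.x; the network function, the observation shape, and the omitted value head are illustrative assumptions. The key point is that the action head returned by the callable is a tf.distributions.Distribution:

    import tensorflow as tf
    from tianshou.core.policy.distributional import Distributional

    num_actions = 4  # hypothetical number of discrete actions
    obs_ph = tf.placeholder(tf.float32, shape=(None, 8))

    def my_categorical_network():
        net = tf.layers.dense(obs_ph, 64, activation=tf.nn.relu)
        logits = tf.layers.dense(net, num_actions)
        action_dist = tf.distributions.Categorical(logits=logits)  # the action head
        return action_dist, None  # value head omitted in this sketch

    dist_policy = Distributional(my_categorical_network,
                                 observation_placeholder=obs_ph,
                                 has_old_net=True)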
act(observation, my_feed_dict={})

Return action given observation, directly sampling from the action distribution.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
act_test(observation, my_feed_dict={})

Return action given observation, directly sampling from the action distribution.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
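For this policy both act() and act_test() draw a sample from the action distribution; continuing the sketch above, and again assuming a session has been created and variables initialized:

    import numpy as np

    single_obs = np.zeros(8, dtype=np.float32)        # one un-batched observation
    sampled_action = dist_policy.act(single_obs)      # sampled from the Categorical over num_actions actions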
reset()

Reset the internal states of the policy. Does nothing by default.
trainable_variables

The trainable variables of the policy, as a Python set. It contains only the tf.Variables that affect the action.
DQN policy
class tianshou.core.policy.dqn.DQN(dqn, epsilon_train=0.1, epsilon_test=0.05)

Bases: tianshou.core.policy.base.PolicyBase

Policy derived from a Deep-Q Network (DQN). It should be constructed from a tianshou.core.value_function.DQN, which it uses as a member. The action is the argmax of the Q-values (usually with further \(\epsilon\)-greedy exploration). It can only be applied to discrete action spaces.

Parameters:

- dqn – A tianshou.core.value_function.DQN. The Q-value network from which to derive this policy. (A construction sketch follows this parameter list.)
- epsilon_train – A float in range \([0, 1]\). The \(\epsilon\) used in \(\epsilon\)-greedy during training while interacting with the environment.
- epsilon_test – A float in range \([0, 1]\). The \(\epsilon\) used in \(\epsilon\)-greedy during testing while interacting with the environment.
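A hedged construction sketch. Building the Q-value network itself is outside the scope of this page, so dqn_value_net below is assumed to be an already-constructed tianshou.core.value_function.DQN; only the wrapping into a policy is shown:

    from tianshou.core.policy.dqn import DQN as DQNPolicy

    # dqn_value_net: a tianshou.core.value_function.DQN built elsewhere (assumed)
    dqn_policy = DQNPolicy(dqn_value_net, epsilon_train=0.1, epsilon_test=0.05)

    # dqn_policy.act(...) explores with epsilon_train; dqn_policy.act_test(...) uses epsilon_test
    # dqn_policy.q_net recovers the underlying value network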
act(observation, my_feed_dict={})

Return action given observation, with \(\epsilon\)-greedy exploration using self.epsilon_train.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
act_test(observation, my_feed_dict={})

Return action given observation, with \(\epsilon\)-greedy exploration using self.epsilon_test.

Parameters:

- observation – An array-like with the same rank as a single observation of the environment. Its “batch_size” is 1, but should not be explicitly set; this method adds the “batch_size” dimension as the first dimension.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies other placeholders, such as those for dropout and batch normalization, apart from the observation.

Returns: A numpy array: the action given the single observation. Its “batch_size” is 1, but the batch dimension is not explicitly included.
q_net

The DQN (tianshou.core.value_function.DQN) that this policy is based on.
reset()

Reset the internal states of the policy. Does nothing by default.
set_epsilon_test(epsilon)

Set the \(\epsilon\) used in \(\epsilon\)-greedy during testing.

Parameters:

- epsilon – A float in range \([0, 1]\).
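A usage note: the test-time exploration rate can be adjusted after construction, e.g. to anneal it over training; a one-line sketch continuing from the example above:

    dqn_policy.set_epsilon_test(0.01)  # act_test() will now use epsilon = 0.01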