tianshou.data.advantage_estimation¶
- tianshou.data.advantage_estimation.full_return(buffer, indexes=None)[source]¶

  Naively compute the full undiscounted return on episodic data, \(G_t = \sum_{t'=t}^{T} r_{t'}\) (see the sketch after this entry). This function prints a warning when some of the episodes in buffer have not yet terminated.

  Parameters:

  - buffer – A tianshou.data.data_buffer.
  - indexes – Optional. Indexes of the data points on which the full return should be computed. If not set, it defaults to all the data points in buffer. Note that the indexes may come from a sampled minibatch, so they do not have to be in order within each episode.

  Returns: A dict with key ‘return’ and value the computed returns corresponding to indexes.
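  To make the computed quantity concrete, here is a minimal NumPy sketch of the undiscounted return-to-go for a single terminated episode. It is purely illustrative, not the library's implementation; the helper name full_return_reference and the raw reward array are assumptions of the example:

      import numpy as np

      def full_return_reference(rewards):
          """Undiscounted return-to-go G_t = sum_{t'=t}^{T} r_{t'} for one
          terminated episode, given a 1-D array of per-step rewards.
          Illustrative stand-in, not tianshou's implementation."""
          # A reversed cumulative sum accumulates the rewards from t to T.
          return np.cumsum(rewards[::-1])[::-1]

      rewards = np.array([1.0, 0.0, 2.0, 1.0])
      print(full_return_reference(rewards))   # [4. 3. 3. 1.]

  full_return itself reads the rewards out of buffer and would return the analogous result under the ‘return’ key.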
- class tianshou.data.advantage_estimation.nstep_return(n, value_function, return_advantage=False, discount_factor=0.99)[source]¶

  Bases: object

  Compute the n-step return from n-step rewards and the bootstrapped state value function V(s), \(V(s_t) = r_t + \gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})\) (see the sketch after this entry).

  Parameters:

  - n – An int. The number of steps to look ahead, where \(n=1\) will directly apply V(s) to the next observation, as in the above equation.
  - value_function – A tianshou.core.value_function.StateValue. The V(s) as in the above equation.
  - return_advantage – Optional. A bool defaulting to False. If True, this callable also returns the advantage function \(A(s_t) = r_t + \gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) - V(s_t)\) when called.
  - discount_factor – Optional. A float in the range \([0, 1]\) defaulting to 0.99. The discount factor \(\gamma\) as in the above equation.

  - __call__(buffer, indexes=None)[source]¶

    Parameters:

    - buffer – A tianshou.data.data_buffer.
    - indexes – Optional. Indexes of the data points on which the specified return should be computed. If not set, it defaults to all the data points in buffer. Note that the indexes may come from a sampled minibatch, so they do not have to be in order within each episode.

    Returns: A dict with key ‘return’ and value the computed returns corresponding to indexes. If return_advantage is set to True, the dict also contains a key ‘advantage’ with value the corresponding advantages.
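  The following NumPy sketch, again an illustration rather than the class itself, spells out the n-step bootstrapped return and the optional advantage for one terminated episode. The array values standing in for V(s_t) and the zero bootstrap past the terminal step are assumptions of the example:

      import numpy as np

      def nstep_return_reference(rewards, values, n, gamma=0.99):
          """G_t = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1}
          + gamma^n V(s_{t+n}) for one terminated episode. values[t]
          stands in for V(s_t); past the terminal step the bootstrap
          term is dropped (treated as 0). Illustrative stand-in only."""
          T = len(rewards)
          returns = np.zeros(T)
          for t in range(T):
              # Sum up to n rewards, truncated at the end of the episode.
              for k in range(min(n, T - t)):
                  returns[t] += gamma ** k * rewards[t + k]
              # Bootstrap with V(s_{t+n}) when that state exists.
              if t + n < T:
                  returns[t] += gamma ** n * values[t + n]
          return returns

      rewards = np.array([0.0, 0.0, 1.0])
      values = np.array([0.5, 0.7, 0.9])
      returns = nstep_return_reference(rewards, values, n=2)
      advantages = returns - values   # A(s_t) = G_t - V(s_t), cf. return_advantage=True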
- class tianshou.data.advantage_estimation.nstep_q_return(n, action_value, use_target_network=True, discount_factor=0.99)[source]¶

  Bases: object

  Compute the n-step return for Q-learning targets; for \(n=1\) it is \(G_t = r_t + \gamma \max_a Q'(s_{t+1}, a)\) (see the sketch after this entry).

  Parameters:

  - n – An int. The number of steps to look ahead, where \(n=1\) will directly apply \(Q'(s, \cdot)\) to the next observation, as in the above equation.
  - action_value – A tianshou.core.value_function.DQN. The \(Q'(s, \cdot)\) as in the above equation.
  - use_target_network – Optional. A bool defaulting to True. Whether to use the target network in the above equation.
  - discount_factor – Optional. A float in the range \([0, 1]\) defaulting to 0.99. The discount factor \(\gamma\) as in the above equation.

  - __call__(buffer, indexes=None)[source]¶

    Parameters:

    - buffer – A tianshou.data.data_buffer.
    - indexes – Optional. Indexes of the data points on which the specified return should be computed. If not set, it defaults to all the data points in buffer. Note that the indexes may come from a sampled minibatch, so they do not have to be in order within each episode.

    Returns: A dict with key ‘return’ and value the computed returns corresponding to indexes.
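  For the \(n=1\) case shown above, the target can be written out directly. The NumPy sketch below is illustrative only; the array q_next, holding \(Q'(s_{t+1}, a)\) for every action at the non-terminal steps, and the zero bootstrap at the terminal step are assumptions of the example:

      import numpy as np

      def one_step_q_target(rewards, q_next, gamma=0.99):
          """G_t = r_t + gamma * max_a Q'(s_{t+1}, a) for one terminated
          episode. q_next has shape (T-1, num_actions), with q_next[t]
          holding Q'(s_{t+1}, .) for the non-terminal steps; the final
          step bootstraps with 0. Illustrative stand-in only."""
          targets = np.asarray(rewards, dtype=float).copy()
          # The max over the action axis realises max_a Q'(s_{t+1}, a).
          targets[:-1] += gamma * q_next.max(axis=1)
          return targets

      rewards = [0.0, 0.0, 1.0]
      q_next = np.array([[0.2, 0.5], [0.1, 0.4]])   # Q' at s_1 and s_2
      print(one_step_q_target(rewards, q_next))     # [0.495 0.396 1.   ]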
- class tianshou.data.advantage_estimation.ddpg_return(actor, critic, use_target_network=True, discount_factor=0.99)[source]¶

  Bases: object

  Compute the return as in DDPG, \(G_t = r_t + \gamma Q'(s_{t+1}, \mu'(s_{t+1}))\), where \(Q'\) and \(\mu'\) are the target networks (see the sketch after this entry).

  Parameters:

  - actor – A tianshou.core.policy.Deterministic. A deterministic policy.
  - critic – A tianshou.core.value_function.ActionValue. An action value function Q(s, a).
  - use_target_network – Optional. A bool defaulting to True. Whether to use the target networks in the above equation.
  - discount_factor – Optional. A float in the range \([0, 1]\) defaulting to 0.99. The discount factor \(\gamma\) as in the above equation.

  - __call__(buffer, indexes=None)[source]¶

    Parameters:

    - buffer – A tianshou.data.data_buffer.
    - indexes – Optional. Indexes of the data points on which the specified return should be computed. If not set, it defaults to all the data points in buffer. Note that the indexes may come from a sampled minibatch, so they do not have to be in order within each episode.

    Returns: A dict with key ‘return’ and value the computed returns corresponding to indexes.
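  The same pattern applies to the DDPG return, with the greedy max replaced by the target actor's action. In the NumPy sketch below, the toy callables mu and q stand in for the target networks \(\mu'\) and \(Q'\), and the convention that next_obs[t] holds \(s_{t+1}\) for the non-terminal steps is an assumption of the example:

      import numpy as np

      def ddpg_target(rewards, next_obs, mu_target, q_target, gamma=0.99):
          """G_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})) for one
          terminated episode; the final transition bootstraps with 0.
          Illustrative stand-in, not the tianshou class."""
          targets = np.asarray(rewards, dtype=float).copy()
          actions = mu_target(next_obs)                 # mu'(s_{t+1})
          targets[:-1] += gamma * q_target(next_obs, actions)
          return targets

      def mu(s):                                        # toy target actor
          return 0.1 * s

      def q(s, a):                                      # toy target critic
          return (s * a).sum(axis=1)

      next_obs = np.ones((2, 2))                        # s_1, s_2 of a 3-step episode
      print(ddpg_target([0.0, 0.0, 1.0], next_obs, mu, q))   # [0.198 0.198 1.   ]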