tianshou.data.advantage_estimation

tianshou.data.advantage_estimation.full_return(buffer, indexes=None)[source]

Naively compute the full undiscounted return on episodic data, \(G_t = \sum_{t'=t}^{T} r_{t'}\). This function prints a warning when some of the episodes in buffer have not yet terminated.

Parameters:
  • buffer – A tianshou.data.data_buffer.
  • indexes – Optional. Indexes of data points on which the full return should be computed. If not set, it defaults to all the data points in buffer. Note that if the indexes come from a sampled minibatch, they need not be in order within each episode.
Returns:

A dict with key ‘return’ and value the computed returns corresponding to indexes.
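The per-timestep computation can be illustrated with a minimal NumPy sketch operating on one finished episode's reward array. This is an illustrative stand-in for the tail-sum logic, not the library implementation, which operates on a buffer and indexes:

```python
import numpy as np

def full_return(rewards):
    """Full undiscounted return: G_t = sum of rewards from step t to T.

    `rewards` is a 1-D array of per-step rewards for one finished episode.
    Illustrative sketch only, not the library implementation.
    """
    rewards = np.asarray(rewards, dtype=float)
    # A reversed cumulative sum yields the tail sum for every timestep.
    return np.cumsum(rewards[::-1])[::-1]
```

For rewards [1, 2, 3] this yields returns [6, 5, 3]: each entry is the sum of all rewards from that timestep to the end of the episode.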

class tianshou.data.advantage_estimation.nstep_return(n, value_function, return_advantage=False, discount_factor=0.99)[source]

Bases: object

Compute the n-step return from n-step rewards and a bootstrapped state value function V(s), \(G_t = r_t + \gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})\).

Parameters:
  • n – An int. The number of steps to look ahead, where \(n=1\) directly applies V(s) to the next observation, i.e. \(G_t = r_t + \gamma V(s_{t+1})\).
  • value_function – A tianshou.core.value_function.StateValue. The V(s) as in the above equation.
  • return_advantage – Optional. A bool defaulting to False. If True, then this callable also returns the advantage function \(A(s_t) = r_t + \gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) - V(s_t)\) when called.
  • discount_factor – Optional. A float in range \([0, 1]\) defaulting to 0.99. The discount factor \(\gamma\) as in the above equation.
__call__(buffer, indexes=None)[source]
Parameters:
  • buffer – A tianshou.data.data_buffer.
  • indexes – Optional. Indexes of data points on which the specified return should be computed. If not set, it defaults to all the data points in buffer. Note that if the indexes come from a sampled minibatch, they need not be in order within each episode.
Returns:

A dict with key ‘return’ and value the computed returns corresponding to indexes. If return_advantage is set to True, the dict also contains key ‘advantage’ with the corresponding advantages.
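The n-step bootstrapped target above can be sketched as follows, assuming per-episode arrays of rewards and precomputed state values (taking the bootstrap value past the terminal step to be 0). This is an illustrative sketch of the arithmetic, not the library implementation:

```python
import numpy as np

def nstep_return(rewards, values, n, gamma=0.99, return_advantage=False):
    """n-step return with a bootstrapped state value function:
    G_t = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n*V(s_{t+n}).

    `rewards[t]` is r_t and `values[t]` is V(s_t) for one finished episode.
    Illustrative sketch only, not the library implementation.
    """
    T = len(rewards)
    returns = np.zeros(T)
    for t in range(T):
        g = 0.0
        # Accumulate up to n discounted rewards, truncated at episode end.
        for k in range(min(n, T - t)):
            g += gamma ** k * rewards[t + k]
        # Bootstrap with V(s_{t+n}) only if that state exists in the episode.
        if t + n < T:
            g += gamma ** n * values[t + n]
        returns[t] = g
    if return_advantage:
        return returns, returns - np.asarray(values[:T], dtype=float)
    return returns
```

With return_advantage=True the second array is simply the n-step return minus V(s_t), matching the advantage formula in the parameter description.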

class tianshou.data.advantage_estimation.nstep_q_return(n, action_value, use_target_network=True, discount_factor=0.99)[source]

Bases: object

Compute the n-step return for Q-learning targets, \(G_t = r_t + \gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a Q'(s_{t+n}, a)\).

Parameters:
  • n – An int. The number of steps to look ahead, where \(n=1\) directly applies \(\max_a Q'(s, a)\) to the next observation, i.e. \(G_t = r_t + \gamma \max_a Q'(s_{t+1}, a)\).
  • action_value – A tianshou.core.value_function.DQN. The \(Q'(s, \cdot)\) as in the above equation.
  • use_target_network – Optional. A bool defaulting to True. Whether to use the target networks in the above equation.
  • discount_factor – Optional. A float in range \([0, 1]\) defaulting to 0.99. The discount factor \(\gamma\) as in the above equation.
__call__(buffer, indexes=None)[source]
Parameters:
  • buffer – A tianshou.data.data_buffer.
  • indexes – Optional. Indexes of data points on which the specified return should be computed. If not set, it defaults to all the data points in buffer. Note that if the indexes come from a sampled minibatch, they need not be in order within each episode.
Returns:

A dict with key ‘return’ and value the computed returns corresponding to indexes.
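The Q-learning target differs from the state-value case only in the bootstrap term, which takes a max over actions. A minimal sketch, assuming a (T, num_actions) array of target-network Q-values for one finished episode; illustrative only, not the library implementation:

```python
import numpy as np

def nstep_q_return(rewards, q_values, n, gamma=0.99):
    """n-step Q-learning target:
    G_t = r_t + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * max_a Q'(s_{t+n}, a).

    `q_values[t, a]` plays the role of Q'(s_t, a), typically evaluated with
    the target network. Illustrative sketch only.
    """
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        g = 0.0
        # Up to n discounted rewards, truncated at episode end.
        for k in range(min(n, T - t)):
            g += gamma ** k * rewards[t + k]
        # Bootstrap with the greedy target-network value if s_{t+n} exists.
        if t + n < T:
            g += gamma ** n * np.max(q_values[t + n])
        targets[t] = g
    return targets
```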

class tianshou.data.advantage_estimation.ddpg_return(actor, critic, use_target_network=True, discount_factor=0.99)[source]

Bases: object

Compute the return as in DDPG, \(G_t = r_t + \gamma Q'(s_{t+1}, \mu'(s_{t+1}))\), where \(Q'\) and \(\mu'\) are the target networks.

Parameters:
  • actor – A tianshou.core.policy.Deterministic. A deterministic policy.
  • critic – A tianshou.core.value_function.ActionValue. An action value function Q(s, a).
  • use_target_network – Optional. A bool defaulting to True. Whether to use the target networks in the above equation.
  • discount_factor – Optional. A float in range \([0, 1]\) defaulting to 0.99. The discount factor \(\gamma\) as in the above equation.
__call__(buffer, indexes=None)[source]
Parameters:
  • buffer – A tianshou.data.data_buffer.
  • indexes – Optional. Indexes of data points on which the specified return should be computed. If not set, it defaults to all the data points in buffer. Note that if the indexes come from a sampled minibatch, they need not be in order within each episode.
Returns:

A dict with key ‘return’ and value the computed returns corresponding to indexes.
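The DDPG target chains the target actor through the target critic. A minimal sketch under assumed interfaces: `target_actor` is a hypothetical callable mapping observations to actions (\(\mu'\)) and `target_critic` a hypothetical callable mapping (observation, action) pairs to scalar values (\(Q'\)); this illustrates the computation, not the library implementation:

```python
import numpy as np

def ddpg_return(rewards, next_obs, target_actor, target_critic,
                gamma=0.99, dones=None):
    """DDPG target: G_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})).

    `target_actor` and `target_critic` are assumed vectorized callables
    standing in for the target networks. Illustrative sketch only.
    """
    rewards = np.asarray(rewards, dtype=float)
    if dones is None:
        dones = np.zeros_like(rewards)
    actions = target_actor(next_obs)           # mu'(s_{t+1})
    q_next = target_critic(next_obs, actions)  # Q'(s_{t+1}, mu'(s_{t+1}))
    # Zero out the bootstrap term at terminal transitions.
    return rewards + gamma * (1.0 - np.asarray(dones, dtype=float)) * q_next
```

In the actual algorithm both callables are slowly updated copies of the learned networks, which stabilizes the bootstrap target.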