tianshou.data.advantage_estimation

tianshou.data.advantage_estimation.full_return(buffer, indexes=None)[source]

Naively compute the full undiscounted return on episodic data, \(G_t = \sum_{t'=t}^{T} r_{t'}\). This function prints a warning when some of the episodes in buffer have not yet terminated.

Parameters:
  • buffer – A tianshou.data.data_buffer.
  • indexes – Optional. Indexes of data points on which the full return should be computed. If not set, it defaults to all the data points in buffer. Note that if the indexes come from a sampled minibatch, they need not be in order within each episode.
Returns:

A dict with key ‘return’ and value the computed returns corresponding to indexes.
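The per-timestep computation can be illustrated with a minimal NumPy sketch operating on one finished episode's reward array. This is an illustrative stand-in for the tail-sum logic, not the library implementation, which operates on a buffer and indexes:

```python
import numpy as np

def full_return(rewards):
    """Full undiscounted return: G_t = sum of rewards from step t to T.

    `rewards` is a 1-D array of per-step rewards for one finished episode.
    Illustrative sketch only, not the library implementation.
    """
    rewards = np.asarray(rewards, dtype=float)
    # A reversed cumulative sum yields the tail sum for every timestep.
    return np.cumsum(rewards[::-1])[::-1]
```

For rewards [1, 2, 3] this yields returns [6, 5, 3]: each entry is the sum of all rewards from that timestep to the end of the episode.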

class tianshou.data.advantage_estimation.nstep_return(n, value_function, return_advantage=False, discount_factor=0.99)[source]

Bases: object

Compute the n-step return from n-step rewards and a bootstrapped state value function V(s), \(G_t = r_t + \gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})\).

Parameters:
  • n – An int. The number of steps to look ahead, where \(n=1\) directly applies V(s) to the next observation, i.e. \(G_t = r_t + \gamma V(s_{t+1})\).
  • value_function – A tianshou.core.value_function.StateValue. The V(s) as in the above equation.
  • return_advantage – Optional. A bool defaulting to False. If True, then this callable also returns the advantage function \(A(s_t) = r_t + \gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) - V(s_t)\) when called.
  • discount_factor – Optional. A float in range \([0, 1]\) defaulting to 0.99. The discount factor \(\gamma\) as in the above equation.
__call__(buffer, indexes=None)[source]
Parameters:
  • buffer – A tianshou.data.data_buffer.
  • indexes – Optional. Indexes of data points on which the specified return should be computed. If not set, it defaults to all the data points in buffer. Note that if the indexes come from a sampled minibatch, they need not be in order within each episode.
Returns:

A dict with key ‘return’ and value the computed returns corresponding to indexes. If return_advantage is set to True, the dict also contains key ‘advantage’ with the corresponding advantages.
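The n-step bootstrapped target above can be sketched as follows, assuming per-episode arrays of rewards and precomputed state values (taking the bootstrap value past the terminal step to be 0). This is an illustrative sketch of the arithmetic, not the library implementation:

```python
import numpy as np

def nstep_return(rewards, values, n, gamma=0.99, return_advantage=False):
    """n-step return with a bootstrapped state value function:
    G_t = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n*V(s_{t+n}).

    `rewards[t]` is r_t and `values[t]` is V(s_t) for one finished episode.
    Illustrative sketch only, not the library implementation.
    """
    T = len(rewards)
    returns = np.zeros(T)
    for t in range(T):
        g = 0.0
        # Accumulate up to n discounted rewards, truncated at episode end.
        for k in range(min(n, T - t)):
            g += gamma ** k * rewards[t + k]
        # Bootstrap with V(s_{t+n}) only if that state exists in the episode.
        if t + n < T:
            g += gamma ** n * values[t + n]
        returns[t] = g
    if return_advantage:
        return returns, returns - np.asarray(values[:T], dtype=float)
    return returns
```

With return_advantage=True the second array is simply the n-step return minus V(s_t), matching the advantage formula in the parameter description.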

class tianshou.data.advantage_estimation.nstep_q_return(n, action_value, use_target_network=True, discount_factor=0.99)[source]

Bases: object

Compute the n-step return for Q-learning targets, \(G_t = r_t + \gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a Q'(s_{t+n}, a)\).

Parameters:
  • n – An int. The number of steps to look ahead, where \(n=1\) directly applies \(\max_a Q'(s, a)\) to the next observation, i.e. \(G_t = r_t + \gamma \max_a Q'(s_{t+1}, a)\).
  • action_value – A tianshou.core.value_function.DQN. The \(Q'(s, \cdot)\) as in the above equation.
  • use_target_network – Optional. A bool defaulting to True. Whether to use the target networks in the above equation.
  • discount_factor – Optional. A float in range \([0, 1]\) defaulting to 0.99. The discount factor \(\gamma\) as in the above equation.
__call__(buffer, indexes=None)[source]
Parameters:
  • buffer – A tianshou.data.data_buffer.
  • indexes – Optional. Indexes of data points on which the specified return should be computed. If not set, it defaults to all the data points in buffer. Note that if the indexes come from a sampled minibatch, they need not be in order within each episode.
Returns:

A dict with key ‘return’ and value the computed returns corresponding to indexes.
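The Q-learning target differs from the state-value case only in the bootstrap term, which takes a max over actions. A minimal sketch, assuming a (T, num_actions) array of target-network Q-values for one finished episode; illustrative only, not the library implementation:

```python
import numpy as np

def nstep_q_return(rewards, q_values, n, gamma=0.99):
    """n-step Q-learning target:
    G_t = r_t + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * max_a Q'(s_{t+n}, a).

    `q_values[t, a]` plays the role of Q'(s_t, a), typically evaluated with
    the target network. Illustrative sketch only.
    """
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        g = 0.0
        # Up to n discounted rewards, truncated at episode end.
        for k in range(min(n, T - t)):
            g += gamma ** k * rewards[t + k]
        # Bootstrap with the greedy target-network value if s_{t+n} exists.
        if t + n < T:
            g += gamma ** n * np.max(q_values[t + n])
        targets[t] = g
    return targets
```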

class tianshou.data.advantage_estimation.ddpg_return(actor, critic, use_target_network=True, discount_factor=0.99)[source]

Bases: object

Compute the return as in DDPG, \(G_t = r_t + \gamma Q'(s_{t+1}, \mu'(s_{t+1}))\), where \(Q'\) and \(\mu'\) are the target networks.

Parameters:
  • actor – A tianshou.core.policy.Deterministic. A deterministic policy.
  • critic – A tianshou.core.value_function.ActionValue. An action value function Q(s, a).
  • use_target_network – Optional. A bool defaulting to True. Whether to use the target networks in the above equation.
  • discount_factor – Optional. A float in range \([0, 1]\) defaulting to 0.99. The discount factor \(\gamma\) as in the above equation.
__call__(buffer, indexes=None)[source]
Parameters:
  • buffer – A tianshou.data.data_buffer.
  • indexes – Optional. Indexes of data points on which the specified return should be computed. If not set, it defaults to all the data points in buffer. Note that if the indexes come from a sampled minibatch, they need not be in order within each episode.
Returns:

A dict with key ‘return’ and value the computed returns corresponding to indexes.
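The DDPG target chains the target actor through the target critic. A minimal sketch under assumed interfaces: `target_actor` is a hypothetical callable mapping observations to actions (\(\mu'\)) and `target_critic` a hypothetical callable mapping (observation, action) pairs to scalar values (\(Q'\)); this illustrates the computation, not the library implementation:

```python
import numpy as np

def ddpg_return(rewards, next_obs, target_actor, target_critic,
                gamma=0.99, dones=None):
    """DDPG target: G_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})).

    `target_actor` and `target_critic` are assumed vectorized callables
    standing in for the target networks. Illustrative sketch only.
    """
    rewards = np.asarray(rewards, dtype=float)
    if dones is None:
        dones = np.zeros_like(rewards)
    actions = target_actor(next_obs)           # mu'(s_{t+1})
    q_next = target_critic(next_obs, actions)  # Q'(s_{t+1}, mu'(s_{t+1}))
    # Zero out the bootstrap term at terminal transitions.
    return rewards + gamma * (1.0 - np.asarray(dones, dtype=float)) * q_next
```

In the actual algorithm both callables are slowly updated copies of the learned networks, which stabilizes the bootstrap target.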