tianshou.core.losses

tianshou.core.losses.ppo_clip(policy, clip_param)[source]

Builds the graph of the clipped loss \(L^{CLIP}\) as in the PPO paper, which is basically \(-\min(r_t(\theta)A_t, \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t)\). We minimize the objective instead of maximizing, hence the leading negative sign. It creates an action placeholder and an advantage placeholder and adds them to the managed_placeholders of the policy.

Parameters:
  • policy – A tianshou.core.policy to be optimized.
  • clip_param – A float or Tensor of type float. The \(\epsilon\) in the loss equation.
Returns:

A scalar float Tensor of the loss.
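
To make the formula concrete, here is a minimal NumPy sketch of the clipped surrogate itself; the function name, argument names, and example numbers are illustrative assumptions, not the graph that ppo_clip actually builds:

    import numpy as np

    def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_param=0.2):
        """Clipped surrogate, negated so minimizing it maximizes L^CLIP (illustrative only)."""
        ratio = np.exp(log_prob_new - log_prob_old)                      # r_t(theta)
        clipped_ratio = np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param)
        # Elementwise minimum of the unclipped and clipped objectives, then negate and average.
        surrogate = np.minimum(ratio * advantage, clipped_ratio * advantage)
        return -np.mean(surrogate)

    loss = ppo_clip_loss(
        log_prob_new=np.array([-0.9, -1.2, -0.3]),
        log_prob_old=np.array([-1.0, -1.0, -0.5]),
        advantage=np.array([1.5, -0.7, 0.2]),
    )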

tianshou.core.losses.REINFORCE(policy)[source]

Builds the graph of the loss function as used in vanilla policy gradient algorithms, i.e., REINFORCE. The loss is basically \(-\log \pi(a_t|s_t) A_t\). We minimize the objective instead of maximizing, hence the leading negative sign. It creates an action placeholder and an advantage placeholder and adds them to the managed_placeholders of the policy.

Parameters: policy – A tianshou.core.policy to be optimized.
Returns: A scalar float Tensor of the loss.
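
A minimal NumPy sketch of this surrogate, with assumed argument names and numbers rather than the placeholders created by REINFORCE:

    import numpy as np

    def reinforce_loss(log_prob, advantage):
        """Advantage-weighted negative log-likelihood (illustrative only)."""
        # Minimizing -E[log pi(a_t|s_t) * A_t] ascends the policy gradient objective.
        return -np.mean(log_prob * advantage)

    loss = reinforce_loss(np.array([-1.0, -0.5]), np.array([2.0, -1.0]))
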
tianshou.core.losses.value_mse(value_function)[source]

Builds the graph of the L2 loss on value functions, e.g., for training critics or DQN. It creates a placeholder for the target value and adds it to the managed_placeholders of the value_function.

Parameters: value_function – A tianshou.core.value_function to be optimized.
Returns: A scalar float Tensor of the loss.
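
The loss itself is a plain mean squared error; a minimal NumPy sketch with assumed argument names (the library instead reads predictions and targets from the graph and its placeholder):

    import numpy as np

    def value_mse_loss(predicted_value, target_value):
        """Mean squared error between value predictions and targets (illustrative only)."""
        return np.mean((predicted_value - target_value) ** 2)

    loss = value_mse_loss(np.array([1.0, 2.5, 0.0]), np.array([1.5, 2.0, 0.0]))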

tianshou.core.opt

tianshou.core.opt.DPG(policy, action_value)[source]

Constructs the gradient Tensors of the deterministic policy gradient.

Parameters:
  • policy – A tianshou.core.policy.Deterministic to be optimized.
  • action_value – A tianshou.core.value_function.ActionValue to guide the optimization of the policy.
Returns:

A list of (gradient, variable) pairs.
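
The returned pairs correspond to the deterministic policy gradient \(\nabla_a Q(s, a)\big|_{a=\mu_\theta(s)} \nabla_\theta \mu_\theta(s)\). The following NumPy sketch works this chain rule out for an assumed toy linear policy and a hand-picked quadratic \(Q\); it illustrates the math only, not how DPG builds its graph:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(2, 3))      # parameters of a toy linear policy mu(s) = W s
    s = rng.normal(size=3)           # a single observation
    a_star = np.array([1.0, -1.0])   # maximizer of the toy Q(s, a) = -||a - a_star||^2

    a = W @ s                        # deterministic action a = mu(s)
    dq_da = -2.0 * (a - a_star)      # gradient of Q w.r.t. the action
    # Chain rule through the linear policy: dJ/dW_ij = (dQ/da_i) * s_j,
    # i.e. an outer product, giving one (gradient, variable) pair.
    grad_W = np.outer(dq_da, s)
    grad_and_vars = [(grad_W, W)]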