tianshou.core.losses

tianshou.core.losses.ppo_clip(policy, clip_param)[source]

Builds the graph of the clipped loss \(L^{CLIP}\) as in the PPO paper, which is basically \(-\min(r_t(\theta)A_t, \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t)\). We minimize the objective instead of maximizing, hence the leading negative sign. It creates an action placeholder and an advantage placeholder and adds them to the managed_placeholders of the policy.

Parameters:
  • policy – A tianshou.core.policy to be optimized.
  • clip_param – A float or Tensor of type float. The \(\epsilon\) in the loss equation.
Returns:

A scalar float Tensor of the loss.
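
To make the formula concrete, here is a minimal NumPy sketch of the clipped surrogate itself; the function name, argument names, and example numbers are illustrative assumptions, not the graph that ppo_clip actually builds:

    import numpy as np

    def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_param=0.2):
        """Clipped surrogate, negated so minimizing it maximizes L^CLIP (illustrative only)."""
        ratio = np.exp(log_prob_new - log_prob_old)                      # r_t(theta)
        clipped_ratio = np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param)
        # Elementwise minimum of the unclipped and clipped objectives, then negate and average.
        surrogate = np.minimum(ratio * advantage, clipped_ratio * advantage)
        return -np.mean(surrogate)

    loss = ppo_clip_loss(
        log_prob_new=np.array([-0.9, -1.2, -0.3]),
        log_prob_old=np.array([-1.0, -1.0, -0.5]),
        advantage=np.array([1.5, -0.7, 0.2]),
    )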

tianshou.core.losses.REINFORCE(policy)[source]

Builds the graph of the loss function as used in vanilla policy gradient algorithms, i.e., REINFORCE. The loss is basically \(-\log \pi(a_t|s_t) A_t\). We minimize the objective instead of maximizing, hence the leading negative sign. It creates an action placeholder and an advantage placeholder and adds them to the managed_placeholders of the policy.

Parameters: policy – A tianshou.core.policy to be optimized.
Returns: A scalar float Tensor of the loss.
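
A minimal NumPy sketch of this surrogate, with assumed argument names and numbers rather than the placeholders created by REINFORCE:

    import numpy as np

    def reinforce_loss(log_prob, advantage):
        """Advantage-weighted negative log-likelihood (illustrative only)."""
        # Minimizing -E[log pi(a_t|s_t) * A_t] ascends the policy gradient objective.
        return -np.mean(log_prob * advantage)

    loss = reinforce_loss(np.array([-1.0, -0.5]), np.array([2.0, -1.0]))
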
tianshou.core.losses.value_mse(value_function)[source]

Builds the graph of the L2 loss on value functions, e.g., for training critics or DQN. It creates a placeholder for the target value and adds it to the managed_placeholders of the value_function.

Parameters: value_function – A tianshou.core.value_function to be optimized.
Returns: A scalar float Tensor of the loss.
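
The loss itself is a plain mean squared error; a minimal NumPy sketch with assumed argument names (the library instead reads predictions and targets from the graph and its placeholder):

    import numpy as np

    def value_mse_loss(predicted_value, target_value):
        """Mean squared error between value predictions and targets (illustrative only)."""
        return np.mean((predicted_value - target_value) ** 2)

    loss = value_mse_loss(np.array([1.0, 2.5, 0.0]), np.array([1.5, 2.0, 0.0]))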

tianshou.core.opt

tianshou.core.opt.DPG(policy, action_value)[source]

Constructs the gradient Tensors of the deterministic policy gradient.

Parameters:
  • policy – A tianshou.core.policy.Deterministic to be optimized.
  • action_value – A tianshou.core.value_function.ActionValue to guide the optimization of the policy.
Returns:

A list of (gradient, variable) pairs.
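
The returned pairs correspond to the deterministic policy gradient \(\nabla_a Q(s, a)\big|_{a=\mu_\theta(s)} \nabla_\theta \mu_\theta(s)\). The following NumPy sketch works this chain rule out for an assumed toy linear policy and a hand-picked quadratic \(Q\); it illustrates the math only, not how DPG builds its graph:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(2, 3))      # parameters of a toy linear policy mu(s) = W s
    s = rng.normal(size=3)           # a single observation
    a_star = np.array([1.0, -1.0])   # maximizer of the toy Q(s, a) = -||a - a_star||^2

    a = W @ s                        # deterministic action a = mu(s)
    dq_da = -2.0 * (a - a_star)      # gradient of Q w.r.t. the action
    # Chain rule through the linear policy: dJ/dW_ij = (dQ/da_i) * s_j,
    # i.e. an outer product, giving one (gradient, variable) pair.
    grad_W = np.outer(dq_da, s)
    grad_and_vars = [(grad_W, W)]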