tianshou.core.losses
tianshou.core.losses.ppo_clip(policy, clip_param)[source]

Builds the graph of the clipped loss \(L^{CLIP}\) as in the PPO paper, which is basically \(-\min(r_t(\theta)A_t, \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t)\). We minimize the objective instead of maximizing it, hence the leading negative sign. It creates an action placeholder and an advantage placeholder and adds them into the managed_placeholders of the policy.

Parameters:
- policy – A tianshou.core.policy to be optimized.
- clip_param – A float or Tensor of type float. The \(\epsilon\) in the loss equation.

Returns: A scalar float Tensor of the loss.
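For orientation, the following is a minimal TensorFlow 1.x sketch of the clipped surrogate itself. The placeholder names and shapes are assumptions for illustration only; in particular, the log-probabilities would normally come from the policy network rather than from placeholders, and this is not the graph that ppo_clip actually builds.

    import tensorflow as tf

    # Illustrative stand-ins for the placeholders ppo_clip registers on the policy
    # and for the policy's new/old log-probabilities of the sampled actions.
    advantage_ph = tf.placeholder(tf.float32, shape=[None], name="advantage")
    log_prob_new = tf.placeholder(tf.float32, shape=[None], name="log_prob_new")  # log pi_theta(a|s)
    log_prob_old = tf.placeholder(tf.float32, shape=[None], name="log_prob_old")  # log pi_theta_old(a|s)
    clip_param = 0.2  # the epsilon in the clipped objective

    ratio = tf.exp(log_prob_new - log_prob_old)                      # r_t(theta)
    clipped = tf.clip_by_value(ratio, 1.0 - clip_param, 1.0 + clip_param)
    surrogate = tf.minimum(ratio * advantage_ph, clipped * advantage_ph)
    loss = -tf.reduce_mean(surrogate)  # negated so an optimizer can minimize it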
tianshou.core.losses.REINFORCE(policy)[source]

Builds the graph of the loss function as used in vanilla policy gradient algorithms, i.e., REINFORCE. The loss is basically \(-\log \pi(a|s) A_t\). We minimize the objective instead of maximizing it, hence the leading negative sign. It creates an action placeholder and an advantage placeholder and adds them into the managed_placeholders of the policy.

Parameters:
- policy – A tianshou.core.policy to be optimized.

Returns: A scalar float Tensor of the loss.
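A similar sketch of the REINFORCE objective, again with illustrative placeholders standing in for the tensors the function actually creates and for the policy's log-probabilities:

    import tensorflow as tf

    # Illustrative stand-ins; REINFORCE creates equivalents and registers them
    # on the policy's managed_placeholders.
    advantage_ph = tf.placeholder(tf.float32, shape=[None], name="advantage")
    log_prob = tf.placeholder(tf.float32, shape=[None], name="log_prob")  # log pi(a|s)

    # Negative of the policy-gradient objective, averaged over the batch,
    # so that minimizing this loss ascends the expected return.
    loss = -tf.reduce_mean(log_prob * advantage_ph)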
tianshou.core.losses.value_mse(value_function)[source]

Builds the graph of the L2 loss on value functions for, e.g., training critics or DQN. It creates a placeholder for the target value and adds it into the managed_placeholders of the value_function.

Parameters:
- value_function – A tianshou.core.value_function to be optimized.

Returns: A scalar float Tensor of the loss.
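A minimal sketch of the L2 objective, assuming (for illustration) that a placeholder stands in for the value function's output tensor alongside the target-value placeholder that value_mse would create:

    import tensorflow as tf

    # value_pred stands in for the value function's predicted values;
    # return_ph mirrors the target-value placeholder registered by value_mse.
    value_pred = tf.placeholder(tf.float32, shape=[None], name="value_pred")
    return_ph = tf.placeholder(tf.float32, shape=[None], name="return_target")

    # Mean squared error between predicted values and targets.
    loss = tf.reduce_mean(tf.square(value_pred - return_ph))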
tianshou.core.opt
tianshou.core.opt.DPG(policy, action_value)[source]

Constructs the gradient Tensor of the deterministic policy gradient.

Parameters:
- policy – A tianshou.core.policy.Deterministic to be optimized.
- action_value – A tianshou.core.value_function.ActionValue to guide the optimization of policy.

Returns: A list of (gradient, variable) pairs.
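A toy sketch of how such (gradient, variable) pairs can be formed in TensorFlow 1.x. The small dense policy and critic below are invented for illustration; the real objects are a tianshou.core.policy.Deterministic and a tianshou.core.value_function.ActionValue, and the sign convention of the gradients returned by DPG may differ from this sketch.

    import tensorflow as tf

    # Toy deterministic policy a = mu(s) and critic Q(s, a), built only for this sketch.
    state_ph = tf.placeholder(tf.float32, shape=[None, 4], name="state")
    with tf.variable_scope("policy"):
        action = tf.layers.dense(state_ph, 2, name="mu")
    with tf.variable_scope("critic"):
        q_value = tf.layers.dense(tf.concat([state_ph, action], axis=1), 1)

    policy_vars = tf.trainable_variables(scope="policy")
    # Deterministic policy gradient: differentiate Q(s, mu(s)) w.r.t. the policy
    # parameters through the action. Negated here so the pairs could be passed
    # to Optimizer.apply_gradients, which minimizes.
    grads = tf.gradients(-tf.reduce_mean(q_value), policy_vars)
    grads_and_vars = list(zip(grads, policy_vars))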