tianshou.core.losses¶
tianshou.core.losses.ppo_clip(policy, clip_param)[source]¶ Builds the graph of the clipped loss \(L^{CLIP}\) as in the PPO paper, which is basically \(-\min(r_t(\theta)A_t, \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t)\). We minimize the objective instead of maximizing it, hence the leading negative sign. It creates an action placeholder and an advantage placeholder and adds them into the managed_placeholders of the policy.
Parameters:
- policy – A tianshou.core.policy to be optimized.
- clip_param – A float or Tensor of type float. The \(\epsilon\) in the loss equation.
Returns: A scalar float Tensor of the loss.
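Example (an illustrative sketch, assuming my_policy is an already-constructed tianshou.core.policy and a TensorFlow 1.x graph; the optimizer choice is arbitrary):

    import tensorflow as tf
    import tianshou.core.losses as losses

    # my_policy: a tianshou.core.policy instance, assumed to be built elsewhere
    ppo_loss = losses.ppo_clip(my_policy, clip_param=0.2)  # scalar clipped-surrogate loss Tensor
    # minimize the loss with any TF1 optimizer; Adam is only an illustrative choice
    train_op = tf.train.AdamOptimizer(1e-4).minimize(ppo_loss)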
tianshou.core.losses.REINFORCE(policy)[source]¶ Builds the graph of the loss function used in vanilla policy gradient algorithms, i.e., REINFORCE. The loss is basically \(-\log \pi(a|s) A_t\). We minimize the objective instead of maximizing it, hence the leading negative sign. It creates an action placeholder and an advantage placeholder and adds them into the managed_placeholders of the policy.
Parameters: policy – A tianshou.core.policy to be optimized.
Returns: A scalar float Tensor of the loss.
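Example (an illustrative sketch, assuming my_policy is an already-constructed tianshou.core.policy; feeding the created placeholders at run time is omitted):

    import tensorflow as tf
    import tianshou.core.losses as losses

    # scalar REINFORCE loss Tensor; the action and advantage placeholders it
    # creates are tracked in my_policy.managed_placeholders and must be fed
    # with sampled actions and estimated advantages when train_op is run
    pg_loss = losses.REINFORCE(my_policy)
    train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(pg_loss)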
tianshou.core.losses.value_mse(value_function)[source]¶ Builds the graph of the L2 loss on value functions for, e.g., training critics or DQN. It creates a placeholder for the target value and adds it into the managed_placeholders of the value_function.
Parameters: value_function – A tianshou.core.value_function to be optimized.
Returns: A scalar float Tensor of the loss.
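Example (an illustrative sketch, assuming my_critic is an already-constructed tianshou.core.value_function, e.g. a DQN-style action value):

    import tensorflow as tf
    import tianshou.core.losses as losses

    # scalar L2 regression loss Tensor; the target-value placeholder it creates
    # is tracked in my_critic.managed_placeholders and must be fed with targets
    critic_loss = losses.value_mse(my_critic)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(critic_loss)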
tianshou.core.opt¶
tianshou.core.opt.DPG(policy, action_value)[source]¶ Constructs the gradient Tensors of the deterministic policy gradient.
Parameters:
- policy – A tianshou.core.policy.Deterministic to be optimized.
- action_value – A tianshou.core.value_function.ActionValue to guide the optimization of the policy.
Returns: A list of (gradient, variable) pairs.
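Example (an illustrative sketch, assuming my_policy is an already-constructed tianshou.core.policy.Deterministic and my_action_value an already-constructed tianshou.core.value_function.ActionValue):

    import tensorflow as tf
    import tianshou.core.opt as opt

    # list of (gradient, variable) pairs for the deterministic policy
    dpg_grads_and_vars = opt.DPG(my_policy, my_action_value)
    # apply_gradients consumes (gradient, variable) pairs directly
    train_actor_op = tf.train.AdamOptimizer(1e-4).apply_gradients(dpg_grads_and_vars)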