tianshou.data.data_collector
class tianshou.data.data_collector.DataCollector(env, policy, data_buffer, process_functions, managed_networks)

Bases: object

A utility class to manage the data flow during the interaction between the policy and the environment. It stores data into data_buffer, processes the reward signals, and returns the feed_dict for running the TensorFlow graph. A construction sketch follows the parameter list.

Parameters:

- env – An environment.
- policy – A tianshou.core.policy.
- data_buffer – A tianshou.data.data_buffer.
- process_functions – A list of callables in tianshou.data.advantage_estimation used to process rewards.
- managed_networks – A list of networks from tianshou.core.policy and/or tianshou.core.value_function that this class should manage. This class automatically generates the feed_dict for all the placeholders in the managed_placeholders of every network in this list.
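Below is a minimal wiring sketch for constructing a DataCollector. Only the argument names of DataCollector itself come from the signature above; the gym environment, the no-argument BatchSet constructor, the full_return callable, and the my_policy object are illustrative assumptions rather than verified library calls.

```python
# Hedged construction sketch; constructor signatures of BatchSet and the
# advantage-estimation callable are assumptions, not verified library calls.
import gym
import tianshou as ts

env = gym.make('CartPole-v0')                      # any gym-style environment

# `my_policy` stands for a tianshou.core.policy instance built elsewhere,
# wrapping a TensorFlow network and its placeholders.
data_buffer = ts.data.data_buffer.BatchSet()       # BatchSet is the buffer type named in collect() below
process_functions = [ts.data.advantage_estimation.full_return]   # hypothetical reward-processing callable

collector = ts.data.data_collector.DataCollector(
    env=env,
    policy=my_policy,
    data_buffer=data_buffer,
    process_functions=process_functions,
    managed_networks=[my_policy],                  # placeholders of these networks are fed automatically
)
```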
collect(num_timesteps=0, num_episodes=0, my_feed_dict={}, auto_clear=True, episode_cutoff=None)

Collect data in the environment using self.policy. A usage sketch follows the parameter list.

Parameters:

- num_timesteps – An int specifying the number of timesteps to act. It defaults to 0. Either num_timesteps or num_episodes may be set, but not both.
- num_episodes – An int specifying the number of episodes to act. It defaults to 0. Either num_timesteps or num_episodes may be set, but not both.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies placeholders such as dropout and batch_norm, other than observation and action.
- auto_clear – Optional. A bool defaulting to True. If True, this method clears self.data_buffer when self.data_buffer is an instance of tianshou.data.data_buffer.BatchSet, and does nothing otherwise. If False, this auto-clearing behavior is disabled.
- episode_cutoff – Optional. An int. The maximum number of timesteps in one episode. This is useful when the environment has no terminal states or a single episode could be prohibitively long. If set, all episodes are forced to stop after this number of timesteps.
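A usage sketch for collect(), assuming a collector built as in the construction sketch above. Exactly one of num_timesteps and num_episodes is given; the numeric values are arbitrary examples.

```python
# Roll out 10 full episodes, each cut off after at most 200 timesteps.
collector.collect(num_episodes=10, episode_cutoff=200)

# Alternatively, act for a fixed number of timesteps instead of episodes:
# collector.collect(num_timesteps=1000)
```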
denoise_action(feed_dict, my_feed_dict={})

Recompute the actions of deterministic policies without exploration noise, hence "denoise". It modifies feed_dict in place and has no return value. This is useful in, e.g., DDPG, since the action stored in self.data_buffer is the sampled action with additional exploration noise. A usage sketch follows the parameter list.

Parameters:

- feed_dict – A dict. It has to be the dict returned by next_batch() of this class.
- my_feed_dict – Optional. A dict defaulting to empty. Specifies placeholders such as dropout and batch_norm, other than observation and action.
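A sketch of the pattern described above for a DDPG-style deterministic policy: build the feed_dict with next_batch(), then overwrite the stored noisy actions in place. The batch size is an arbitrary example value.

```python
feed_dict = collector.next_batch(batch_size=64)   # dict mapping placeholders to numpy arrays
collector.denoise_action(feed_dict)               # recomputes actions without exploration noise, in place
```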
next_batch(batch_size, standardize_advantage=True)

Constructs and returns the feed_dict of data to be used with sess.run. A training-loop sketch follows.

Parameters:

- batch_size – An int. The size of one minibatch.
- standardize_advantage – Optional. A bool defaulting to True. If True, this method standardizes advantages whenever advantages are required by the networks. If False, it never standardizes advantages.

Returns: A dict in the format of a conventional TensorFlow feed_dict, with placeholders as keys and numpy arrays as values.
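Putting the pieces together, a hedged outline of one training loop. Here sess and train_op stand for the TensorFlow session and an optimizer op built over the managed networks; neither is provided by DataCollector, and the loop bounds and batch size are arbitrary examples.

```python
for iteration in range(100):
    collector.collect(num_episodes=10)                # gather fresh data with the current policy
    feed_dict = collector.next_batch(batch_size=64)   # feed_dict covering all managed placeholders
    sess.run(train_op, feed_dict=feed_dict)           # ordinary TensorFlow graph execution
```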