The sample efficiency of the RL algorithm, even for simple games, is not very go...

The sample efficiency of the RL algorithm, even for simple games, is not very good. This usually means that we will need a lot of episodes for the policy to learn to excel. Being able to run policy in an environment that can parallel and accelerate could be very helpful for the improvement - for example running a batch of browsers or tabs simultaneously :)