Cross Entropy Method


This post introduces a parallelized implementation of the cross entropy method (CEM). CEM is often recommended as a first step before using a more complex method like reinforcement learning. The source code is available on GitHub.

CEM optimizes parameters by:

  • sampling parameters from a distribution
  • evaluating parameters using total episode reward
  • selecting the elite parameters
  • refitting the sampling distribution using the elite parameters
  • repeating the steps above
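
The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the repository: the `cem` function, its arguments, and the toy quadratic objective (standing in for total episode reward) are all assumptions made for the example.

```python
import numpy as np

def cem(objective, dim, epochs=20, batch_size=200, elite_frac=0.2, seed=0):
    """Toy CEM loop with an independent normal per parameter (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    means, stds = np.zeros(dim), np.ones(dim)
    num_elite = int(batch_size * elite_frac)
    for _ in range(epochs):
        # 1. sample a batch of parameter vectors
        thetas = rng.normal(means, stds, size=(batch_size, dim))
        # 2. evaluate each one (here: a cheap function instead of an episode)
        rewards = np.array([objective(theta) for theta in thetas])
        # 3. keep the highest-reward samples
        elite = thetas[np.argsort(rewards)[-num_elite:]]
        # 4. refit the sampling distribution to the elite
        means, stds = elite.mean(axis=0), elite.std(axis=0)
    return means

# Toy objective: the reward is highest when theta equals the target.
target = np.array([0.5, -1.0, 0.25])
best = cem(lambda theta: -np.sum((theta - target) ** 2), dim=3)
```

In a reinforcement learning setting the objective would instead run a policy parameterized by `theta` for one episode and return the total reward.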

At each iteration the sampling distribution is refit using the mean and standard deviation of the elite population, and new parameters are then sampled from it:

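thetas = np.random.multivariate_normal(
	mean=means,  # assumed name: per-parameter means of the elite population
	cov=np.diag(np.array(stds**2) + extra_cov),
	size=batch_size)  # assumed name: number of parameter vectors to sample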
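
To make the refit step concrete, here is a small self-contained sketch. The helper name `refit_and_sample` and its arguments are hypothetical. I am also assuming that `extra_cov` is extra variance added to the diagonal so that sampling keeps exploring once the elite population becomes nearly identical; the snippet above does not say how the library sets it.

```python
import numpy as np

def refit_and_sample(elite_thetas, batch_size, extra_cov=0.0, rng=None):
    """Refit a diagonal Gaussian to the elite and draw a new batch (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    means = elite_thetas.mean(axis=0)
    stds = elite_thetas.std(axis=0)
    # Diagonal covariance: each parameter is sampled independently.
    cov = np.diag(stds ** 2 + extra_cov)
    return rng.multivariate_normal(mean=means, cov=cov, size=batch_size)

# A fully collapsed elite population: every row is the same vector, so stds == 0.
elite = np.tile([1.0, -1.0], (10, 1))
samples = refit_and_sample(elite, 1000, extra_cov=0.25, rng=np.random.default_rng(0))
```

With `extra_cov=0.0` every sample here would equal the elite mean. The added variance is what keeps the new batch spread out.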

The advantages of CEM are:

  • simple
  • gradient free
  • stable across random seeds
  • easily parallelizable

The disadvantages are:

  • only learns from entire episode trajectories (not individual actions)
  • struggles with long horizon problems
  • open loop planning only - can be suboptimal in stochastic environments


Results for the gym environments CartPole-v0 and Pendulum-v0. The standard deviation of the rewards shows how the elite population eventually becomes homogeneous.


$ python cartpole --num_process 6 --epochs 8 --batch_size 4096


$ python pendulum --num_process 6 --epochs 15 --batch_size 4096

Features of the library

Parallelism over multiple processes is achieved using Python’s multiprocessing library:

from multiprocessing import Pool
from functools import partial

#  need partial to send a fixed parameter into evaluate_theta()
with Pool(num_process) as p:
	rewards = p.map(partial(evaluate_theta, env_id=env_id), thetas)

Efficient sorting of parameters after evaluation in the environment is done using a binary heap:

import heapq

def get_elite_indicies(num_elite, rewards):
    return heapq.nlargest(num_elite, range(len(rewards)), rewards.take)
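
As a quick check of what this selection returns, here is a small sketch on made-up rewards. The function name `top_k_indices` and the reward values are assumptions for illustration. Because the key function is `rewards.take`, the rewards need to be a NumPy array rather than a plain list, so the sketch converts them first. `np.argpartition` is shown only as an alternative for comparison; it is not what the library uses.

```python
import heapq
import numpy as np

def top_k_indices(num_elite, rewards):
    """Indices of the num_elite largest rewards, best first (hypothetical helper)."""
    rewards = np.asarray(rewards)  # .take is a NumPy method, so lists are converted
    return heapq.nlargest(num_elite, range(len(rewards)), key=rewards.take)

rewards = [1.0, 9.5, 3.2, 7.1, 0.4]
elite = top_k_indices(2, rewards)
print(elite)  # [1, 3]

# Alternative: argpartition finds the same indices, but in no guaranteed order.
alt = np.argpartition(np.asarray(rewards), -2)[-2:]
```

The heap-based selection runs in roughly O(n log k) time for n rewards and k elites, which avoids a full sort when the elite fraction is small.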

Thanks for reading!