# Cross Entropy Method

This post introduces a parallelized implementation of the cross entropy method (CEM). CEM is often recommended as a first step before using a more complex method like reinforcement learning. The source code is available on GitHub.

CEM optimizes parameters by:

- sampling parameters from a distribution
- evaluating the parameters using total episode reward
- selecting the elite parameters
- refitting the sampling distribution using the elite parameters
- repeating until the sampling distribution converges (a full loop sketch follows the snippet below)

Each iteration, the sampling distribution is refit using the mean and standard deviation of the elite population, then sampled from to generate the next batch of parameters:

```
thetas = np.random.multivariate_normal(
    mean=means,
    cov=np.diag(np.array(stds**2) + extra_cov),  # extra_cov keeps exploration noise in the covariance
    size=batch_size
)
```
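
Putting the steps together, here is a minimal, self-contained sketch of the CEM loop on a toy objective (the objective, hyperparameters, and variable names are illustrative assumptions, not the library's exact code):

```
import numpy as np

# toy objective standing in for total episode reward
def episode_reward(theta):
    return -np.sum((theta - 1.0) ** 2)

# hypothetical initial distribution and hyperparameters
means, stds = np.zeros(4), np.ones(4)
extra_cov, batch_size, num_elite = 0.5, 64, 8

for epoch in range(10):
    # sample a batch of parameters from the current diagonal Gaussian
    thetas = np.random.multivariate_normal(
        mean=means, cov=np.diag(stds**2 + extra_cov), size=batch_size
    )
    # evaluate every parameter vector on the objective
    rewards = np.array([episode_reward(t) for t in thetas])
    # keep the elite parameters (highest rewards)
    elites = thetas[np.argsort(rewards)[-num_elite:]]
    # refit the sampling distribution to the elite statistics
    means, stds = elites.mean(axis=0), elites.std(axis=0)

print(means)  # drifts toward the optimum at [1, 1, 1, 1]
```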

The advantages of CEM are:

- simple
- gradient free
- stable across random seeds
- easily parallelizable

The disadvantages are:

- learns only from complete episode trajectories (not from individual actions)
- struggles with long horizon problems
- open loop planning only, which can be suboptimal in stochastic environments

## Performance

Results for the gym environments `CartPole-v0` and `Pendulum-v0`. The standard deviation of the rewards shows how the elite population eventually becomes homogeneous.

### Cartpole

```
$ python cem.py cartpole --num_process 6 --epochs 8 --batch_size 4096
```

### Pendulum

```
$ python cem.py pendulum --num_process 6 --epochs 15 --batch_size 4096
```

## Features of the library

Parallelism across multiple processes is achieved using Python's `multiprocessing` library:

```
from functools import partial
from multiprocessing import Pool

# partial is needed to send the fixed env_id parameter into evaluate_theta()
with Pool(num_process) as p:
    rewards = p.map(partial(evaluate_theta, env_id=env_id), thetas)
```
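
The post doesn't show `evaluate_theta` itself; a plausible sketch for `CartPole-v0` (assuming a linear policy and the pre-0.26 `gym` API, where `env.reset()` returns only the observation) might look like:

```
import gym
import numpy as np

def evaluate_theta(theta, env_id):
    # run one full episode with a linear policy, return the total reward
    env = gym.make(env_id)
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        # CartPole: choose action 0 or 1 from the sign of a linear score
        action = int(np.dot(theta, obs) > 0)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward
```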

Efficient selection of the elite parameters after evaluation in the environment is done with a binary heap, via `heapq.nlargest`:

```
import heapq

def get_elite_indicies(num_elite, rewards):
    # indices of the num_elite largest rewards, using rewards.take as the key
    return heapq.nlargest(num_elite, range(len(rewards)), rewards.take)
```
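
Using a heap keeps selection at O(n log k) for k elites rather than the O(n log n) of a full sort. A quick usage example (the reward values are made up):

```
import numpy as np

rewards = np.array([1.0, 5.0, 3.0, 4.0, 2.0])
# the two highest rewards sit at indices 1 and 3
print(get_elite_indicies(2, rewards))  # [1, 3]
```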

Thanks for reading!