Sampling from Many Discrete Outcomes: the Alias Method

Edward Newell

For fast on-demand sampling from large categorical distributions, pip install categorical, and check out the API examples below and the benchmark against numpy.

Let’s suppose that you want to randomly draw one of \(k\) outcomes, according to probabilities \(p_1\), \(p_2\), ... \(p_k\). No problem, just divide up the interval \([0,1]\) into \(k\) pieces of corresponding sizes, then draw a sample from \(U(0,1)\) and see which interval it lands in.

... but what if there are a million possible outcomes?

The Simple Way

You’ve probably done this many times: you’re writing a program that needs stochasticity—with some probability \(p_A\), event \(A\) should occur, otherwise \(B\) should occur. You sample a pseudorandom number from \([0,1]\), and if that number is smaller than \(p_A\) you execute event \(A\), and otherwise \(B\). And if you’ve got three outcomes, or five or twenty, this generalizes nicely: you just divide up the interval \([0,1]\) into subintervals, or “bins”, whose sizes correspond to the desired probabilities, sample your pseudorandom number, and then test which bin it falls in.

But the catch is that this requires scanning through the interval to see which bin the pseudorandom number fell into. You end up with something like this:


import random

# here p_1 < p_2 < ... are the cumulative probabilities, i.e. the bins' right endpoints
picker = random.random()	# a pseudorandom number in [0, 1)
if picker < p_1:
	outcome = 1
elif picker < p_2:
	outcome = 2
elif picker < p_3:
	outcome = 3
...
else:
	outcome = k

You could write this more compactly in a loop (a sketch is just below), but that doesn’t change the fact that you’ll need to do on the order of \(k\) tests. If you get lucky, and most of the interval is covered by a small number of outcomes, you might do better than \(O(k)\), but that only works for distributions having particular properties. And if \(k\) is large, and you need to draw a lot of samples, this isn’t good!
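
For concreteness, here is roughly what that loop looks like, treating the running sum of the probabilities as the bins’ right endpoints. This is just a sketch, and the function name is my own, not anything from a particular library:

import random

def sample_linear_scan(probabilities):
	# Draw one outcome index by scanning the bins of [0, 1]: O(k) per sample.
	picker = random.random()	# a pseudorandom number in [0, 1)
	cumulative = 0.0
	for outcome, p in enumerate(probabilities):
		cumulative += p		# right endpoint of this outcome's bin
		if picker < cumulative:
			return outcome
	return len(probabilities) - 1	# guard against floating-point round-off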

I ran into the need to draw samples from a categorical distribution with large support when I was implementing word2vec. In the training algorithm, you need to sample an English word randomly according to the frequency with which words appear in a large corpus. Depending on how you count, there are hundreds of thousands of English words, or millions. Repeatedly testing so many interval endpoints every time you sample a word is just too slow. But there is a better way.

A Better Way: the Alias Method

It turns out that you can draw a sample according to your target categorical distribution in constant time. The approach builds off the fact that, if you had wanted to choose one of \(k\) equally likely possibilities, then sampling in constant time would be simple. (Equivalently, you could call random.randint(), but for illustrative purposes, I’m assuming we’re building off a pseudorandom primitive that gives uniform samples from \([0,1]\).)


outcome = int(random.random() * k)	# a uniform pick from {0, 1, ..., k-1}

The trick in the alias method is to convert our sampling problem, in which outcomes have varied probabilities, into something more like sampling from a uniform distribution. To do that, we’re going to create alias-outcomes, each of which actually stands for a combination of two of our original outcomes. Then, we’ll have a way of figuring out which of our original outcomes should be produced from the alias-outcome in such a way that samples are generated according to the original distribution.

Setting up Alias-Outcomes

First, notice that, if our original distribution had been uniform, then each outcome would have probability \(1/k\). Since our distribution isn’t uniform, it means that some outcomes have more than \(1/k\) probability, and some have less. With that in mind, we can imagine constructing alias-outcomes by combining one of the outcomes having less than \(1/k\) with a small “donation” of some probability mass from one of the higher-probability outcomes, topping up the total probability for that alias outcome to \(1/k\). Then we use the leftover mass from the higher-probability outcome in constructing another alias-outcome. We can obtain a set of \(k\) alias-outcomes in this way, each of which was made by a donation of mass from one or two of our original outcomes (Figure 1):

Figure 1. The mass from the original outcomes (\(o_1\) through \(o_4\)) is allocated to alias-outcomes
(\(a_1\) through \(a_4\)), in such a way that each alias-outcome has a total probability of \(1/k\), and each has probability mass coming from just one or two of our original outcomes.

When building the alias-outcomes, we need to keep track of which original outcomes contribute to which alias-outcomes, and how much. We can store this information in arrays of length \(k\), and it takes \(O(k)\) time to set up the arrays. But we only need to do that once, and then sampling can be done quickly as follows. First, we sample one of the alias-outcomes uniformly at random, which is fast precisely because they are uniformly distributed. Then, we look up which original outcomes were used to make that alias, and how much each contributed. Since at most two original outcomes make up each alias, we can choose between them using the “simple way” from the opening section. So, sampling from the original distribution can be done in constant time, by drawing two samples from \(U(0,1)\). A sketch of both steps follows.
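
To make that bookkeeping concrete, here is a minimal sketch of both steps: an \(O(k)\) setup that builds two length-\(k\) arrays, and an \(O(1)\) sampling routine that makes two draws from \(U(0,1)\). This illustrates the technique only; the function and variable names are mine, and this is not how the categorical package is necessarily implemented internally:

import random

def build_alias_tables(probabilities):
	# O(k) setup. For each alias-outcome i, threshold[i] says how much of its
	# 1/k slot belongs to original outcome i, and alias[i] names the outcome
	# that donated the rest of the slot.
	k = len(probabilities)
	scaled = [p * k for p in probabilities]	# rescale so each slot holds 1.0
	threshold = [1.0] * k
	alias = list(range(k))
	small = [i for i, s in enumerate(scaled) if s < 1.0]
	large = [i for i, s in enumerate(scaled) if s >= 1.0]
	while small and large:
		lo = small.pop()		# needs a donation to fill its slot
		hi = large.pop()		# has mass to spare
		threshold[lo] = scaled[lo]
		alias[lo] = hi
		scaled[hi] -= 1.0 - scaled[lo]	# remove the donated mass from hi
		(small if scaled[hi] < 1.0 else large).append(hi)
	return threshold, alias			# leftovers keep threshold 1.0 (no alias needed)

def alias_sample(threshold, alias):
	# O(1) per sample: pick an alias-outcome uniformly, then decide which of
	# its (at most two) original outcomes to report.
	k = len(threshold)
	i = int(random.random() * k)
	return i if random.random() < threshold[i] else alias[i]

As a quick sanity check, building the tables for, say, [0.5, 0.3, 0.15, 0.05] and tallying a few hundred thousand calls to alias_sample should reproduce those proportions to within sampling noise.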

I find myself having to sample from categorical distributions fairly often. Numpy supports that with numpy.random.choice(), but depending on your use case, it can be slow. The problem with numpy.random.choice() is that it seems to rebuild the sampler on every call (although I don’t know how it’s implemented). If you know ahead of time that you want, say, one million samples, then you can request them all at once, and this is fast. But if you can’t or don’t want to get your samples ahead of time, then numpy’s implementation is quite slow. Clearly there is an efficient sampler under the hood, so it seems like just a lapse in the API that the underlying sampler isn’t exposed to the user.

Because of this, I packaged up an implementation called categorical that uses the alias method and provides the user with a sampler instance that can produce samples on demand faster than individual calls to numpy.random.choice(). If you know ahead of time how many samples you need, numpy’s implementation will serve you better. But if you don’t want to or can’t generate all the samples ahead of time, then my implementation will give you fast on-demand samples. You can include categorical in your projects with a simple pip install; see below.

categorical Installation and Usage

Install from pip: pip install categorical

Let’s generate a probability distribution to get us started. First, sample a bunch of random numbers to determine probability “scores”.


>>> from random import random
>>> k = 10**6
>>> scores = [random() for i in range(k)]
>>> total = sum(scores)
>>> probabilities = [s / total for s in scores]

We’ve normalized the scores to sum to 1, i.e. made them into proper probabilities, but the categorical sampler will actually do that for us, so it isn’t necessary:

>>> from categorical import Categorical as C
>>> my_sampler = C(scores)
>>> print my_sampler.sample()
487702

The value 487702 above means that the sampler yielded the outcome at index 487702, i.e. the 487703rd outcome as listed in the probability scores provided to the constructor (outcomes are zero-indexed, of course). Since the scores we passed in weren’t normalized, we can ask the sampler what the probability of sampling that outcome was:

>>> my_sampler.get_probability(487702)
5.3028349767480922e-07

Pretty small indeed!

As a bonus, you can provide a shape parameter when calling sample and get back a numpy array of the specified shape, filled with independent samples:


>>> my_sampler.sample((3,4), dtype='int32')
array([[719441, 321772, 741180,  93447],
       [652309, 472908,  40968, 422349],
       [993621, 652970, 720311,  79800]])

Benchmarking Against numpy

Comparing to numpy, assuming we draw 1000 samples individually:


>>> from numpy.random import choice
>>> import time
>>>
>>> def time_numpy():
...     start = time.time()
...     for i in range(1000):
...         choice(k, p=probabilities)
...     print time.time() - start
...
>>> def time_my_alias():
...     start = time.time()
...     for i in range(1000):
...         my_sampler.sample()
...     print time.time() - start
...
>>> time_numpy()
31.0555009842
>>> time_my_alias()
0.0127031803131

If instead you draw your samples upfront, numpy might be faster, depending on how many samples you want. At around 10,000 samples taken in one batch, numpy starts to pull ahead:


>>> def time_numpy():
...     start = time.time()
...     choice(k, size=(10000,), p=probabilities)
...     print time.time() - start
...
>>> def time_my_alias():
...     start = time.time()
...     my_sampler.sample(shape=(10000,))
...     print time.time() - start
...
>>> time_numpy()
0.107708930969
>>> time_my_alias()
0.117538928986

So, if you are drawing more than 10,000 samples at a time, numpy will be faster, whereas if you are sampling fewer than that (especially if you do so repeatedly), then categorical will be faster.