Inverse Transform Sampling in Python
When doing data work, we often need to sample random variables. This is easy to do if one wishes to sample from a Gaussian, a uniform random variable, or a variety of other common distributions, but what if we want to sample from an arbitrary distribution? There is no obvious way to do this within `scipy.stats`. So, I built a small library, `inverse-transform-sample`, that allows for sampling from arbitrary user-provided distributions. In use, it looks like this:
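Below is a minimal sketch of that usage. The `sample` function and its `chebyshev=True` option are discussed later in this post; the module name `itsample` and the keyword names `lower_bd` / `upper_bd` are assumptions for illustration.

```python
# A minimal usage sketch. The `sample` function and its `chebyshev=True`
# option are discussed below; the module name `itsample` and the keyword
# names `lower_bd` / `upper_bd` are assumptions for illustration.
import numpy as np
from itsample import sample

# Any nonnegative PDF will do; here, a unit normal.
pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

samples = sample(pdf, 5000)  # quadrature-based approach (the default)
cheb_samples = sample(pdf, 5000, lower_bd=-10, upper_bd=10, chebyshev=True)
```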
The code is available on GitHub. In this post, I’ll outline the theory of inverse transform sampling, discuss computational details, and outline some of the challenges faced in implementation.
Introduction to Inverse Transform Sampling
Suppose we have a probability density function \(p(x)\) with an associated cumulative distribution function (CDF) \(F(x)\), defined as usual by

\[F(x) = \int_{-\infty}^x p(s)\,ds.\]

Recall that the CDF \(F(x)\) tells us the probability that a random sample from \(p\) is less than or equal to \(x\).
Let’s take a second to notice something here. If we knew, for some \(x\), that \(F(x)=t\), then drawing \(x\) from \(p\) is in some way equivalent to drawing \(t\) from a uniform random variable on \([0,1]\), since the CDF for a uniform random variable is \(F_u(t) = t\).
That realization is the basis for inverse transform sampling. The procedure is:
- Draw a sample \(t\) uniformly from the interval \([0,1]\).
- Solve the equation \(F(x)=t\) for \(x\) (invert the CDF).
- Return the resulting \(x\) as the sample from \(p\).
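To make the procedure concrete, consider a case where step 2 can be done by hand: for an exponential distribution with rate \(\lambda\), the CDF is \(F(x) = 1 - e^{-\lambda x}\), which inverts to \(x = -\ln(1-t)/\lambda\). A quick sketch of the idea (not part of the library, just a sanity check):

```python
# Inverse transform sampling where the CDF inverts in closed form: for an
# exponential distribution, F(x) = 1 - exp(-lam * x), so x = -ln(1 - t) / lam.
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
t = rng.uniform(0.0, 1.0, size=10_000)  # step 1: uniform draws on [0, 1]
x = -np.log(1.0 - t) / lam              # step 2: invert the CDF
print(x.mean())                         # step 3: x are the samples; mean ~ 1/lam = 0.5
```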
Computational Considerations
Most of the computational work done in the above algorithm comes in at step 2, in which the CDF is inverted. Consider Newton’s method, a typical routine for finding numerical solutions to equations: the approach is iterative, and so the function to be inverted, in our case the CDF \(F(x)\), is evaluated many times. Now, in our case, since \(F\) is a (numerically computed) integral of \(p\), this means that we will have to run our numerical quadrature routine once for each evaluation of \(F\). Since we need many evaluations of \(F\) for a single sample, this can lead to a significant slowdown in sampling.
Again, the pain point here is that our CDF \(F(x)\) is slow to evaluate, because each evaluation requires numerical quadrature. What we need is an approximation of the CDF that is fast to evaluate, as well as accurate.
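To see the cost structure concretely, here is a sketch of the quadrature-based pipeline described above. This is not the library’s exact code; in particular, I use `brentq` here as a stand-in root-finder, whereas the library uses `scipy.optimize.root`.

```python
# Each CDF evaluation runs a full numerical quadrature, and each sample
# requires a root-find that evaluates the CDF many times.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # unit normal

def cdf(x):
    return quad(pdf, -np.inf, x)[0]  # one quadrature per evaluation

def sample_one(rng, lower=-10.0, upper=10.0):
    t = rng.uniform()
    # Invert F(x) = t; the root-finder calls cdf() repeatedly.
    return brentq(lambda x: cdf(x) - t, lower, upper)

rng = np.random.default_rng(0)
print(sample_one(rng))
```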
Chebyshev Approximation of the CDF
I snooped around on the internet a bit, and found this feature request for `scipy`, which is related to this same issue. Although it never got off the ground, I found an interesting link to a 2013 paper by Olver & Townsend, in which they suggest using Chebyshev polynomials to approximate the PDF. The advantage of this approach is that the integral of a series of Chebyshev polynomials is known analytically: that is, if we know the Chebyshev expansion of the PDF, we automatically know the Chebyshev expansion of the CDF as well. This should allow us to rapidly invert the (Chebyshev approximation of the) CDF, and thus sample from the distribution efficiently.
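That key fact is easy to demonstrate with numpy’s Chebyshev utilities. The sketch below is my own illustration of the idea, not the library’s code: fit the PDF on Chebyshev nodes, integrate the series termwise, and the result is a CDF approximant that is cheap to evaluate.

```python
# Integrating a Chebyshev series yields another Chebyshev series with
# analytically known coefficients: a fitted PDF gives us the CDF for free.
import numpy as np
from numpy.polynomial import chebyshev as C

pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

lower, upper = -10.0, 10.0
deg = 64

# Fit the PDF at Chebyshev nodes mapped from [-1, 1] to [lower, upper].
nodes = C.chebpts1(deg + 1)
x = 0.5 * (upper - lower) * (nodes + 1) + lower
coeffs = C.chebfit(nodes, pdf(x), deg)

# Integrate the series termwise; the factor accounts for the change of variables.
cdf_coeffs = C.chebint(coeffs) * 0.5 * (upper - lower)

def cdf(x):
    s = 2 * (x - lower) / (upper - lower) - 1  # map back to [-1, 1]
    return C.chebval(s, cdf_coeffs) - C.chebval(-1.0, cdf_coeffs)

print(cdf(0.0))  # should be close to 0.5 for the unit normal
```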
Other Approaches
There are also less mathematically sophisticated approaches that immediately present themselves. One might consider solving \(F(x)=t\) on a grid of \(t\) values, and then building the function \(F^{-1}(t)\) by interpolation. One could even simply transform the provided PDF into a histogram, and then use the functionality built in to `scipy.stats` for sampling from a provided histogram (more on that later). However, due to time constraints, `inverse-transform-sample` only includes the numerical quadrature and Chebyshev approaches.
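For completeness, here is roughly what the histogram alternative would look like using `scipy.stats.rv_histogram`. This is a sketch of the approach, not something `inverse-transform-sample` implements, and its accuracy is limited by the bin width.

```python
# The histogram-based alternative: tabulate the PDF on a grid and hand the
# result to scipy.stats.rv_histogram, which normalizes it and samples from it.
import numpy as np
from scipy import stats

pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

edges = np.linspace(-10, 10, 513)
centers = 0.5 * (edges[:-1] + edges[1:])
hist = pdf(centers)

dist = stats.rv_histogram((hist, edges))
samples = dist.rvs(size=5000, random_state=0)
print(samples.mean(), samples.std())  # should be close to 0 and 1
```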
Implementation in Python
The implementation of this approach is not horribly sophisticated, but in exchange it exhibits that wonderful readability characteristic of Python code. The complexity is highest in the methods implementing the Chebyshev-based approach; those without a background in numerical analysis may wonder, for example, why the function is evaluated on that particularly strange set of nodes.
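Those nodes are the Chebyshev points, which cluster toward the endpoints of the interval; interpolating at them, rather than on an equispaced grid, is what keeps high-degree polynomial fits well-behaved. For the curious:

```python
# Chebyshev points of the first kind on [-1, 1]: note the clustering near
# the endpoints, which is what makes high-degree interpolation stable.
import numpy as np

n = 9
k = np.arange(n)
nodes = np.cos((2 * k + 1) * np.pi / (2 * n))
print(np.sort(nodes))
```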
In the quadrature-based approach, both the numerical quadrature and the root-finding are done via the `scipy` library (`scipy.integrate.quad` and `scipy.optimize.root`, respectively). When using this approach, one can set the boundaries of the PDF to be infinite, as `scipy.integrate.quad` supports improper integrals. In the notebook of examples, we show that the samples generated by this approach do, at least in the eyeball norm, conform to the provided PDF. As expected, this approach is slow: it takes about 7 seconds to generate 5,000 samples from a unit normal.
As with the quadrature and root-finding, pre-rolled functionality from `scipy` was used to both compute and evaluate the Chebyshev approximants. When approximating a PDF using Chebyshev polynomials, finite bounds must be provided. A user-specified tolerance determines the order of the Chebyshev approximation; however, rather than computing a true error, we simply use the size of the last few Chebyshev coefficients as a proxy. Since this approach differs from the previous one only in the way the CDF is constructed, we use the same function `sample` for both approaches; an option `chebyshev=True` will generate a Chebyshev approximant of the CDF, rather than using numerical quadrature.
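That stopping rule might look something like the sketch below. The function name and the exact rule (the trailing three coefficients, relative to the largest) are my assumptions for illustration, not the library’s code.

```python
# A sketch of a coefficient-based stopping rule: grow the Chebyshev degree
# until the trailing coefficients are small relative to the largest one.
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_coeffs(pdf, lower, upper, tol=1e-10, max_deg=1024):
    deg = 16
    while deg <= max_deg:
        nodes = C.chebpts1(deg + 1)
        x = 0.5 * (upper - lower) * (nodes + 1) + lower
        coeffs = C.chebfit(nodes, pdf(x), deg)
        # Use the size of the last few coefficients as an error proxy.
        if np.max(np.abs(coeffs[-3:])) < tol * np.max(np.abs(coeffs)):
            return coeffs
        deg *= 2
    raise RuntimeError("tolerance not reached; PDF may be non-smooth")
```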
I hoped that the Chebyshev approach would improve on this by an order of magnitude or two; however, my hopes were thwarted. The implementation of the Chebyshev approach is faster by perhaps a factor of 2 or 3, but does not offer the kind of improvement I had hoped for. What happened? In testing, a single evaluation of the Chebyshev CDF was not much faster than a single evaluation of the quadrature CDF. The advantage of the Chebyshev CDF comes when one wishes to evaluate a long, vectorized set of inputs; in this case, the Chebyshev CDF is orders of magnitude faster than quadrature. But `scipy.optimize.root` does not appear to take advantage of vectorization, which makes sense: in simple iteration schemes, the value at which the next iteration occurs depends on the outcome of the current iteration, so there is not a simple way to vectorize the algorithm.
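One workaround, should someone want to pursue it, is to sidestep `scipy.optimize.root` entirely and invert all samples at once with a vectorized bisection; since the CDF is monotone, this is straightforward. A sketch of the idea (not something the library currently does):

```python
# Vectorized bisection: invert F(x) = t for all samples simultaneously, so
# the (fast, vectorized) Chebyshev CDF is evaluated only ~50 times in total
# rather than dozens of times per individual sample.
import numpy as np

def invert_cdf_vectorized(cdf, t, lower, upper, iters=50):
    # cdf must accept and return numpy arrays and be monotone on [lower, upper].
    lo = np.full_like(t, lower)
    hi = np.full_like(t, upper)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        too_low = cdf(mid) < t  # one vectorized CDF call per iteration
        lo = np.where(too_low, mid, lo)
        hi = np.where(too_low, hi, mid)
    return 0.5 * (lo + hi)
```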
Conclusion
I suspect that the reason this feature is absent from large-scale libraries like `scipy` and `numpy` is that it is difficult to build a sampler that is both fast and accurate over a large enough class of PDFs. My approach sacrifices speed; other approximation schemes may be very fast, but may not provide the accuracy guarantees needed by some users.
What we’re left with is a library that is useful for generating small numbers (less than 100,000) of samples. It’s worth noting that Olver & Townsend seem to be able to use the Chebyshev approach to sample orders of magnitude faster than my implementation, but sadly their Matlab code is nowhere to be found in the Matlab library `chebfun`, which is the location advertised in their work. Presumably they implemented their own root-finder, or Chebyshev approximation scheme, or both. There’s a lot of space for improvement here, but I simply ran out of time and energy on this one; if you feel inspired, fork the repo and submit a pull request!