A Glimpse at Stein’s Method, in Memory of Charles Stein

Charles Stein was one of the most active and influential statisticians of the past century, with a career spanning seven decades, including over sixty years as a professor in Stanford's Statistics Department. Since his passing nearly a year ago, I have been meaning to write a blog post for the TCS audience describing some of his contributions that are relevant to our community. Rather than attempting to summarize all of the influences that his work has had on the theory community, I'll focus on just one: an approach to proving central limit theorems (and bounds on their rates of convergence) referred to as Stein's Method, which I have used in some of my own work. Among the many different approaches to proving central-limit-style theorems, Stein's Method stands out as one of the most conceptually elegant and versatile. Indeed, Stein's Method lends itself to many settings where we wish to analyze sums of dependent random variables, settings with high-dimensional random variables, and settings where we hope to prove convergence to non-Gaussian distributions.

Recall the setting of the most basic central limit theorem: Suppose X_1,X_2,\ldots are independent, identically distributed random variables, each with mean 0 and variance 1; then \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i converges (in a sense that we will make explicit later) to a standard Gaussian distribution as n goes to infinity.
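As a quick numerical sanity check (a minimal sketch of my own; the summand distribution, the evaluation point, and the sample sizes are arbitrary choices, not anything canonical), one can simulate the normalized sum and compare a probability to its Gaussian value:

```python
# Sketch: X_i = Exp(1) - 1 has mean 0 and variance 1, and the sum of n Exp(1)
# variables can be drawn directly as a Gamma(n,1), so the normalized sum is easy
# to simulate. P(S_n <= 0) should approach Phi(0) = 1/2 as n grows.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
trials = 500_000
for n in [3, 30, 300]:
    s = (rng.gamma(n, 1.0, trials) - n) / np.sqrt(n)   # normalized sum of n centered Exp(1)'s
    print(f"n={n:4d}: P(S_n <= 0) ~ {np.mean(s <= 0):.4f}   Phi(0) = {norm.cdf(0):.4f}")
```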

What is the usual way of bounding the rate at which convergence occurs? One approach, known as the Lindeberg or hybridization method, is to iteratively replace each X_i with a standard Gaussian, and bound the effect that each of these substitutions incurs. Another standard approach is to note that the density of the sum of independent random variables is the convolution of their densities, and hence the Fourier transform of the density of the sum is the product of the Fourier transforms of the individual densities, and then directly analyze the distance to the Fourier transform of the Gaussian (which is itself a Gaussian!). Both of these approaches, however, seem to rely heavily on the independence of the X_i's.

The usual starting place in describing Stein's Method is the fact that, for a standard Gaussian random variable Z and any (differentiable) function f, \mathbb{E}[f'(Z)-Zf(Z)] = 0, where f' denotes the derivative of f. Furthermore, if a random variable satisfies this equation for all functions f, then it is a standard Gaussian! The idea is then to argue that, for any random variable X, the extent to which \mathbb{E}[f'(X)-Xf(X)] is not zero can be related to how far X is from Gaussian. An easy integration will establish the fact that a Gaussian satisfies this equation (and is the only distribution satisfying it). Before continuing further down this path, we will take a step back and see a more geometric (and enlightening) explanation for why \mathbb{E}[f'(Z)-Zf(Z)] = 0. This alternate viewpoint will also provide a more general framing of Stein's Method that gives some indication of why we might expect it to be so adaptable to different settings.
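Before doing so, here is a quick numerical check of the characterization above (a sketch of my own; the choice f(x) = \sin(x) and the centered exponential used for contrast are arbitrary): \mathbb{E}[f'(X)-Xf(X)] is essentially zero for Gaussian samples and visibly nonzero otherwise.

```python
# Sketch: estimate E[f'(X) - X f(X)] for f = sin by Monte Carlo, once with
# standard Gaussian samples and once with centered Exp(1) samples (mean 0,
# variance 1, but not Gaussian). Only the Gaussian makes the expression vanish.
import numpy as np

rng = np.random.default_rng(1)
m = 1_000_000

def stein_discrepancy(x):
    return np.mean(np.cos(x) - x * np.sin(x))   # f(x) = sin(x), f'(x) = cos(x)

z = rng.standard_normal(m)
e = rng.exponential(1.0, m) - 1.0

print("Gaussian:   ", stein_discrepancy(z))   # approximately 0
print("exponential:", stein_discrepancy(e))   # clearly nonzero
```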

At the highest level, Stein’s Method can be thought of as the following general recipe: Suppose you want to show that some distribution, D, is close to some “nice” distribution, G.

  • Find a transformation (a mapping from distributions to distributions), T, for which G is the unique fixed point.
  • Next, show that the amount by which a distribution is changed by applying T is related to the distance between the distribution and G. [As a sanity check, the distance from G to itself is 0, and T(G)=G.]
  • Finally, apply T to the distribution we care about, and watch closely to see how much it changes.

Crucially, the above approach turns the problem of comparing D directly to G, into the problem of comparing D to a transformed version of D, namely T(D). This is one of the main reasons for the versatility of Stein's Method. While it may not be obvious why it is often easier to compare D to T(D) than to directly compare D to G, a conceptual explanation is that comparing D and T(D) lets one keep whatever complicated structure exists in D—for example if D is a sum of dependent random variables, one might not need to disentangle these dependencies in order to make this comparison. Furthermore, as we will see, we actually won't even need to transform D; we can instead simply transform the lens that we are using to view the distribution, namely transforming the set of "test functions" that correspond to the metric in which we are hoping to show similarity between D and G.

Let's now unpack this very general recipe in the case that G is a Gaussian. What is a transformation that has the standard Gaussian as a fixed point? There are many possibilities, but let's consider the transformation that consists of adding a small amount of Gaussian "noise" (i.e. convolving the density with a tiny Gaussian), and then linearly rescaling the distribution so that the variance of a zero-mean univariate distribution is left unchanged. This transformation is equally valid in the multivariate setting, but let's focus on distributions over \mathbb{R}. Concretely, let T_{\sigma} be the transformation that maps D to the distribution of the random variable \frac{1}{\sqrt{1+\sigma^2}}(X+G_{\sigma}), where X is drawn from D, and G_{\sigma} is an independent random variable drawn from N(0,\sigma^2). Since the sum of independent Gaussians is Gaussian, and the variance of a sum of independent random variables is the sum of the variances, T_{\sigma}(G)=G, where G is the standard Gaussian. And it is not hard to see that G is the unique fixed point of this transformation.
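To see this fixed-point property in action, here is a small simulation sketch of mine (the starting distribution, the value of \sigma, and the number of iterations are arbitrary choices): starting from a skewed distribution with mean 0 and variance 1 and applying T_{\sigma} repeatedly, the skewness and excess kurtosis are driven toward the Gaussian values of 0.

```python
# Sketch: iterate T_sigma (add N(0, sigma^2) noise, rescale by 1/sqrt(1+sigma^2))
# on samples from a centered exponential and watch skewness / excess kurtosis
# decay toward 0, their values for a standard Gaussian.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sigma = 0.5
x = rng.exponential(1.0, 500_000) - 1.0   # mean 0, variance 1, far from Gaussian

print("before:", stats.skew(x), stats.kurtosis(x))     # roughly 2 and 6
for _ in range(20):
    x = (x + rng.normal(0.0, sigma, size=x.shape)) / np.sqrt(1 + sigma**2)
print("after: ", stats.skew(x), stats.kurtosis(x))     # both close to 0
```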

The connection between this transformation and the fact that \mathbb{E}[f'(Z)-Zf(Z)]=0 for any function f if Z is a standard Gaussian random variable comes from considering the transformation T_{\sigma} as \sigma \rightarrow 0. [This infinitesimal transformation, when iteratively applied and viewed as a process, is known as the Ornstein-Uhlenbeck process.] For a function g, consider how \mathbb{E}[g(X)] changes if we apply this infinitesimal transformation to X. We can directly analyze the effect of the two components of T_{\epsilon}, namely the addition of Gaussian noise, and then the rescaling by a factor of 1/\sqrt{1+\epsilon^2}. To analyze the first component, note that if Z_\epsilon is distributed according to N(0,\epsilon^2), then, to first order, \mathbb{E}[g(X + Z_{\epsilon})] \approx \mathbb{E}[g(X)] +\frac{\epsilon^2}{2}\mathbb{E}[g''(X)]. As a sanity check, if g is linear, then this zero-mean noise will have no effect on the expectation, and if g has positive second derivative, this noise will increase the expectation. To analyze the second component, the scaling of X by a factor of 1/\sqrt{1+\epsilon^2} \approx 1-\epsilon^2/2, note that this will change X by roughly -\epsilon^2X/2, and hence, to first order, this will alter the expectation by -\frac{\epsilon^2}{2}\mathbb{E}[Xg'(X)]. Putting these two pieces together yields that for any random variable, X,

\mathbb{E}[g(T_{\epsilon}(X))] \approx \mathbb{E}\left[g(X) + \frac{\epsilon^2}{2} g''(X) - \frac{\epsilon^2}{2} X g'(X)\right].

Hence, from the viewpoint of a function g, the amount by which the transformation T_{\epsilon} alters the random variable X is proportional to \mathbb{E}[g''(X)]-\mathbb{E}[Xg'(X)]. And, if X=Z is a standard Gaussian random variable, then we know this transformation preserves the distribution of Z, and hence \mathbb{E}[g''(Z)]-\mathbb{E}[Zg'(Z)] = 0. Letting the function f denote g', we have re-derived the fact that \mathbb{E}[f'(Z) - Z f(Z)]=0 via a completely intuitive argument!
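The first-order claim above can also be checked numerically on a toy example. In the sketch below (my own; the Rademacher variable and the test function g(x) = x^4 are arbitrary choices), the change \mathbb{E}[g(T_{\epsilon}(X))] - \mathbb{E}[g(X)] closely matches \frac{\epsilon^2}{2}\left(\mathbb{E}[g''(X)] - \mathbb{E}[Xg'(X)]\right).

```python
# Sketch: compare the actual change in E[g(X)] after one application of T_eps
# against the predicted first-order change, for X uniform on {-1,+1} (mean 0,
# variance 1) and g(x) = x**4, so g'(x) = 4x**3 and g''(x) = 12x**2.
import numpy as np

rng = np.random.default_rng(3)
eps, m = 0.05, 2_000_000

x = rng.choice([-1.0, 1.0], size=m)
t_eps_x = (x + rng.normal(0.0, eps, size=m)) / np.sqrt(1 + eps**2)

actual = np.mean(t_eps_x**4 - x**4)                            # coupled estimate of the change
predicted = (eps**2 / 2) * np.mean(12 * x**2 - x * (4 * x**3))
print(f"actual: {actual:.5f}   predicted: {predicted:.5f}")
```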

Our central limit theorems will now come from analyzing the quantity \mathbb{E}[g''(X)]-\mathbb{E}[Xg'(X)] for all functions g belonging to some set that depends on the distance metric in question. (And, of course, we might as well drop one of the differentiations and work with f = g'.) Suppose we would like to prove that X =\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i is close to the standard Gaussian, Z, according to some specific metric. For some family of test functions, H, suppose we wish to analyze d_{H}(X,Z) = \sup_{h \in H} \left( \mathbb{E}_{X}[h(X)] - \mathbb{E}_Z [h(Z)]\right). If H consists of the set of functions that are indicators of measurable sets, then d_H is total variation distance (\ell_1 distance). If H consists of all indicators of half-lines, then the corresponding distance metric is the distance between the cumulative distribution functions, as in the usual Berry-Esseen bound. If H is the set of all Lipschitz-1 functions, then the distance metric corresponds to Wasserstein distance (also referred to as earth-mover distance). Stein's Method can be made to work for many distance metrics, though it is especially easy to work with Wasserstein distance, as the continuity of the test functions plays well with the differentiation, etc.
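Both of these distances are easy to estimate from samples; the sketch below is mine (the summand distribution, the value of n, and the sample sizes are arbitrary), estimating the CDF (Kolmogorov) distance and the Wasserstein distance between the normalized sum and the standard Gaussian.

```python
# Sketch: for X the normalized sum of n = 30 centered Exp(1) variables, estimate
# the Kolmogorov distance (sup over indicators of half-lines) and the Wasserstein
# distance (sup over 1-Lipschitz test functions) to N(0,1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, trials = 30, 200_000

x = (rng.exponential(1.0, size=(trials, n)) - 1.0).sum(axis=1) / np.sqrt(n)

kolmogorov = stats.kstest(x, "norm").statistic
wasserstein = stats.wasserstein_distance(x, rng.standard_normal(trials))
print(f"Kolmogorov ~ {kolmogorov:.4f}   Wasserstein ~ {wasserstein:.4f}")
```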

So how do we actually get a CLT out of all this? How can we relate the distance between the distribution of X and a Gaussian to some expression involving the quantity \mathbb{E}[f'(X)]-\mathbb{E}[Xf(X)]? Given a function h, we can attempt to solve the differential equation: f'(x) - xf(x) = h(x) - \mathbb{E}[h(Z)]. Note that the expectation of the right side of this equation, for x distributed according to X, is precisely the difference in expectations that we would like to analyze, and the left side is the expression that we know has expectation 0 when x is distributed according to Z, and which will tell us how far X is from Z. This differential equation is fairly easy to solve for f, yielding the solution:

f_h(x) = e^{x^2/2} \int_{-\infty}^x \left(h(t)-\mathbb{E}[h(Z)]\right)e^{-t^2/2}\, dt.

Hence, we have now established that d_H(X,Z) = \sup_{h \in H} \left(\mathbb{E}[h(X)]-\mathbb{E}[h(Z)] \right)= \sup_{h \in H} \left( \mathbb{E}[f'_h(X)-Xf_h(X)]\right).
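One can verify this solution numerically; the sketch below (my own, with the 1-Lipschitz test function h(x) = |x| chosen arbitrarily) builds f_h by numerical integration and checks that f'_h(x) - x f_h(x) = h(x) - \mathbb{E}[h(Z)] at a few points.

```python
# Sketch: construct f_h(x) = exp(x^2/2) * int_{-inf}^x (h(t) - E[h(Z)]) exp(-t^2/2) dt
# by quadrature for h(x) = |x|, and confirm the Stein equation via a finite-
# difference approximation of f_h'.
import numpy as np
from scipy import integrate, stats

h = np.abs
Eh_Z, _ = integrate.quad(lambda t: h(t) * stats.norm.pdf(t), -np.inf, np.inf)

def f_h(x):
    val, _ = integrate.quad(lambda t: (h(t) - Eh_Z) * np.exp(-t**2 / 2), -np.inf, x)
    return np.exp(x**2 / 2) * val

for x in [-1.5, 0.3, 2.0]:
    d = 1e-4
    lhs = (f_h(x + d) - f_h(x - d)) / (2 * d) - x * f_h(x)   # f_h'(x) - x f_h(x)
    rhs = h(x) - Eh_Z
    print(f"x={x:+.1f}: lhs = {lhs:.4f}   rhs = {rhs:.4f}")
```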

To finish, we leverage the properties of X, together with an understanding of how f_h is related to h, to bound the quantity on the right side. In the case that X = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i for i.i.d. X_i's, this will be a three-line argument in which we compute the linear Taylor expansions of Xf_h(X) and f'_h(X) about X_{-1} = \frac{1}{\sqrt{n}}\sum_{i=2}^n X_i:

\mathbb{E}[Xf_h(X)] = n \cdot \frac{1}{\sqrt{n}} \mathbb{E}[X_1 f_h(X)] = \sqrt{n}\, \mathbb{E}\left[X_1 f_h\left(\frac{1}{\sqrt{n}}X_1 + X_{-1}\right)\right]

= \sqrt{n}\, \mathbb{E}\left[X_1 \left(f_h(X_{-1}) + \frac{1}{\sqrt{n}}X_1 f'_h(X_{-1})\right)\right] + err_1,

where err_1 is the error term in the linear approximation, and is bounded in magnitude by |err_1| \le \frac{1}{\sqrt{n}}\mathbb{E}[|X_1|^3] \|f''_h\|. Leveraging the independence of X_1 and X_{-1}, and the assumption that X_1 has zero mean and unit variance, the above simplifies to \mathbb{E}[f'_h(X_{-1})] + err_1. We now analyze the other term, \mathbb{E}[f'_h(X)], via a similar expansion:

\mathbb{E}[f'_h(X)] = \mathbb{E}\left[f'_h\left(X_{-1}+\frac{1}{\sqrt{n}}X_1\right)\right] = \mathbb{E}[f'_h(X_{-1})]+err_2,

where err_2 is bounded in magnitude as |err_2| \le \frac{1}{\sqrt{n}}\mathbb{E}[|X_1|] \|f''_h\| \le \frac{1}{\sqrt{n}}\|f''_h\|, since \mathbb{E}[|X_1|] \le \sqrt{\mathbb{E}[X_1^2]} = 1 by Cauchy-Schwarz. Combining these expressions, we have established the following:

d_H(X,Z) = \sup_{h \in H}\left(\mathbb{E}[h(X)]-\mathbb{E}[h(Z)] \right) =\sup_{h \in H}\left( \mathbb{E}[f'_h(X)]-\mathbb{E}[Xf_h(X)] \right)

\le \sup_{h \in H} \frac{\|f''_h\|}{\sqrt{n}}(1+\mathbb{E}[|X_1|^3]).

If H is the class of indicators of half-lines, then we will need to work a bit harder to get a nice bound out of this expression. If, however, we care about Wasserstein distance, it is not hard to show that for any 1-Lipschitz function h, \|f''_h\| \le 2, in which case this immediately yields the familiar O(1/\sqrt{n}) error bound for our central limit theorem!
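As a rough empirical check of this rate (again a sketch of mine, using centered Exp(1) summands so that the exact CDF of the sum is available via the Gamma distribution, and using the one-dimensional fact that the Wasserstein distance equals the integral of the absolute difference of the CDFs):

```python
# Sketch: for X = (sum of n Exp(1) - n)/sqrt(n), compute the Wasserstein distance
# to N(0,1) (up to quadrature error) as the integral of |F_X - Phi|, and observe
# that sqrt(n) times the distance stays roughly constant.
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

xs = np.linspace(-12, 12, 40001)
phi = stats.norm.cdf(xs)
for n in [10, 100, 1000, 10000]:
    F_X = stats.gamma.cdf(n + np.sqrt(n) * xs, a=n)   # exact CDF of the normalized sum
    w = trapezoid(np.abs(F_X - phi), xs)
    print(f"n={n:6d}: Wasserstein ~ {w:.5f}   sqrt(n) * W ~ {np.sqrt(n) * w:.3f}")
```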

For many other settings where one might wish to prove convergence bounds for central limit theorems, Stein's Method takes roughly the same form as the above argument. The multivariate analog proceeds along essentially the same route, beginning with the observation that the Gaussian is the unique fixed point of the multivariate analog of the transformation T_{\sigma} described above. To show convergence to a non-Gaussian distribution, the same recipe can also be applied. For example, to show convergence to a Poisson distribution of expectation \lambda, rather than using the expression f'(x)-xf(x) (which is the "characterizing operator" for the standard Gaussian distribution), one instead considers xf(x)-\lambda f(x+1), as the expectation of this expression is zero for every function f if, and only if, X is drawn from a Poisson distribution of expectation \lambda.
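The Poisson characterization can be checked numerically in the same spirit as before; the sketch below is mine (the test function f(x) = \sin(x) and the geometric distribution used for contrast are arbitrary choices).

```python
# Sketch: estimate E[X f(X) - lambda f(X+1)] for f = sin. It should be near zero
# for Poisson(lambda) samples and visibly nonzero for a geometric distribution
# with the same mean.
import numpy as np

rng = np.random.default_rng(6)
lam, m = 3.0, 1_000_000

poisson = rng.poisson(lam, m)
geometric = rng.geometric(1.0 / (1.0 + lam), m) - 1   # supported on {0,1,...}, mean lam

for name, x in [("Poisson  ", poisson), ("geometric", geometric)]:
    print(name, np.mean(x * np.sin(x) - lam * np.sin(x + 1)))
```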

 

This blog post only begins to describe the tip of the elegant, deep, and multifaceted iceberg that is Stein's Method. If your appetite is whetted and you would like to read more, I would suggest starting with Sourav Chatterjee's "A Short Survey of Stein's Method". This survey contains both a short history of Stein's Method and a concise summary of the extensions and generalizations that have been developed over the past 30 years (together with some nice open questions along the lines of leveraging Stein's Method to prove limiting behaviors of various graph-theoretic random variables, such as the length of the shortest traveling salesman tour on certain families of random graphs). For a more detailed reference, see the fairly recent book of Chen, Goldstein, and Shao, "Normal Approximation by Stein's Method" (which is also available for download).

3 Comments on A Glimpse at Stein’s Method, in Memory of Charles Stein

  1. Thanks for the clear post (and great exposition)! If anything, I have a possibly trivial question about one step: when bounding the sum of the two errors, how do you obtain $1+\mathbb{E}[\lvert X_1\rvert^3]$ instead of $\mathbb{E}[\lvert X\rvert]+\mathbb{E}[\lvert X_1\rvert^3]$?


  2. Sorry, that was a bit of a typo: the err_2 term should have a subscript on the X, namely |err_2| \le \frac{1}{\sqrt{n}}\mathbb{E}[|X_1|]\|f''_h\|, and since X_1 has mean zero and unit variance, \mathbb{E}[|X_1|] \le 1, which gives the 1 in the final bound.


  3. I see — thank you.

