The event will take place in the Mackenzie Room at the Jen-Hsun Huang Engineering Center (quite close to the CS department). More details, including parking information, can be found here.

Program:

10:00-10:30 Gathering and coffee

10:30-11:15 Lada Adamic, Facebook

11:30-12:00 Clément Canonne, Stanford

12:00-12:30 Rad Niazadeh, Stanford

12:30-1:30 Lunch

1:30-2:00 Thomas Steinke, IBM Almaden

2:00-3:00 Students' short talks

3:00-3:30 Break

3:30-4:00 Rina Panigrahy, Google

4:00-5:30 Happy hour

Lada Adamic *How cascades grow*

Clément Canonne *Testing Conditional Independence of Discrete Distributions*

We study the problem of testing conditional independence for discrete distributions. Specifically, given samples from a discrete random variable (X, Y, Z) on domain [ℓ₁] × [ℓ₂] × [n],

we want to distinguish, with probability at least 2/3, between the case that X and Y are conditionally independent given Z and the case that (X, Y, Z) is ε-far, in ℓ₁-distance, from every distribution that has this property. Conditional independence is a concept of central importance in probability and statistics with a range of applications in various scientific domains. As such, the statistical task of testing conditional independence has been extensively studied in various forms within the statistics and econometrics communities for nearly a century. Perhaps surprisingly, this problem has not been previously considered in the framework of distribution property testing, and in particular no tester with sublinear sample complexity is known, even for the important special case that the domains of X and Y are binary.

The main algorithmic result of this work is the first conditional independence tester with *sublinear* sample complexity for discrete distributions over [ℓ₁] × [ℓ₂] × [n].

To complement our upper bounds, we prove information-theoretic lower bounds establishing that the sample complexity of our algorithm is optimal, up to constant factors, for a number of settings (in particular, for the prototypical setting in which the domains of X and Y are binary).

Joint work with Ilias Diakonikolas (USC), Daniel Kane (UCSD), and Alistair Stewart (USC).

Rad Niazadeh *Online auctions and multi-scale learning*

In this talk, I study revenue maximization in online auctions and pricing. A seller sells an identical item in each period to a new buyer, or a new set of buyers. For the online posted pricing problem, we show regret bounds that scale with the best fixed price, rather than the range of the values. We also show regret bounds that are almost scale free, when comparing to a benchmark that requires a lower bound on the market share. Moreover, we demonstrate a connection between the optimal regret bounds for this online problem and offline sample complexity lower bounds for approximating optimal revenue, and we show our regret bounds are almost tight with respect to these information-theoretic lower bounds. Our online auction and pricing algorithms are obtained by generalizing the classical learning-from-experts and multi-armed bandit problems to their “multi-scale versions”, where the reward of each action is in a different range. Here the objective is to design online learning algorithms whose regret with respect to a given action scales with its own range, rather than the maximum range.

Thomas Steinke *Less is more: Limiting information to guarantee generalization in adaptive data analysis*

Rina Panigrahy *Convergence Results for Neural Networks via Electrodynamics*

We study whether a depth two neural network can learn another depth two network using gradient descent. Assuming a linear output node, we show that the question of whether gradient descent converges to the target function is equivalent to the following question in electrodynamics: Given k fixed protons in R^d, and k electrons, each moving due to the attractive force from the protons and repulsive force from the remaining electrons, whether at equilibrium all the electrons will be matched up with the protons, up to a permutation. Under the standard electrical force, this follows from the classic Earnshaw’s theorem. In our setting, the force is determined by the activation function and the input distribution. Building on this equivalence, we prove the existence of an activation function such that gradient descent learns at least one of the hidden nodes in the target network. Iterating, we show that gradient descent can be used to learn the entire network one node at a time.

Joint work with Ali Rahimi, Sushant Sachdeva, Qiuyi Zhang

Website: http://academicjobsonline.org/ajo/jobs/10704

Email: theory.stanford@gmail.com


For the sake of exposition, I am going to skip over many of the details (as well as many of the results) in order to hopefully convey some of the interesting flavor to someone who is not already thinking about robust estimation.

To start, let us imagine an adversarial game between Alice (the attacker) and Bob (the learner). There is initially a “clean” dataset S of n points in ℝ^d, and Bob’s goal is to estimate the mean μ of S. However, Alice is allowed to first adversarially corrupt the set in some way before Bob gets to see it. The question is: when does Bob have a strategy that allows him to output an estimate of μ with small error, no matter what Alice does?

We will consider two types of adversaries:

- *Deletion adversaries*: If the clean set S has n elements, Alice is allowed to remove up to εn of the elements of S before showing it to Bob.
- *Addition adversaries*: If the clean set S has n elements, Alice is allowed to add up to εn arbitrary elements to S before showing it to Bob.

Below is a depiction of a possible strategy when Alice is an addition adversary:

The blue points are the clean data, and Bob wants to estimate the true mean (the green X). Alice has added outliers (in red) to try to fool Bob.

**Additions vs. deletions.** Intuitively, it seems like addition adversaries should be much more powerful than deletion adversaries—they can add arbitrary points to S rather than only delete existing ones. However, the main point of this blog post is that **addition adversaries are in fact no more powerful than deletion adversaries.** More precisely, whenever the mean of a set is robust to deletions, there is an (exponential-time) algorithm for recovering the mean in the presence of arbitrary additions. The proof of this is a simple pigeonhole argument that I will go over in the next section.

**Note on high dimensions.** The naive strategy for handling outliers is to throw away all points that are far away in norm from the empirical mean. However, for, say, the ℓ₂-norm, this strategy will typically have error growing as √d in d dimensions (since even for a Gaussian with identity covariance, most points have distance roughly √d from the mean). We will see that more sophisticated strategies can do substantially better, obtaining dimension-independent error guarantees in many cases.

To formalize what we mean by robustness to deletions, we make the following definition:

**Definition (Resilience).** A set S with mean μ is said to be (σ, ε)-resilient in a norm ‖·‖ if, for every subset T ⊆ S of size at least (1 − ε)|S|, we have ‖mean(T) − μ‖ ≤ σ.

In other words, a set is resilient if every large subset (of at least a (1 − ε)-fraction of the elements) has mean close to μ. This means that if any ε-fraction of the elements is deleted, the empirical mean of the remaining points will still have small distance to μ.
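As a sanity check on this definition, here is a hypothetical brute-force implementation in Python (my own toy code, exponential in |S|, for intuition only); the function name `is_resilient`, the sample points, and the choice of the ℓ₂ norm are illustrative assumptions, not from the paper:

```python
import math
from itertools import combinations

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_resilient(S, sigma, eps):
    """Brute-force check of (sigma, eps)-resilience in the l2 norm:
    every subset T with |T| >= (1 - eps)|S| must satisfy
    ||mean(T) - mean(S)|| <= sigma."""
    mu = mean(S)
    m = math.ceil((1 - eps) * len(S))
    for size in range(m, len(S) + 1):
        for T in combinations(S, size):
            if dist(mean(T), mu) > sigma:
                return False
    return True

# A tight cluster is resilient; adding one far-away point breaks resilience.
cluster = [(0.0,), (0.1,), (-0.1,), (0.05,), (-0.05,)]
print(is_resilient(cluster, sigma=0.5, eps=0.2))              # True
print(is_resilient(cluster + [(10.0,)], sigma=0.5, eps=0.2))  # False
```

Note how a single outlier drags the overall mean far from the mean of every large clean subset, which is exactly what the definition rules out.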

I claimed earlier that robustness to deletions implies robustness to additions. This is formalized in the following proposition:

**Proposition (Resilience implies Robustness).** If S is (σ, ε)-resilient, then there is an (exponential-time) algorithm for outputting a μ̂ with ‖μ̂ − μ‖ ≤ 2σ, even if Alice is allowed to add up to εn arbitrary points.

**Proof.** The proof is a simple pigeonhole argument. Suppose that S̃ is the set of (1 + ε)n points that Bob observes, and that S ⊆ S̃ is the set of n clean points, which is (σ, ε)-resilient by assumption.

Now let T be *any* (σ, ε)-resilient subset of S̃ of size n. (Such a set exists, since S is one such set.) We claim that the mean of any such T is within 2σ of the mean of S.

Indeed, by pigeonhole we must have |S ∩ T| ≥ |S| + |T| − |S̃| = (1 − ε)n. In particular, taking the subset S ∩ T in the definition of resilience, we have ‖mean(S ∩ T) − mean(S)‖ ≤ σ.

In other words, the mean of S ∩ T differs from the mean of S by at most σ. But since T is also resilient, the mean of S ∩ T differs from the mean of T by at most σ as well. Therefore, by the triangle inequality, the means of S and T are within 2σ, as claimed.

In summary, it suffices to find any (σ, ε)-resilient set of size n and output its mean. This procedure will be robust even to the addition of an ε-fraction of outliers.
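To make this concrete, here is a hedged, exponential-time Python sketch of the procedure just described (my own toy code; `robust_mean` and the example points are illustrative, and the brute-force search is only feasible for tiny inputs):

```python
import math
from itertools import combinations

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_resilient(T, sigma, eps):
    """Brute-force (sigma, eps)-resilience check in the l2 norm."""
    mu = mean(T)
    m = math.ceil((1 - eps) * len(T))
    return all(dist(mean(U), mu) <= sigma
               for size in range(m, len(T) + 1)
               for U in combinations(T, size))

def robust_mean(observed, n_clean, sigma, eps):
    """Exponential-time sketch of the proposition: search for *any*
    (sigma, eps)-resilient subset of size n_clean and return its mean.
    By the pigeonhole argument above, that mean is within 2*sigma of
    the clean mean, no matter which points the adversary added."""
    for T in combinations(observed, n_clean):
        if is_resilient(T, sigma, eps):
            return mean(T)
    return None

clean = [(0.0,), (0.1,), (-0.1,), (0.05,), (-0.05,)]
corrupted = clean + [(10.0,)]   # the adversary adds one outlier
mu_hat = robust_mean(corrupted, len(clean), sigma=0.5, eps=0.2)
print(mu_hat)  # guaranteed within 2*sigma = 1.0 of the true mean (0,)
```

The key point is that the algorithm never needs to identify which points are outliers; any resilient subset of the right size is good enough.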

Resilience gives us a way of showing that certain robust estimation problems are possible. For instance, suppose that we have data points x₁, …, xₙ drawn from a distribution p with bounded covariance: ‖Σ‖_op ≤ σ², where Σ is the covariance matrix of p. Then one can show that, for n sufficiently large, the points are (O(σ√ε), ε)-resilient in the ℓ₂-norm with high probability (this is because any set whose empirical covariance is bounded in spectral norm is resilient). In particular, it is possible to recover the mean to error O(σ√ε) in the presence of an ε-fraction of outliers. Note that for small ε this is substantially better than the naive bound, which grows as √d instead of √ε.

More generally, if a distribution has bounded k-th moments, then n samples from that distribution (for sufficiently large n) will be (O(ε^{1−1/k}), ε)-resilient, while samples from a sub-Gaussian distribution will be (O(ε√(log(1/ε))), ε)-resilient.

Using other norms (such as the ℓ₁-norm) it is possible to get interesting results for problems with a more combinatorial flavor. In the paper, for instance, we show:

- The ℓ₁-norm gives results for robust learning of discrete distributions.
- A truncated version of this norm gives results for robustly learning stochastic block models. (Interestingly, it appears that the robust information-theoretic threshold matches the Kesten-Stigum threshold up to logarithmic factors.)

The latter result on stochastic block models requires establishing the surprising fact that robust estimation is possible even with a *majority* of outliers. I will not go into detail here, but it is possible to show this using a modification of the pigeonhole argument above.

The problem of outlier-robust learning is very classical, going back at least to Tukey (1960). However, our interest here is in the high-dimensional setting, which surprisingly does not seem to have had satisfactory answers until quite recently. I believe part of this may be due to some historical accident of definitions—in the statistics literature following Tukey, many researchers were interested in developing estimators with good *breakdown points*. The breakdown point is defined as the maximum fraction of outliers tolerated before the estimator becomes meaningless (for instance, the median has a breakdown point of 50%, while the mean has a breakdown point of 0% because a single outlier can change it arbitrarily). While many estimators have very bad breakdown points, Donoho (1982) and Donoho & Gasko (1992) developed an estimator with a very good breakdown point of essentially 50% (even in high dimensions). However, the error of that estimator could grow polynomially with the dimension in the presence of even a small fraction of outliers. Another estimator with good robustness properties is the Tukey median (Tukey, 1975), but this is NP-hard to compute (Johnson & Preparata, 1978).

It is only very recently that (computationally-efficient) estimators with small error in high dimensions were developed. Concurrent papers by Lai, Rao, & Vempala (2016) and Diakonikolas, Kamath, Kane, Li, Moitra, & Stewart (2016) showed how to robustly estimate the mean of various distributions in the presence of outliers, with error depending at most logarithmically on the dimension (DKKLMS16 get error completely independent of the dimension). My own interest in this problem came from considering robustness of crowdsourced data collection when some fraction of the raters are dishonest (SVC, 2016). I worked on this problem with Greg and Moses and we later realized that our techniques were actually fairly general and could be used for robustly solving arbitrary convex minimization problems (CSV, 2017).

However, most of this recent work uses fairly sophisticated algorithms and in general I suspect it is not easy for outsiders to this area to understand all of the intuition behind what is going on. This is what motivated considering the information-theoretic question in the previous section, because I think that once we are okay ignoring computational efficiency the picture becomes much clearer.

While I would be happy if the only thing you take away from this blog post is the proof that resilience implies robustness, if you are interested there is some other cool stuff in our paper. Specifically:

- We obtain computationally efficient algorithms in certain settings (including certain ℓ_p-norms).
- We show that the idea of resilience is applicable beyond mean estimation (in particular, for low-rank recovery).
- We show that for strongly convex norms, the properties of resilience and bounded covariance are closely linked.

To elaborate a bit more on the last point, it is not hard to show that any set whose empirical distribution has bounded covariance is also (σ, ε)-resilient for every ε, where the value of σ depends on ε and the covariance bound. However, it turns out that there is a converse provided the norm is strongly convex: given a set that is resilient in a strongly convex norm, it is always possible to delete a small number of points such that the remaining points have bounded covariance. The strong convexity assumption is actually important, and the proof is a nice application of minimax duality combined with Khintchine's decoupling inequality.

Anyways, hopefully this provides some encouragement to read the full paper, and we would be very interested in any questions or feedback (feel free to leave them in the comments).

[CSV17] M. Charikar, J. Steinhardt, and G. Valiant, Learning from untrusted data, Symposium on Theory of Computing (STOC), 2017.

[DKKLMS16] I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016.

[D82] D. L. Donoho. Breakdown properties of multivariate location estimators. Ph.D. qualifying paper, 1982.

[DG92] D. L. Donoho and M. Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Annals of Statistics, 20(4):1803–1827, 1992.

[LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016.

[SCV18] J. Steinhardt, M. Charikar, and G. Valiant, Resilience: A criterion for learning in the presence of arbitrary outliers, Innovations in Theoretical Computer Science (ITCS), 2018.

[SVC16] J. Steinhardt, G. Valiant, and M. Charikar, Avoiding imposters and delinquents: Adversarial crowd-sourcing and peer prediction, Advances in Neural Information Processing Systems (NIPS), 2016.

[T60] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to probability and statistics, 2:448–485, 1960.

[T75] J. W. Tukey. Mathematics and picturing of data. In ICM, volume 6, pages 523–531, 1975.

WIT is one of my favorite (if not *the* favorite) programs in the theory community. I was lucky enough to participate on the outskirts of two WIT meetings, but naturally and rightfully never took part in its inner sanctum. So how can I vouch for an event I never fully attended? Having experienced the passion of the organizers, most notably of Tal Rabin, and the enthusiastic reactions of its participants, there can be no doubt. I am far from the only one feeling this way, and this year both Harvard and Stanford competed to host the workshop. So if you fit the workshop’s qualifications – please do yourself a favor and apply!

We will have a “Women In Theory” workshop for female graduate students and exceptional undergraduates (fourth year) in theoretical computer science at Harvard University, Cambridge, MA from Tuesday, June 19 to Friday, June 22, 2018, __https://womenintheory.wordpress.com/__ .

We are writing to draw your attention to the workshop and ask you to inform female students in your department about this workshop. The workshop will have first-rate technical content and will be a great opportunity for students to meet their peers from around the world. We have received very enthusiastic feedback from participants in previous years and we think that this could be an exciting event for your student as well. Please forward this email to the female students in your department.

We will supply dorm rooms, breakfast and lunch, and will probably also be able to cover at least part if not all of their travel expenses. It would be great if you would be able to cover the remaining expenses.

For any questions, please email __womenintheory2018@gmail.com__.

The deadline to apply is January 16, 2018. Each student applicant needs to finish the application form on __https://womenintheory.wordpress.com/apply/__ , and her advisor also needs to supply a short letter of recommendation.

Best,

The organizing committee:

Tal Rabin

Shubhangi Saraf

Lisa Zhang

*“If any one faculty of our nature may be called more wonderful than the rest, I do think it is memory” — Jane Austen*

In this blog post, we’ll consider the sequential prediction problem of predicting the next observation x_t given a sequence of past observations x₁, x₂, …, x_{t−1}, and we’ll study this from the point of view of storing and referencing information about past observations in memory in order to predict future observations. We’ll start with something basic and then make it more general. This is based on joint work with Sham Kakade, Percy Liang and Greg Valiant.

Consider a simple scenario, where the sequence of observations is just a sequence of *n* bits that keeps repeating (see Fig. 1), and this bit string is even known in advance. Suppose you get a sequence of observations from this model, and the task is to predict the output at the next time step, given only the previous ℓ outputs. Clearly if ℓ ≥ n, then the task is trivial, because the outputs are periodic with period n. Is a shorter window sufficient? If the bit string is chosen uniformly at random, then with high probability all length-O(log n) substrings of the string are unique, and therefore length-O(log n) sequences of observations are sufficient to uniquely identify the current position in the string; hence ℓ = O(log n) is sufficient to predict the outputs accurately.
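A quick illustrative experiment (my own toy code, not from the paper; the pattern length 64 and window length 16 are arbitrary choices): generate a random bit pattern, repeat it, and predict each bit by looking up the last earlier occurrence of the current window.

```python
import random

random.seed(0)
n = 64
s = [random.randint(0, 1) for _ in range(n)]   # the repeating pattern
stream = s * 20                                 # the observed sequence

def predict_next(history, window):
    """Predict the next bit by finding the most recent earlier occurrence
    of the final `window` bits and outputting the bit that followed it."""
    ctx = history[-window:]
    for i in range(len(history) - window - 1, -1, -1):
        if history[i:i + window] == ctx:
            return history[i + window]
    return 0  # context never seen before; guess

window = 16
# After a warm-up of two periods, window-based prediction should almost
# always be correct, since most length-16 windows of a random pattern
# occur at a unique position within the period.
errors = sum(predict_next(stream[:t], window) != stream[t]
             for t in range(2 * n, len(stream)))
print(errors)  # small; 0 if all length-16 windows of the pattern are unique
```

If two positions in the pattern happen to share the same length-16 window but are followed by different bits, the predictor errs at those positions once per period, which is why the window length needs to grow (logarithmically) with n.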

Is there any hope of making good predictions when ℓ ≪ n for *worst-case* bit strings? Note that observations from this model could certainly have dependencies across length-n time scales for some strings—for example, consider the string which has a single 1 and 0s in all other positions; here we need to know the previous n − 1 outputs to determine if the next output will be a 1. Despite these long-range dependencies, as it turns out, windows of O(log n) observations are sufficient to make accurate predictions on average across time, both in this model of a repeating length-n string and much more generally. Before describing these results and the intuition behind them, we provide some context for this basic question: *how should one consolidate and reference memories about the past in order to effectively predict the future?*

Such questions about the importance of memory and how humans form memories have interested thinkers since back in the time of Plato. Our featured picture is from Christopher Nolan’s *Memento*, whose protagonist suffers from short-term memory loss approximately every five minutes; the plot masterfully explores how our memories in some sense shape our reality. There has been considerable interest in the neuroscience community in understanding how humans and animals create and retrieve memories to make accurate predictions about their environment (for example, see [1] and [2]). Closer to home, one can ask the question of how *algorithms* can consolidate and reference memories about the past in order to effectively predict the future.

This question of how to use memory (both how to figure out what to remember, and how to usefully query those memories) has been one of the most significant challenges that practical ML and NLP researchers have been grappling with recently. These efforts have led to a variety of neural network architectures that have an explicit notion of memory, including recurrent neural networks (RNNs), neural Turing machines, memory networks and Long Short-Term Memory (LSTM) networks (for a very nice introduction to these, see this). These advances have obtained some degree of practical success, but seem largely unable to consistently learn long-range dependencies, which are crucial in many settings including language. One amusing example of this is the recent sci-fi short film *Sunspring* whose script was automatically generated by a LSTM network. Locally, each sentence of the dialogue (mostly) makes sense, though there is no cohesion over longer time frames, and no overarching plot trajectory (despite the brilliant acting). It is an interesting watch — check it out below!

Many fundamental questions in this setting seem ripe for theoretical investigation. *i) How much memory is necessary to accurately predict future observations, and what properties of the underlying sequence determine this requirement? ii) Must one remember significant information about the distant past or is a short-term memory sufficient? iii) What is the computational complexity of accurate prediction? iv) How do answers to the above questions depend on the metric that is used to evaluate prediction accuracy? *We believe that answers to these questions could both guide the development of practical prediction systems, and help understand how prediction/learning takes place in nature.

In our recent work, we attempt to make progress on the first three questions. Perhaps surprisingly, we show that for a broad class of sequences, the “naive” algorithm that bases its predictions only on the most recent few observations, together with a set of simple summary statistics of the past observations, predicts nearly as well as possible on *average*. The average error here is the error at every time step averaged across time, for a sufficiently large time window (average error is the most natural error metric, and is the metric ubiquitously used in practice). One concrete special case of our more general result concerns sequences generated according to a Hidden Markov Model (HMM) with at most n hidden states (note that the model in Fig. 1 corresponds to a very simple HMM with n hidden states). We show that the naive prediction algorithm based on the empirical frequencies of length-O(log n/ε) windows of observations achieves average error at most ε greater than the average error of the optimal predictor, which knows the *entire* history of the sequence *and* the parameters of the underlying HMM; for this naive empirical model to achieve this error, the length of the sequence must be quite long, exponential in the window length, with the base of the exponent being the size of the observation alphabet.

This naive prediction algorithm is simply the ℓ-th order Markov model, which predicts the distribution of the next observation based on its conditional distribution given the previous ℓ observations, where this conditional distribution is estimated from the empirical frequencies observed so far. Note that this result is independent of the mixing time of the Markov model, and holds even when the Markov chain does not mix.
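As a toy illustration of this naive algorithm (my own sketch; the pattern, the 5% noise level, and the window length k = 4 are arbitrary choices for exposition), here is a k-th order Markov predictor built from empirical frequencies:

```python
import random
from collections import defaultdict

def kth_order_markov(sequence, k):
    """Estimate a k-th order Markov model from empirical frequencies:
    counts[context][symbol] = number of times `symbol` followed `context`."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(k, len(sequence)):
        ctx = tuple(sequence[t - k:t])
        counts[ctx][sequence[t]] += 1
    return counts

def predict(counts, context):
    """Predict the most frequent continuation of `context` seen so far."""
    ctx = tuple(context)
    if ctx not in counts:
        return 0  # unseen context; guess
    return max(counts[ctx], key=counts[ctx].get)

random.seed(1)
# Noisy periodic source: a length-8 pattern with 5% symbol noise.
pattern = [0, 1, 1, 0, 1, 0, 0, 1]
stream = [b if random.random() > 0.05 else 1 - b for b in pattern * 500]

k = 4
train, test = stream[:3000], stream[3000:]
counts = kth_order_markov(train, k)
errs = sum(predict(counts, test[i - k:i]) != test[i]
           for i in range(k, len(test)))
print(errs / (len(test) - k))  # error rate roughly at the 5% noise level
```

Even though the predictor never models the hidden periodic structure explicitly, its empirical window statistics are enough to predict nearly as well as the noise allows.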

Interestingly, this result shows that accurate prediction is possible even if the algorithm does not explicitly capture any long-range dependencies. Note that an n-state HMM can certainly represent dependencies of length n (as in the model in Fig. 1); nevertheless, the predictor that only uses the most recent O(log n/ε) observations can achieve nearly optimal prediction error. One interpretation of these results is that, from the perspective of prediction, all that matters is the amount of ”dependencies”, not whether the dependencies are long-range or short-range. Supporting this interpretation, we also show the following general result: for any distribution over sequences of observations (not necessarily generated by an HMM) for which the mutual information between the entire past observations and future observations is bounded by I, the best ℓ-th order Markov model obtains average KL error at most I/ℓ, or ℓ₁ error at most √(I/ℓ), with respect to the optimal predictions (note that for an HMM with n hidden states the mutual information is bounded by log n, as log n bits are sufficient to specify the hidden state, and the hidden state encapsulates all the information about the past). We also show that it is information-theoretically impossible to achieve a smaller error than this using only the previous ℓ observations.

The idea behind these results is most intuitive in the setting of a sequence generated according to an HMM with at most n hidden states. In this case, at each time step t, we either predict accurately (and are unsurprised when x_t is revealed to us), or, if we predict poorly and are surprised by the value of x_t, then (in a sense that can be made rigorous) x_t must contain a significant amount of information about the true hidden state. Because the hidden state can be specified via log n bits, this provides a bound on the number of errors that one expects to make. Check out the following video to explore this intuition more for the HMM in Fig. 1 —

These observations shed light on the striking power of the simple Markov model—it can obtain good predictions on average on *any* data-generating distribution, provided that its order scales with the mutual information of the sequence. These Markov models, with proper smoothing (i.e. “Kneser-Ney smoothing”), were essentially state of the art for natural language generation until a few years ago, and the result perhaps explains some of this success. It also strongly suggests that the widely used metric of *average error* may not be the right metric to train our algorithms on if we want them to learn long-term dependencies, as the trivial Markov model can do pretty well on this metric—even though it is hampered by short-term memory loss (not unlike Memento’s protagonist!). More speculatively, some recent studies have claimed that many animals have very poor short-term memory; could it be that nature also opts for the Markov model when starved for resources?

As mentioned above, the data required to estimate an ℓ-th order Markov model over an observation alphabet of size d is at least d^ℓ, as most sequences of ℓ observations might need to be observed. This prompts the question of whether it is possible to learn a successful predictor based on significantly less data. Without any additional assumptions on the structure of the sequence in question, the answer seems to be “no”. As we show, even for sequences generated from an n-state HMM, any computationally efficient algorithm that learns an ε-accurate predictor requires a number of observations far exceeding the information-theoretic minimum, assuming hardness of strongly refuting a certain class of CSPs. Read our full paper to find out more about these results!

The edit distance between two strings x and y is the minimum number of insertion, deletion, or substitution operations required to transform x into y. For example, the edit distance between “track” and “trek” is 2 (remove ‘a’ or ‘c’ and perform one substitution). One of the most important applications of edit distance is in computational biology, as a tool to determine how similar two genetic sequences are.

Computing edit distance is a textbook application of dynamic programming and can be performed in O(n²) time for strings of length n. The dynamic programming algorithm can be modified to output not just the edit distance itself, but also a corresponding sequence of edits. The quadratic runtime of the algorithm is prohibitively large for massive datasets (e.g., genomic data), and recent conditional lower bounds (by Backurs and Indyk and by Abboud, Hansen, Vassilevska Williams, and Williams) suggest that no algorithm with run time O(n^{2−δ}) exists for any constant δ > 0.
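The textbook dynamic program can be sketched as follows (a standard implementation, not code from the papers under discussion):

```python
def edit_distance(x, y):
    """Textbook dynamic program: D[i][j] is the edit distance between
    the first i characters of x and the first j characters of y."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all of x[:i]
    for j in range(n + 1):
        D[0][j] = j          # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / match
    return D[m][n]

print(edit_distance("track", "trek"))  # 2
```

Tracing back through the table (at each cell, moving to whichever neighbor achieved the minimum) recovers an optimal sequence of edits, which is the "alignment" discussed below.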

The challenge of efficiently computing edit distance has motivated the development of approximation algorithms which run in close to linear time. The state-of-the-art approximation algorithm, due to Andoni, Krauthgamer, and Onak, estimates edit distance within a polylogarithmic factor and runs in near-linear time. While these algorithms produce estimates of the edit distance, the equally important question of actually producing an alignment (i.e., the sequence of edits) has received far less attention. To the best of our knowledge, the current best approximation algorithm for producing an alignment has polynomial multiplicative error.

The gap between algorithms for estimating edit distance and producing an alignment raises a natural question: Can the current estimation algorithms be used to produce an alignment, or are different techniques needed? In our recent work, we show that any estimation algorithm can be used in a *black-box* fashion to recover an alignment with modest loss in run time and approximation ratio. Informally, our result takes an estimation algorithm with a given approximation ratio and run time, and produces an algorithm which recovers an approximate alignment with only a modest blow-up in both the approximation ratio and the run time. Plugging in the result of Andoni, Krauthgamer, and Onak, we can recover a polylogarithmic-factor-approximate alignment in near-linear time.

A high-level description of the algorithm is as follows. Let E be an algorithm for edit distance estimation. Given strings x and y, we break x into B equal parts (for some parameter B). We then examine, at a low granularity, the options for splitting y into B parts. Using E, we approximate the distance between the i-th part of x and each of the options for the i-th part of y. Then, we use dynamic programming to find a partition of y into B parts which approximately minimizes the sum of the edit distances between corresponding parts of x and y. Finally, we recurse on each of the individual parts. The main ingredient in the analysis is showing that if we consider only a small number of options for each part in y’s partition, we can still guarantee the existence of a partition which closely approximates the original edit distance between x and y.
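Here is a simplified, illustrative Python sketch of one level of this scheme; `best_partition`, the exhaustive set of breakpoints, and the use of exact edit distance as the stand-in estimator are my own assumptions for exposition (the actual algorithm restricts the candidate breakpoints and recurses on each pair of parts):

```python
def edit_distance(x, y):
    """Standard O(|x|*|y|) edit distance with a rolling row."""
    m, n = len(x), len(y)
    D = list(range(n + 1))
    for i in range(1, m + 1):
        prev, D[0] = D[0], i
        for j in range(1, n + 1):
            cur = min(D[j] + 1, D[j - 1] + 1,
                      prev + (x[i - 1] != y[j - 1]))
            prev, D[j] = D[j], cur
    return D[n]

def best_partition(x, y, B, estimate=edit_distance):
    """Split x into B equal parts, then use dynamic programming over
    breakpoints of y to pick the partition of y minimizing the sum of
    estimated distances between corresponding parts.  `estimate` stands
    in for the black-box estimation algorithm."""
    parts = [x[i * len(x) // B:(i + 1) * len(x) // B] for i in range(B)]
    INF = float("inf")
    # cost[j] = best total cost of matching the parts processed so far
    # against y[:j]; choice[i][j] records the breakpoint used.
    cost = [0] + [INF] * len(y)
    choice = [[0] * (len(y) + 1) for _ in range(B)]
    for i, part in enumerate(parts):
        new = [INF] * (len(y) + 1)
        for j in range(len(y) + 1):
            if cost[j] == INF:
                continue
            for j2 in range(j, len(y) + 1):
                c = cost[j] + estimate(part, y[j:j2])
                if c < new[j2]:
                    new[j2] = c
                    choice[i][j2] = j
        cost = new
    # Walk back through the recorded breakpoints.
    cuts, j = [len(y)], len(y)
    for i in range(B - 1, -1, -1):
        j = choice[i][j]
        cuts.append(j)
    cuts.reverse()
    return [y[cuts[i]:cuts[i + 1]] for i in range(B)], cost[len(y)]

parts, total = best_partition("kitten", "sitting", B=2)
print(parts, total)  # the chosen partition of y and its total cost
```

Since concatenating the per-part alignments gives a valid edit script, the sum of part distances always upper-bounds the true edit distance; the analysis in the paper shows that a good partition also exists among the restricted breakpoints, so the bound is nearly tight.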

Recall the setting of the most basic central limit theorem: suppose X₁, X₂, …, Xₙ are independent and identically distributed random variables of mean 0 and variance 1; then (X₁ + ⋯ + Xₙ)/√n converges (in a sense that we will make explicit later) to a standard Gaussian distribution as n goes to infinity.

What is the usual way of bounding the rate at which convergence occurs? One approach, known as the Lindeberg or hybridization method, is to iteratively replace each Xᵢ with a standard Gaussian, and bound the effect that each of these substitutions incurs. Another standard approach is to note that the density function of the sum of independent random variables is the convolution of their density functions; hence the Fourier transform of the density of the sum is the product of the Fourier transforms of the individual densities, and one can directly analyze the distance to the Fourier transform of the Gaussian (which is Gaussian!). Both of these approaches, however, seem to rely heavily on the independence of the Xᵢ.

The usual starting place in describing Stein’s Method is the fact that, for a standard Gaussian random variable G and any (differentiable) function f, E[f′(G) − G·f(G)] = 0, where f′ denotes the derivative of f. Furthermore, if a random variable X satisfies this equation for all such functions f, then X is a standard Gaussian! The idea is then to argue that, for any random variable X, the extent to which E[f′(X) − X·f(X)] differs from zero can be related to how far X is from Gaussian. An easy integration will establish the fact that a Gaussian satisfies this equation (and is the only distribution satisfying it). Before continuing further down this path, we will take a step back and see a more geometric (and enlightening) explanation for why E[f′(G) − G·f(G)] = 0. This alternate viewpoint will also provide a more general framing of Stein’s method that gives some indication of why we might expect it to be so adaptable to different settings.
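For completeness, the "easy integration" is just integration by parts. A sketch, writing φ(x) = e^{−x²/2}/√(2π) for the standard Gaussian density (so that φ′(x) = −x·φ(x)), and assuming f grows slowly enough that the boundary term vanishes:

```latex
\mathbb{E}[f'(G)]
  = \int_{-\infty}^{\infty} f'(x)\,\varphi(x)\,dx
  = \Big[f(x)\,\varphi(x)\Big]_{-\infty}^{\infty}
    - \int_{-\infty}^{\infty} f(x)\,\varphi'(x)\,dx
  = \int_{-\infty}^{\infty} x\,f(x)\,\varphi(x)\,dx
  = \mathbb{E}[G\,f(G)].
```

The middle step uses integration by parts, and the boundary term is killed by the rapid decay of φ.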

At the highest level, Stein's Method can be thought of as the following general recipe: Suppose you want to show that some distribution, $P$, is close to some "nice" distribution, $Q$.

- Find a transformation $T$ (a mapping from distributions to distributions) for which $Q$ is the unique fixed point.
- Next, show that the amount by which a distribution is changed by applying $T$ is related to the distance between that distribution and $Q$. [As a sanity check, the distance from $Q$ to itself is 0, and $T(Q) = Q$.]
- Finally, apply $T$ to the distribution $P$ we care about, and watch closely to see how much it is changed.

Crucially, the above approach turns the problem of comparing $P$ directly to $Q$ into the problem of comparing $P$ to a transformed version of itself, namely $T(P)$. This is one of the main reasons for the versatility of Stein's Method. While it shouldn't be obvious why it is often easier to compare $P$ to $T(P)$ than to directly compare $P$ to $Q$, a conceptual explanation is that comparing $P$ and $T(P)$ lets one keep whatever complicated structure exists in $P$: for example, if $P$ is a sum of dependent random variables, one might not need to disentangle these dependencies in order to make this comparison. Furthermore, as we will see, we actually won't even need to transform $P$; we can instead simply transform the lens through which we view the distribution, namely the set of "test functions" that correspond to the metric in which we are hoping to measure the similarity between $P$ and $Q$.

Let's now unpack this very general recipe in the case that $Q$ is the standard Gaussian. What is a transformation that has the standard Gaussian as a fixed point? There are many possibilities, but let's consider the transformation that consists of adding a small amount of Gaussian "noise" (i.e. convolving the density with a tiny Gaussian), and then linearly rescaling the distribution so that the variance of a univariate distribution with zero mean is left unchanged. This transformation is equally valid in the multivariate setting, but let's focus on distributions over $\mathbb{R}$. Concretely, for $\epsilon > 0$, let $T_\epsilon$ be the transformation that maps the distribution of $X$ to the distribution corresponding to the random variable $X' = \sqrt{1-\epsilon}\,X + \sqrt{\epsilon}\,G$, where $X$ is drawn from the original distribution, and $G$ is an independent random variable drawn from $N(0,1)$. Since the sum of independent Gaussians is Gaussian, and the variance of a sum of independent random variables is the sum of the variances, $T_\epsilon(Z) = Z$, where $Z$ is the standard Gaussian. And it is not hard to see that $N(0,1)$ is the unique fixed point of this transformation.
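A quick simulation (a sketch with arbitrary parameter choices of my own) illustrates both properties at once: iterating $T_\epsilon$ on a decidedly non-Gaussian starting distribution preserves the variance while driving the fourth moment toward the Gaussian value $\mathbb{E}[Z^4] = 3$:

```python
import random
import statistics

random.seed(1)

def T(samples, eps=0.1):
    # One step of T_eps: add Gaussian noise, rescale to preserve variance.
    return [(1 - eps) ** 0.5 * x + eps ** 0.5 * random.gauss(0, 1)
            for x in samples]

# Start far from Gaussian: Rademacher (+/- 1), mean 0, variance 1, E[X^4] = 1.
xs = [random.choice((-1, 1)) for _ in range(20000)]
for _ in range(100):
    xs = T(xs)

var = statistics.pvariance(xs)
fourth = statistics.fmean(x ** 4 for x in xs)
```

After 100 steps with $\epsilon = 0.1$, the weight $(1-\epsilon)^{100}$ on the original variable is negligible, so the sample should be essentially Gaussian: variance still near 1, fourth moment near 3 rather than the starting value 1.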

The connection between this transformation and the fact that $\mathbb{E}[f'(Z) - Z f(Z)] = 0$ for any function $f$, when $Z$ is a standard Gaussian random variable, comes from considering the transformation $T_\epsilon$ in the limit as $\epsilon \to 0$. [This infinitesimal transformation, when iteratively applied and viewed as a process, is known as the Ornstein-Uhlenbeck process.] For a function $f$, consider how $\mathbb{E}[f(X)]$ changes if we apply this infinitesimal transformation to $X$. We can directly analyze the effect of the two components of $T_\epsilon$, namely the addition of Gaussian noise, and then the rescaling by a factor of $\sqrt{1-\epsilon} \approx 1 - \frac{\epsilon}{2}$. To analyze the first component, note that if $G$ is distributed according to $N(0,\epsilon)$ then, to first order, $\mathbb{E}[f(X+G)] \approx \mathbb{E}[f(X)] + \frac{\epsilon}{2}\mathbb{E}[f''(X)]$. As a sanity check, if $f$ is linear, then this zero-mean noise will have no effect on the expectation, and if $f$ has positive second derivative, this noise will increase the expectation. To analyze the second component, the scaling of $X$ by a factor of $1 - \frac{\epsilon}{2}$, note that this will change $X$ by $-\frac{\epsilon}{2}X$, and hence, to first order, this will alter the expectation by $-\frac{\epsilon}{2}\mathbb{E}[X f'(X)]$. Putting these two pieces together yields that, for any random variable $X$,

$$\mathbb{E}[f(T_\epsilon(X))] - \mathbb{E}[f(X)] \approx \frac{\epsilon}{2}\,\mathbb{E}\left[f''(X) - X f'(X)\right].$$
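We can sanity-check this first-order formula in a toy case (of my own choosing) where everything has a closed form: take $X$ uniform on $\{0,1\}$ and $f(x) = e^x$, and use the Gaussian moment generating function $\mathbb{E}[e^{sG}] = e^{s^2/2}$ to evaluate the smoothing step exactly:

```python
import math

eps = 1e-3

# X uniform on {0, 1}; f(x) = e^x.
# E[f(sqrt(1-eps) x + sqrt(eps) G)] = e^{sqrt(1-eps) x} * e^{eps/2},
# using the Gaussian MGF E[e^{sG}] = e^{s^2/2} with s = sqrt(eps).
E_eps = 0.5 * (math.exp(eps / 2)
               + math.exp(math.sqrt(1 - eps)) * math.exp(eps / 2))
E_0 = 0.5 * (1 + math.e)

lhs = (E_eps - E_0) / eps  # (E[f(T_eps X)] - E[f(X)]) / eps
# (1/2) E[f''(X) - X f'(X)] = (1/2) * ((1 + e)/2 - e/2) = 1/4
rhs = 0.5 * (0.5 * (1 + math.e) - 0.5 * math.e)
```

Both sides come out to $\tfrac{1}{4}$ up to an $O(\epsilon)$ correction, matching the first-order claim.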

Hence, from the viewpoint of a function $f$, the amount by which the transformation $T_\epsilon$ alters the random variable $X$ is proportional to $\mathbb{E}[f''(X) - X f'(X)]$. And, if $Z$ is a standard Gaussian random variable, then we know this transformation preserves the distribution of $Z$, and hence $\mathbb{E}[f''(Z) - Z f'(Z)] = 0$. Letting the function $g$ denote $f'$, we have re-derived the fact that $\mathbb{E}[g'(Z) - Z g(Z)] = 0$ via a completely intuitive argument!!

Our central limit theorems will now come from analyzing the quantity $\mathbb{E}[f'(X) - X f(X)]$ for all functions $f$ belonging to some set that depends on the distance metric in question. (And, of course, we might as well drop one of the differentiations.) Suppose we would like to prove that $S = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i$ is close to the standard Gaussian $Z$, according to some specific metric. For some family of test functions, $\mathcal{H}$, suppose we wish to analyze $\sup_{h \in \mathcal{H}} \left| \mathbb{E}[h(S)] - \mathbb{E}[h(Z)] \right|$. If $\mathcal{H}$ consists of the set of functions that are indicators of measurable sets, then this quantity is the total variation distance ($\ell_1$ distance). If $\mathcal{H}$ consists of all indicators of half-lines, then the corresponding distance metric is the $\ell_\infty$ distance between the cumulative distribution functions, as in the usual Berry-Esseen bound. If $\mathcal{H}$ is the set of all Lipschitz-1 functions, then the distance metric corresponds to Wasserstein distance (also referred to as earth-mover distance). Stein's method can be made to work for many distance metrics, though it is especially easy to work with Wasserstein distance, as the continuity of the test functions plays well with the differentiation, etc.

So how do we actually get a CLT out of all this? How can we relate the distance between the distribution of $S$ and a Gaussian to some expression involving the quantity $\mathbb{E}[f'(S) - S f(S)]$? Given a test function $h$, we can attempt to solve the differential equation:

$$f'(x) - x f(x) = h(x) - \mathbb{E}[h(Z)].$$

Note that the expectation of the right side of this equation, for $x$ distributed according to $S$, is precisely the difference in expectations that we would like to analyze, and the left side is the expression that we know is 0 in expectation if $x$ is distributed according to $Z$, and which will tell us how far $S$ is from $Z$. This differential equation is fairly easy to solve for $f$, yielding the solution:

$$f(x) = e^{x^2/2} \int_{-\infty}^{x} \left( h(t) - \mathbb{E}[h(Z)] \right) e^{-t^2/2}\, dt.$$

Hence, we have now established that

$$\mathbb{E}[h(S)] - \mathbb{E}[h(Z)] = \mathbb{E}\left[f'(S) - S f(S)\right].$$
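As a check on the solution formula, note that for the Lipschitz test function $h(x) = x$ (for which $\mathbb{E}[h(Z)] = 0$) the integral can be done in closed form: $\int_{-\infty}^{x} t\, e^{-t^2/2}\, dt = -e^{-x^2/2}$, so the solution is $f \equiv -1$, and indeed $f'(x) - x f(x) = x = h(x)$. The sketch below evaluates the general formula numerically and recovers this:

```python
import math

def stein_solution(x, h, Eh, lo=-8.0, steps=4000):
    # f(x) = e^{x^2/2} * integral_{-inf}^{x} (h(t) - Eh) e^{-t^2/2} dt,
    # with the lower limit truncated at `lo` and a midpoint rule.
    step = (x - lo) / steps
    total = 0.0
    for k in range(steps):
        t = lo + (k + 0.5) * step
        total += (h(t) - Eh) * math.exp(-t * t / 2) * step
    return math.exp(x * x / 2) * total

# For h(x) = x, E[h(Z)] = 0 and the exact solution is f(x) = -1 everywhere.
vals = [stein_solution(x, lambda t: t, 0.0)
        for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```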

To finish, we leverage the properties of $f$, together with an understanding of how $S$ is related to $Z$, to bound the quantity on the right side. In the case that $S = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i$ for i.i.d. $X_i$'s, this will be a three-line argument where we compute the linear Taylor expansions of $f$ and $f'$ about $S_i := S - \frac{X_i}{\sqrt{n}}$:

$$\mathbb{E}[S f(S)] = \frac{1}{\sqrt{n}} \sum_{i=1}^n \mathbb{E}\left[X_i f(S)\right] = \frac{1}{\sqrt{n}} \sum_{i=1}^n \mathbb{E}\left[X_i \left( f(S_i) + \frac{X_i}{\sqrt{n}} f'(S_i) + R_i \right)\right],$$

where $R_i$ is the error term in the linear approximation, and is bounded in magnitude by $|R_i| \le \frac{X_i^2}{2n} \sup_x |f''(x)|$. Leveraging the independence of $X_i$ and $S_i$, and the assumption that each $X_i$ has zero mean and unit variance, the above simplifies to

$$\mathbb{E}[S f(S)] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[f'(S_i)\right] + \mathrm{err}_1, \qquad |\mathrm{err}_1| \le \frac{\mathbb{E}|X_1|^3}{2\sqrt{n}} \sup_x |f''(x)|.$$

We now analyze the second term, $\mathbb{E}[f'(S)]$, via a similar expansion:

$$\mathbb{E}[f'(S)] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[f'(S_i) + \frac{X_i}{\sqrt{n}} f''(\xi_i)\right],$$

where the second term in the bracket is bounded in magnitude as $\left|\mathbb{E}\left[\frac{X_i}{\sqrt{n}} f''(\xi_i)\right]\right| \le \frac{\mathbb{E}|X_1|}{\sqrt{n}} \sup_x |f''(x)| \le \frac{1}{\sqrt{n}} \sup_x |f''(x)|$, using $\mathbb{E}|X_1| \le \sqrt{\mathbb{E}[X_1^2]} = 1$. Combining these expressions, we have established the following:

$$\left| \mathbb{E}\left[f'(S) - S f(S)\right] \right| \le \frac{\sup_x |f''(x)|}{\sqrt{n}} \left( 1 + \frac{\mathbb{E}|X_1|^3}{2} \right).$$

If $\mathcal{H}$ is the class of indicators of half-lines, then we will need to work a bit harder to get a nice bound out of this expression. If, however, we care about Wasserstein distance, it is not hard to show that $\sup_x |f''(x)| \le 2$ for the solution $f$ corresponding to any Lipschitz-1 function $h$, in which case this immediately yields the familiar $O\!\left(\frac{\mathbb{E}|X_1|^3}{\sqrt{n}}\right)$ error bound for our central limit theorem!
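For intuition on the $1/\sqrt{n}$ rate, here is a numerical illustration (using the exact binomial distribution rather than Stein's method): we compute the Kolmogorov distance between the normalized sum of $n$ Rademacher variables and the Gaussian, and watch it shrink like $1/\sqrt{n}$:

```python
import math

def Phi(x):
    # Standard Gaussian CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def kolmogorov_dist(n):
    # sup_x |P(S_n <= x) - Phi(x)| for S_n = (X_1+...+X_n)/sqrt(n), X_i = +/-1.
    # S_n has atoms at (2k - n)/sqrt(n); the sup is attained at the atoms.
    cdf, d = 0.0, 0.0
    for k in range(n + 1):
        s = (2 * k - n) / math.sqrt(n)
        p = math.comb(n, k) / 2 ** n
        d = max(d, abs(cdf - Phi(s)))  # just below the atom
        cdf += p
        d = max(d, abs(cdf - Phi(s)))  # at the atom
    return d

d100, d400 = kolmogorov_dist(100), kolmogorov_dist(400)
```

Quadrupling $n$ should roughly halve the distance, consistent with a $\Theta(1/\sqrt{n})$ rate.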

For many other settings where one might wish to prove convergence bounds for central limit theorems, Stein's method takes roughly the same form as the above argument. The multivariate analog proceeds along essentially the same route, beginning with the observation that the multivariate Gaussian is the unique fixed point for the multivariate analog of the transformation described above. To show convergence to a non-Gaussian distribution, the same approach can also be applied. For example, to show convergence to a Poisson distribution of expectation $\lambda$, rather than using the expression $f'(x) - x f(x)$ (which is the "characterizing operator" for the standard Gaussian distribution), one instead considers $\lambda f(x+1) - x f(x)$, as the expectation of this expression is zero for every function $f$ if, and only if, $X$ is drawn from a Poisson distribution of expectation $\lambda$.
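The Poisson characterizing operator is easy to check numerically (a small sketch with an arbitrary choice of $\lambda$ and test function): its expectation vanishes under the matching Poisson distribution, but not under a mismatched one:

```python
import math

def poisson_expect(g, lam, kmax=60):
    # E[g(X)] for X ~ Poisson(lam), via the (rapidly converging) series.
    total, p = 0.0, math.exp(-lam)  # p = P(X = 0)
    for k in range(kmax + 1):
        total += p * g(k)
        p *= lam / (k + 1)          # advance to P(X = k + 1)
    return total

lam = 3.0
f = math.sin                                  # an arbitrary bounded test function
op = lambda k: lam * f(k + 1) - k * f(k)      # the characterizing operator

under_match = poisson_expect(op, lam)    # X ~ Poisson(3): expectation vanishes
under_mismatch = poisson_expect(op, 2.0)  # X ~ Poisson(2): it does not
```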

This blog post only begins to describe the tip of the elegant, deep, and multifaceted iceberg that is Stein’s Method. If your appetite is whetted and you would like to read more, I would suggest starting with Sourav Chatterjee’s “A Short Survey of Stein’s Method”. This survey contains both a short history of Stein’s method and a concise summary of the extensions and generalizations that have been developed over the past 30 years (together with some nice open questions along the lines of leveraging Stein’s Method to prove limiting behaviors of various graph-theoretic random variables, such as the length of the shortest traveling salesman tour on certain families of random graphs). For a more detailed reference, see the fairly recent book of Chen, Goldstein, and Shao, “Normal Approximation by Stein’s Method” (which is also available for download).
