When is a dataset robust to outliers?
What properties of a dataset ensure that its mean can be recovered even in the presence of outliers? We answer this question in a recent ITCS paper “Resilience: A Criterion for Learning in the Presence of Arbitrary Outliers” by myself, Moses Charikar, and Greg Valiant. The purpose of this blog post is to give a brief overview of the paper.
For the sake of exposition, I am going to skip over many of the details (as well as many of the results) in order to hopefully convey some of the interesting flavor to someone who is not already thinking about robust estimation.
To start, let us imagine an adversarial game between Alice (the attacker) and Bob (the learner). There is initially a "clean" dataset $S$ of points in $\mathbb{R}^d$, and Bob's goal is to estimate the mean $\mu$ of $S$. However, Alice is allowed to first adversarially corrupt the set $S$ in some way before Bob gets to see it. The question is: when does Bob have a strategy that allows him to output an estimate $\hat{\mu}$ of $\mu$ with small error, no matter what Alice does?
We will consider two types of adversaries:
- Deletion adversaries: If the clean set $S$ has $n$ elements, Alice is allowed to remove up to $\epsilon n$ of the elements from $S$ before showing it to Bob.
- Addition adversaries: If the clean set $S$ has $n$ elements, Alice is allowed to add up to $\epsilon n$ arbitrary elements to $S$ before showing it to Bob.
Below is a depiction of a possible strategy when Alice is an addition adversary:
The blue points are the clean data, and Bob wants to estimate the true mean (the green X). Alice has added outliers (in red) to try to fool Bob.
Additions vs. deletions. Intuitively, it seems like addition adversaries should be much more powerful than deletion adversaries: they can add arbitrary new points to $S$, rather than only deleting existing points. However, the main point of this blog post is that addition adversaries are actually always weaker than deletion adversaries. More precisely, whenever the mean of a set is robust to deletions, there is an (exponential-time) algorithm for recovering the mean in the presence of arbitrary additions. The proof of this is a simple pigeonhole argument that I will go over in the next section.
Note on high dimensions. The naive strategy for handling outliers is to throw away all points that are far away in norm from the empirical mean. However, for say the $\ell_2$-norm, this strategy will typically have error growing as $\epsilon\sqrt{d}$ in $d$ dimensions (since even for a Gaussian with identity covariance, most points have distance roughly $\sqrt{d}$ from the mean). We will see that more sophisticated strategies can do substantially better, obtaining dimension-independent error guarantees in many cases.
Resilient Sets, and a Pigeonhole Argument
To formalize what we mean by robustness to deletions, we make the following definition:
Definition (Resilience). A set $S$ with mean $\mu$ is said to be $(\epsilon, \delta)$-resilient in a norm $\|\cdot\|$ if, for every subset $T \subseteq S$ of size at least $(1-\epsilon)|S|$, we have
$$\Big\| \frac{1}{|T|} \sum_{x \in T} x \;-\; \mu \Big\| \le \delta.$$
In other words, a set is resilient if every large subset (of at least a $(1-\epsilon)$-fraction of the elements) has mean close to $\mu$. This means that if any $\epsilon$-fraction of elements is deleted, the empirical mean of the remaining points will still have distance at most $\delta$ to $\mu$.
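To make the definition concrete, here is a brute-force Python sketch (exponential in $|S|$, so only useful for toy examples; the function name and the restriction to the $\ell_2$-norm are my own choices) that checks $(\epsilon, \delta)$-resilience by enumerating all sufficiently large subsets:

```python
import itertools
import math
import numpy as np

def is_resilient(points, eps, delta):
    """Brute-force check of (eps, delta)-resilience in the l2-norm:
    every subset containing at least a (1 - eps)-fraction of the points
    must have mean within delta of the overall mean. Exponential time."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    mu = points.mean(axis=0)
    min_size = math.ceil((1 - eps) * n)
    for size in range(min_size, n + 1):
        for subset in itertools.combinations(range(n), size):
            sub_mean = points[list(subset)].mean(axis=0)
            if np.linalg.norm(sub_mean - mu) > delta:
                return False
    return True

# Tightly clustered points are resilient; a single far-away point breaks resilience.
print(is_resilient([[0.0], [0.1], [-0.1], [0.05]], eps=0.25, delta=0.1))  # True
print(is_resilient([[0.0], [0.1], [-0.1], [10.0]], eps=0.25, delta=0.1))  # False
```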
I claimed earlier that robustness to deletions implies robustness to additions. This is formalized in the following proposition:
Proposition (Resilience Implies Robustness). If $S$ is $(\epsilon, \delta)$-resilient, then there is an (exponential-time) algorithm for outputting a $\hat{\mu}$ with $\|\hat{\mu} - \mu\| \le 2\delta$, even if Alice is allowed to add $\epsilon|S|$ arbitrary points.
Proof. The proof is a simple pigeonhole argument. Suppose that $\tilde{S}$ is the set of points that Bob observes, and that $S \subseteq \tilde{S}$ is the set of clean points, which is $(\epsilon, \delta)$-resilient by assumption.
Now let $S'$ be any $(\epsilon, \delta)$-resilient subset of $\tilde{S}$ of size $|S|$. (Such a set exists since $S$ is one such set.) We claim that the mean of any such $S'$ is within $2\delta$ of the mean $\mu$ of $S$.
Indeed, since $S$ and $S'$ both have size $|S|$ and $|\tilde{S}| \le (1+\epsilon)|S|$, by pigeonhole we must have $|S \cap S'| \ge |S| + |S'| - |\tilde{S}| \ge (1-\epsilon)|S|$. In particular, taking $T = S \cap S'$ in the definition of resilience, we have
$$\Big\| \frac{1}{|S \cap S'|} \sum_{x \in S \cap S'} x \;-\; \mu \Big\| \le \delta.$$
In other words, the mean of $S \cap S'$ differs from the mean of $S$ by at most $\delta$. But since $S'$ is also resilient, the mean of $S \cap S'$ differs from the mean of $S'$ by at most $\delta$ as well. Therefore, by the triangle inequality, the means of $S$ and $S'$ are within $2\delta$ of each other, as claimed. $\square$
In summary, it suffices to find any large $(\epsilon, \delta)$-resilient set and output its mean. This procedure will be robust even to the addition of an $\epsilon$-fraction of outliers.
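Here is a minimal Python sketch of the procedure suggested by the proposition (again brute force and exponential time, purely to make the argument concrete; the helper names are mine and the paper does not spell out code): search for any subset of the observed points that has the clean size and is resilient, and output its mean.

```python
import itertools
import math
import numpy as np

def is_resilient(points, eps, delta):
    """Brute-force (eps, delta)-resilience check in the l2-norm (exponential time)."""
    points = np.asarray(points, dtype=float)
    n, mu = len(points), points.mean(axis=0)
    min_size = math.ceil((1 - eps) * n)
    return all(
        np.linalg.norm(points[list(T)].mean(axis=0) - mu) <= delta
        for size in range(min_size, n + 1)
        for T in itertools.combinations(range(n), size)
    )

def robust_mean_by_resilience(observed, clean_size, eps, delta):
    """Return the mean of any resilient subset of size `clean_size`.
    By the pigeonhole argument above, this is within 2*delta of the clean mean."""
    observed = np.asarray(observed, dtype=float)
    for candidate in itertools.combinations(range(len(observed)), clean_size):
        subset = observed[list(candidate)]
        if is_resilient(subset, eps, delta):
            return subset.mean(axis=0)
    return None  # unreachable when the clean set itself is resilient and present

# Five clean points near 0, plus eps * clean_size = 1 adversarially added point.
clean = [[0.0], [0.2], [-0.2], [0.1], [-0.1]]
observed = clean + [[50.0]]
print(robust_mean_by_resilience(observed, clean_size=5, eps=0.2, delta=0.3))  # -> [0.]
```

In the toy example the clean five points are resilient, one adversarial point has been added, and the returned estimate is within $2\delta$ of the true mean $0$.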
Applications
Resilience gives us a way of showing that certain robust estimation problems are possible. For instance, suppose that we have data points $x_1, \ldots, x_n \sim p$, where $p$ is a distribution with bounded covariance: $\|\Sigma\|_{\mathrm{op}} \le \sigma^2$, where $\Sigma$ is the covariance matrix of $p$. Then one can show that as long as $n$ is sufficiently large, the points $x_1, \ldots, x_n$ are $(\epsilon, O(\sigma\sqrt{\epsilon}))$-resilient in the $\ell_2$-norm with high probability (this is because any set whose empirical covariance is bounded in spectral norm is resilient). In particular, it is possible to recover the mean to error $O(\sigma\sqrt{\epsilon})$ in the presence of an $\epsilon$-fraction of outliers. Note that for many values of $d$ and $\epsilon$ this is substantially better than the naive bound, which grows as $\epsilon\sigma\sqrt{d}$ instead of $\sigma\sqrt{\epsilon}$.
More generally, if a distribution has bounded $k$th moments, then samples from that distribution (for sufficiently large $n$) will be $(\epsilon, O(\epsilon^{1-1/k}))$-resilient, while samples from a sub-Gaussian distribution will be $(\epsilon, O(\epsilon\sqrt{\log(1/\epsilon)}))$-resilient.
Using other norms (such as the $\ell_1$-norm) it is possible to get interesting results for problems with a more combinatorial flavor. In the paper, for instance, we show:
- The $\ell_1$-norm gives results for robust learning of discrete distributions.
- A truncated $\ell_1$-norm gives results for robustly learning stochastic block models. (Interestingly, it appears that the robust information-theoretic threshold probably matches the Kesten-Stigum threshold up to logarithmic factors.)
The latter result on stochastic block models requires establishing the surprising fact that robust estimation is possible even with a majority of outliers. I will not go into detail here, but it is possible to show this using a modification of the pigeonhole argument above.
History and Discussion
The problem of outlier-robust learning is very classical, going back at least to Tukey (1960). However, our interest here is in the high-dimensional setting, which surprisingly does not seem to have had satisfactory answers until quite recently. I believe part of this may be due to a historical accident of definitions: in the statistics literature following Tukey, many researchers were interested in developing estimators with good breakdown points. The breakdown point is defined as the maximum fraction of outliers tolerated before the estimator becomes meaningless (for instance, the median has a breakdown point of 50%, while the mean has a breakdown point of 0%, because a single outlier can change it arbitrarily). While many estimators have very bad breakdown points, Donoho (1982) and Donoho & Gasko (1992) developed an estimator with a very good breakdown point of essentially 50% (even in high dimensions). However, the error of that estimator can grow with the dimension, on the order of $\epsilon\sqrt{d}$, in the presence of an $\epsilon$-fraction of outliers. Another estimator with good robustness properties is the Tukey median (Tukey, 1975), but this is NP-hard to compute (Johnson & Preparata, 1978).
It is only very recently that (computationally-efficient) estimators with small error in high dimensions were developed. Concurrent papers by Lai, Rao, & Vempala (2016) and Diakonikolas, Kamath, Kane, Li, Moitra, & Stewart (2016) showed how to robustly estimate the mean of various distributions in the presence of outliers, with error depending at most logarithmically on the dimension (DKKLMS16 get error completely independent of the dimension). My own interest in this problem came from considering robustness of crowdsourced data collection when some fraction of the raters are dishonest (SVC, 2016). I worked on this problem with Greg and Moses and we later realized that our techniques were actually fairly general and could be used for robustly solving arbitrary convex minimization problems (CSV, 2017).
However, most of this recent work uses fairly sophisticated algorithms and in general I suspect it is not easy for outsiders to this area to understand all of the intuition behind what is going on. This is what motivated considering the information-theoretic question in the previous section, because I think that once we are okay ignoring computational efficiency the picture becomes much clearer.
Other Cool Stuff in the Paper
While I would be happy if the only thing you take away from this blog post is the proof that resilience implies robustness, if you are interested there is some other cool stuff in our paper. Specifically:
- We obtain computationally efficient algorithms in certain settings (including $\ell_p$-norms for certain $p$).
- We show that the idea of resilience is applicable beyond mean estimation (in particular, for low-rank recovery).
- We show that for strongly convex norms, the properties of resilience and bounded covariance are closely linked.
To elaborate a bit more on the last point, it is not hard to show that any set whose empirical distribution has bounded covariance is also $(\epsilon, \delta)$-resilient for every $\epsilon$, where the value of $\delta$ depends on the covariance bound (and on $\epsilon$). However, it turns out that there is a converse provided the norm is strongly convex: given a set that is resilient in a strongly convex norm, it is always possible to delete a small number of points such that the remaining points have bounded covariance. The strong convexity assumption is actually important, and the proof is a nice application of minimax duality combined with Khintchine's decoupling inequality.
Anyways, hopefully this provides some encouragement to read the full paper, and we would be very interested in any questions or feedback (feel free to leave them in the comments).
References
[CSV17] M. Charikar, J. Steinhardt, and G. Valiant. Learning from untrusted data. In Symposium on Theory of Computing (STOC), 2017.
[DKKLMS16] I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016.
[D82] D. L. Donoho. Breakdown properties of multivariate location estimators. Ph.D. qualifying paper, 1982.
[DG92] D. L. Donoho and M. Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Annals of Statistics, 20(4):1803–1827, 1992.
[LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016.
[SCV18] J. Steinhardt, M. Charikar, and G. Valiant. Resilience: a criterion for learning in the presence of arbitrary outliers. In Innovations in Theoretical Computer Science (ITCS), 2018.
[SVC16] J. Steinhardt, G. Valiant, and M. Charikar. Avoiding imposters and delinquents: adversarial crowd-sourcing and peer prediction. In Advances in Neural Information Processing Systems (NIPS), 2016.
[T60] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to probability and statistics, 2:448–485, 1960.
[T75] J. W. Tukey. Mathematics and picturing of data. In ICM, volume 6, pages 523–531, 1975.