Does Pigeonhole Degrade Gracefully?

Posted on May 30, 2017 by somussmann in Uncategorized // 11 Comments

By Steve Mussmann and Jacob Steinhardt.

The celebrated pigeonhole principle says that if we have disjoint sets $S_1, S_2, \dots, S_m \subseteq [n]$ each of size $s$ , then $m$ (the number of sets) is at most $\lfloor \frac{n}{s} \rfloor$ .

Suppose that instead of being disjoint, we require that every pair of sets has intersection at most $1$ . How much can this change the maximum number of sets $m$ ? Is it still roughly $\frac{n}{s}$ , or could the number be much larger? Interestingly, it turns out that there is a sharp phase transition at $s = \sqrt{n}$ ; this is actually the key fact driving the second author’s recent paper on planted clique lower bounds (see the end of this post for some more details on this). These set systems are known as “combinatorial designs” and are heavily studied.

The phase transition is as follows: if $s \geq \sqrt{2n}$ , then $m = \Theta(n/s)$ (as is the case with disjoint sets), while if $s \leq \sqrt{n/2}$ then $m = \Theta(n^2/s^2)$ . Remarkably, even though there can be at most $\sqrt{2n}$ sets of size $\sqrt{2n}$ , it is possible to have as many as $n$ sets of size $\sqrt{n}$ (if $n=p^2$ for prime $p$ ).

To be a bit more formal, let $f(n,s)$ be the maximum number of subsets of $[n]$ of size $s$ with pairwise intersection sizes at most $1$ (i.e., $|S_i| = s$ and $|S_i \cap S_j| \leq 1$ for $i \neq j$ ). Assume that $s \geq 2$ and $n \geq 16$ . Then,

$\frac{n^2}{4 s^2} \leq f(n,s) \leq \frac{2 n^2}{s^2}$ for $s \leq \sqrt{n/2}$ , and

$\lfloor \frac{n}{s} \rfloor \leq f(n,s) \leq \frac{2n}{s}$ for $s \geq \sqrt{2n}$ .

Below we will show how to give upper and lower bounds on $f(n,s)$ in the case where $s \leq \sqrt{n/2}$ . (The case $s \geq \sqrt{2n}$ is a relatively straightforward application of inclusion-exclusion together with pigeonhole.)

Upper bound

Suppose we have sets $S_1, ..., S_m \subseteq [n]$ . For each $j \in [n]$ , we will define $T_j$ to be all the sets $S_i$ that contain $j$ . Namely, $T_j = \{S_i \mid S_i \ni j\}$ . Note that all sets in $T_j$ have a pairwise intersection at the element $j$ and thus must be non-intersecting in $[n] \backslash \{j\}$ . Thus, by pigeonhole, $|T_j| \leq \frac{n-1}{s-1}$ . Further, note that each $S_i$ occurs in $|S_i|$ distinct sets $T_j$ . Putting these together, we thus have

$|S_i| = s, \quad |T_j| \leq \frac{n-1}{s-1}, \quad \text{and} \quad \sum_{i=1}^m |S_i| = \sum_{j=1}^n |T_j|$ .

This yields $m \leq \frac{n (n-1)}{s (s-1)} \leq \frac{2n^2}{ s^2}$ , and hence $f(n,s) \leq \frac{2n^2}{s^2}$ .

Lower Bound

design2

Intuitively, organize $[n]$ into a rectangle with $s$ rows and $t = \lfloor \frac{n}{s} \rfloor$ columns (put any remaining elements off to the side). See figure above. Note that all columns of the rectangles are sets of size $s$ without any intersections and constitute $t$ sets. Further, we can add the sets composed of the diagonals and anti-diagonals (wrapping around on the left and right sides of the rectangle) which are sets of size $s$ without any pairwise intersections of size more than one. This will get us to $3t$ sets. We could attempt to create sets from more “generalized diagonals” where we start at the top of the rectangle and go horizontally by $2$ columns for every row that we go down (and wrap around on the left and right sides). However, if $t$ is divisible by $2$ , these “generalized diagonals” would intersect some columns in two places instead of one.

However, if $t$ is prime, then all such “generalized diagonals” will intersect each other in at most one point. More precisely, for every $a \in \{0,...,t-1\}, b \in \{0,...,t-1\}$ , we can create a set by taking the elements of the rectangle $(i, a+bi (\text{mod } t))$ for every row $i$ . There will be $t^2$ such sets that will all have size $s$ and will have pairwise intersections of size at most one since linear functions have at most one intersection over finite fields.

By Bertrand’s Postulate and since $n \geq 16$ , there is some prime number $t$ such that $\lceil \frac{n}{2s} \rceil \leq t \leq \lfloor \frac{n}{s} \rfloor$ . We can use this as number of columns in a rectangle. Note that we need $s \leq t$ which is satisfied since $s \leq \sqrt{n/2}$ . The number of “generalized diagonals” is $t^2$ as mentioned above. Thus, we have explicitly constructed $m$ sets of size $s$ with pairwise intersections at most one where

$m = t^2 \geq (\lceil \frac{n}{2s} \rceil)^2 \geq \frac{n^2}{4s^2},$

and hence $f(n,s) \geq \frac{n^2}{4s^2}$ as claimed.

Generalization to Larger Intersections

In general, what if we allow the sets to have intersections of size $d$ ? The above sections study the case where $d = 1$ , but the basic arguments generalize to larger values of $d$ as well. In particular, we have

$(\frac{n}{2s})^{d+1} \leq f(n,s,d) \leq \frac{\binom{n}{d+1}}{\binom{s}{d+1}}$ for $s \leq \sqrt{n/2}$ ,

$\lfloor \frac{n}{s} \rfloor \leq f(n,s,d) \leq \frac{2n}{s}$ for $s \geq \sqrt{2dn}$

where $f(n,s,d)$ is the maximum number of possible subsets of $[n]$ of size $s$ with pairwise intersections at most $d$ . Note that this leaves open the question of what happens for $\sqrt{n} \ll s \ll \sqrt{dn}$ ; in fact, there appears to be a phase transition around $\sqrt{dn}$ , but that is beyond the scope of this post.

Relationship to Planted Clique

The fact that there can be a large number of sets with small intersection when $s \leq \sqrt{n/2}$ is actually a large part of the motivation behind this recent paper on a $\sqrt{n}$ lower bound for the semi-random planted clique problem. To quote from the abstract, in the semi-random model, first “a set $S$ of vertices is chosen and all edges in $S$ are included; then, edges between $S$ and the rest of the graph are included with probability $\frac{1}{2}$ , while edges not touching $S$ are allowed to vary arbitrarily.”

Clearly, if the clique has size $s$ then it is possible to create $\frac{n}{s}$ identical cliques (since we can choose the edges between the remaining $n-s$ vertices arbitrarily). To handle this ambiguity, we also specify one vertex $v$ which is guaranteed to be in the planted clique $S$ . If $s \gg \sqrt{n\log(n)}$ then this is enough to uniquely identify the planted clique with high probability.

However, if $s \ll \sqrt{n}$ , then it is possible to create $\frac{2n}{s}$ cliques such that every vertex lies in exactly two cliques, and all cliques intersect in at most $1$ element. Moreover, if we choose the cliques appropriately then it is information-theoretically difficult to tell which of the $\frac{2n}{s}$ cliques was the original planted clique $S$ . As a result, no matter which vertex $v$ is specified, we cannot recover $S$ with probability greater than $\frac{1}{2}$ (since we can’t distinguish $S$ from the other clique that $v$ belongs to).

Here, having an intersection size of $1$ (rather than a much larger intersection) is key, because this makes it possible to “steal” from the random edges between $S$ and $[n] \backslash S$ to create additional cliques that intersect with $S$ . If the intersection size was instead $d \gg \log(n)$ , then with high probability there would be no subset of $d$ vertices of $S$ that had a common neighbor outside of $S$ , and so there would be no way to create a clique $S'$ that intersected $S$ in $d$ elements.

11 Comments on Does Pigeonhole Degrade Gracefully?

Ante // May 30, 2017 at 8:18 am // Reply

That’s pretty cool! I wonder if this can be used to explain some statistical mechanics effects or condensed matter effects, since there’s an approach to stat mech using combinatorics.

LikeLike
salilvadhan // May 30, 2017 at 4:26 pm // Reply

Such set families have been studied in the literature on packings, designs, and cover-free families, eg Erdos-Frankl-Furedi 85 and Rodl 85. See Lemma 13 and Prop 14 in http://people.seas.harvard.edu/~salil/research/trev-abs.html for pointers.

LikeLike
- salilvadhan // May 31, 2017 at 2:28 pm // Reply
  
  Sorry, I see now that you are aware of the literature on designs. I didn’t have a chance to think about the parameter range that you’re interested in, but explicit constructions for some parameter ranges are discussed in Nisan-Wigderson and in Buhrman-Miltersen-Radhakrishnan-Ta-Shma and perhaps other places in the TCS literature.
  
  LikeLike
  - salilvadhan // May 31, 2017 at 2:39 pm //
    
    .. last reference should be Buhrman-Miltersen-Radhakrishnan-Venkatesh
    
    LikeLike
  - jsteinhardt // June 6, 2017 at 10:28 am //
    
    Thanks! If I understand the Nisan-Widgerson construction, I believe that in the language of this post it says that there are subsets of {1, …., n} such that:
    
    (1) Each set has size s = sqrt(n)
    (2) Every pair of sets has intersection at most d
    (3) The number of sets is 2^d
    
    This shows that if we allow d to grow at a rate faster than log(n), the number of sets actually grows superpolynomially. Actually I believe that this is true even if d = omega(1), based on the “Generalization to Larger Intersections” given in the post.
    
    LikeLike
Manas Joglekar // May 30, 2017 at 10:24 pm // Reply

Great post! There are a couple of points that confused me (which are clarified on reading the paper):
1) The edges in ~S are chosen after the random edges between S and ~S.
2) v is sampled randomly from S, after the graph is fixed. (If it is chosen along with S before fixing the other edges, it’s easy to make multiple cliques that intersect only on v).

LikeLike
- jsteinhardt // June 6, 2017 at 10:28 am // Reply
  
  Yes, thanks! Both of your clarifications are correct.
  
  LikeLike
Boaz Barak // May 31, 2017 at 12:26 am // Reply

Nice post!!
More generally, if you choose random sets of size $s$ out of a universe of size $n$ then the expected size of the intersection is $s^2/n$ .
Once you allow the intersection size to be about that size, then you can have combinatorial designs with exponentially more sets than otherwise.
This is also the key insight behind the Nisam-Wigserson pseudorandom generator (where one sets the intersection size to be roughly logarithmic in the size of circuit one is trying to fool).

LikeLike
- jsteinhardt // June 6, 2017 at 10:33 am // Reply
  
  Yes, the value for random sets seems to be the key threshold; however from this it wasn’t obvious to me how to get the right order of growth in the number of sets once we pass that threshold. For instance it is natural to hope that we could just argue that if we pick a large number of sets randomly, with non-zero probability they will all have sufficiently small intersection. But even if we use something like LLL, we only get n^2/s^3 (see here: https://jsteinhardt.wordpress.com/2017/03/17/sets-with-small-intersection/), whereas the answer should be n^2/s^2.
  
  LikeLike
Jacob Fox // June 14, 2017 at 11:02 pm // Reply

Great post!

I thought I would point out that the problem discussed in this blog post is closely related to the classical Zarankiewicz problem, whose study goes back more than 60 years:

https://en.wikipedia.org/wiki/Zarankiewicz_problem

In the Zarankiewicz problem, you have an m x n matrix, and you want to know how many one entries can you have without having an s x t all ones submatrix. You can think of your problem as the rows as being the characteristic vectors of the sets, and you want to avoid having a 2 by (d+1) all ones submatrix. You have each set having size s, and the number of columns being n, and ask for how many rows can you have. For the Zarankiewicz problem, you fix the number of rows and you essentially ask how many ones you can have on average in each row. The classical upper bound proof of Kovari-Sos-Turan uses Jensen’s inequality, with the worse case being when all rows have the same number of ones, which is the case of the problem in the blog.

LikeLike
- jsteinhardt // October 13, 2017 at 11:51 am // Reply
  
  That’s a cool connection! So if I understand correctly, in the language of this post, and letting z denote the Zarankiewicz function, we have
  
  s*f(n,s) <= z(n, f(n,s), d+1, 2) < sqrt(n*d)*(f(n,s)-1) + n
  
  Assuming that sqrt(n*d)/s 1, we can get a better bound by swapping the order of the arguments:
  
  s*f(n,s) <= z(f(n,s), n, 2, d+1) d (which is true unless the problem is trivial) we can we-arrange to say that f(n,s) <= O(n/s)^(d+1), which matches our bound.
  
  LikeLike

	DNF Minimization, Pa… on DNF Minimization, Part I
	Hao Sun on RANDOM & APPROX 2023
	Yet another backprop… on Backpropagation ≠ Chain R…
	Boaz Barak on Backpropagation ≠ Chain R…
	Lunjia Hu on Backpropagation ≠ Chain R…
	Boaz Barak on Backpropagation ≠ Chain R…
	kodlu on Average-Case Fine-Grained Hard…
	Average-Case Fine-Gr… on Average-Case Fine-Grained Hard…
	Average-Case Fine-Gr… on Average-Case Fine-Grained Hard…
	Average-Case Fine-Gr… on Average-Case Fine-Grained Hard…

Theory Dish

Stanford's CS Theory Research Blog