By Steve Mussmann and Jacob Steinhardt.
The celebrated pigeonhole principle says that if we have disjoint sets each of size , then (the number of sets) is at most .
Suppose that instead of being disjoint, we require that every pair of sets has intersection at most . How much can this change the maximum number of sets ? Is it still roughly , or could the number be much larger? Interestingly, it turns out that there is a sharp phase transition at ; this is actually the key fact driving the second author’s recent paper on planted clique lower bounds (see the end of this post for some more details on this). These set systems are known as “combinatorial designs” and are heavily studied.
The phase transition is as follows: if , then (as is the case with disjoint sets), while if then . Remarkably, even though there can be at most sets of size , it is possible to have as many as sets of size (if for prime ).
To be a bit more formal, let be the maximum number of subsets of of size with pairwise intersection sizes at most (i.e., and for ). Assume that and . Then,
for , and
Below we will show how to give upper and lower bounds on in the case where . (The case is a relatively straightforward application of inclusion-exclusion together with pigeonhole.)
Suppose we have sets . For each , we will define to be all the sets that contain . Namely, . Note that all sets in have a pairwise intersection at the element and thus must be non-intersecting in . Thus, by pigeonhole, . Further, note that each occurs in distinct sets . Putting these together, we thus have
This yields , and hence .
Intuitively, organize into a rectangle with rows and columns (put any remaining elements off to the side). See figure above. Note that all columns of the rectangles are sets of size without any intersections and constitute sets. Further, we can add the sets composed of the diagonals and anti-diagonals (wrapping around on the left and right sides of the rectangle) which are sets of size without any pairwise intersections of size more than one. This will get us to sets. We could attempt to create sets from more “generalized diagonals” where we start at the top of the rectangle and go horizontally by columns for every row that we go down (and wrap around on the left and right sides). However, if is divisible by , these “generalized diagonals” would intersect some columns in two places instead of one.
However, if is prime, then all such “generalized diagonals” will intersect each other in at most one point. More precisely, for every , we can create a set by taking the elements of the rectangle for every row . There will be such sets that will all have size and will have pairwise intersections of size at most one since linear functions have at most one intersection over finite fields.
By Bertrand’s Postulate and since , there is some prime number such that . We can use this as number of columns in a rectangle. Note that we need which is satisfied since . The number of “generalized diagonals” is as mentioned above. Thus, we have explicitly constructed sets of size with pairwise intersections at most one where
and hence as claimed.
Generalization to Larger Intersections
In general, what if we allow the sets to have intersections of size ? The above sections study the case where , but the basic arguments generalize to larger values of as well. In particular, we have
where is the maximum number of possible subsets of of size with pairwise intersections at most . Note that this leaves open the question of what happens for ; in fact, there appears to be a phase transition around , but that is beyond the scope of this post.
Relationship to Planted Clique
The fact that there can be a large number of sets with small intersection when is actually a large part of the motivation behind this recent paper on a lower bound for the semi-random planted clique problem. To quote from the abstract, in the semi-random model, first “a set of vertices is chosen and all edges in are included; then, edges between and the rest of the graph are included with probability , while edges not touching are allowed to vary arbitrarily.”
Clearly, if the clique has size then it is possible to create identical cliques (since we can choose the edges between the remaining vertices arbitrarily). To handle this ambiguity, we also specify one vertex which is guaranteed to be in the planted clique . If then this is enough to uniquely identify the planted clique with high probability.
However, if , then it is possible to create cliques such that every vertex lies in exactly two cliques, and all cliques intersect in at most element. Moreover, if we choose the cliques appropriately then it is information-theoretically difficult to tell which of the cliques was the original planted clique . As a result, no matter which vertex is specified, we cannot recover with probability greater than (since we can’t distinguish from the other clique that belongs to).
Here, having an intersection size of (rather than a much larger intersection) is key, because this makes it possible to “steal” from the random edges between and to create additional cliques that intersect with . If the intersection size was instead , then with high probability there would be no subset of vertices of that had a common neighbor outside of , and so there would be no way to create a clique that intersected in elements.