Trace Reconstruction from Complex Analysis

Suppose that s is an unknown binary string of length n. We are asked to recover s from its traces, where each trace \tilde s is a random subsequence of s obtained by deleting each bit of s independently with probability 1/2.

More formally, let \mathcal{D}_s denote the distribution of traces obtained from string s. For example, when s = 110, \mathcal{D}_s assigns a probability mass of 1/8 to each element in the multiset

\{\text{empty}, 1, 1, 0, 11, 10, 10, 110\}.
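As a quick sanity check (this snippet is my own illustration, not part of the original argument), the following Python code enumerates all 2^n deletion patterns and recovers exactly this distribution:

from itertools import product
from collections import Counter

def trace_distribution(s):
    """Exact trace distribution of s: enumerate all 2^n deletion patterns, each of probability 2^(-n)."""
    counts = Counter()
    for keep in product([0, 1], repeat=len(s)):        # 1 = keep the bit, 0 = delete it
        counts["".join(b for b, k in zip(s, keep) if k)] += 1
    return {t: c / 2 ** len(s) for t, c in counts.items()}

print(trace_distribution("110"))
# {'': 0.125, '0': 0.125, '1': 0.25, '10': 0.25, '11': 0.125, '110': 0.125}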

We then ask: what is the smallest number M(n) such that we can recover any s \in \{0, 1\}^n (say, with probability \ge 0.99) given M(n) independent samples from \mathcal{D}_s?

This trace reconstruction problem was first formulated by Batu-Kannan-Khanna-McGregor in 2004, and their central motivation came from the multiple sequence alignment problem in computational biology. Trace reconstruction is also a fundamental problem related to the deletion channel in communication theory: we view the hidden string as the transmitted message, and each trace as a received message that went through a deletion channel, which drops each bit with probability 1/2. The sample complexity M(n) then tells us how many independent copies need to be sent for the receiver to determine the original message.

Despite being a natural problem, trace reconstruction is still far from being well-understood, even from the information-theoretic perspective (i.e., without considering the computational complexity). In 2017, De-O’Donnell-Servedio and Nazarov-Peres independently proved M(n) \le \exp(O(n^{1/3})). A very recent breakthrough due to Chase further improved this bound to \exp(\tilde O(n^{1/5})), which is still super-polynomial. On the other hand, the best known sample complexity lower bound is merely \tilde\Omega(n^{3/2}), proved by Chase in another recent work.

In this blog post, I will explain why this problem is much more non-trivial than it might appear at first glance. I will also give an overview of the work of [DOS17, NP17], which, interestingly, reduces this seemingly combinatorial problem to complex analysis.

Observation: reconstruction \approx distinguishing \approx TV-distance. Let us start with a natural first attempt at the problem. Define \delta_n as the minimum statistical distance between the trace distributions of two different length-n strings:

\delta_n = \min_{x \ne y \in \{0, 1\}^n}d_{\textrm{TV}}(\mathcal{D}_x, \mathcal{D}_y).

It is not hard to show that 1/\delta_n bounds M(n) from both sides, up to polynomial factors:

1/\delta_n \lesssim M(n) \lesssim n/\delta_n^2.

The lower bound holds because any trace reconstruction algorithm must be able to distinguish s = x from s = y for every pair of distinct strings (x, y), and this requires \Omega(1 / \delta_n) samples for the minimizing pair (x, y) in the definition of \delta_n. For the upper bound, we note that every pair (x, y) can be distinguished with error probability o(2^{-n}) using O(n/\delta_n^2) samples. We say that string x “beats” y if the distinguisher for the pair (x, y) decides that s = x. By a union bound, the correct answer s “beats” every other string with high probability. We can then obtain a reconstruction algorithm by running the distinguisher for every pair of strings, and outputting the unique string that “beats” all of the other 2^n - 1 strings.

Thus, to determine whether the sample complexity M(n) is polynomial in n, it suffices to determine whether \delta_n scales as 1/\mathrm{poly}(n) or is much smaller. Unfortunately, it turns out to be highly non-trivial to bound \delta_n, and even bounding d_{\textrm{TV}}(\mathcal{D}_x, \mathcal{D}_y) for “simple” x and y can be hard. Imagine that we try to reason about the distribution \mathcal{D}_x: the probability mass that \mathcal{D}_x assigns to a string \tilde s is proportional to the number of times that \tilde s appears as a subsequence of x, but this count is already hard to express or control analytically, unless x has a very simple pattern. This is roughly where this natural attempt gets stuck.
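To make this quantity concrete (the snippet below is my own illustration; it plays no role in what follows), the subsequence count, and hence \mathcal{D}_x(\tilde s) = (\text{count}) \cdot 2^{-n}, is easy to compute by a standard dynamic program, even though it is hard to control analytically:

def subsequence_count(x, t):
    """Number of ways t occurs as a (not necessarily contiguous) subsequence of x."""
    dp = [0] * (len(t) + 1)                # dp[j] = number of ways to match the first j characters of t
    dp[0] = 1
    for c in x:
        for j in range(len(t), 0, -1):     # go backwards so each bit of x is used at most once per match
            if t[j - 1] == c:
                dp[j] += dp[j - 1]
    return dp[-1]

def trace_probability(x, t):
    """D_x(t): probability that the deletion channel maps x to the trace t."""
    return subsequence_count(x, t) / 2 ** len(x)

print(trace_probability("110", "10"))      # 2 / 8 = 0.25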

Distinguishing using estimators. Now we turn to a different approach that underlies the recent breakthrough on trace reconstruction algorithms. As discussed earlier, we can focus on the problem of distinguishing the case s = x from s = y for fixed strings x \ne y \in \{0, 1\}^n. Let f: \{0, 1\}^{\le n} \to \mathbb{R} be a function defined over all possible traces from a length-n string. We consider the following algorithm for distinguishing \mathcal{D}_x from \mathcal{D}_y:

Step 1. Given traces \tilde s_1, \tilde s_2, \ldots, compute the average of f(\tilde s_1), f(\tilde s_2), \ldots
Step 2. Output x if the average is closer to \mathbb{E}_{\tilde x \sim \mathcal{D}_x}[f(\tilde x)] than to \mathbb{E}_{\tilde y \sim \mathcal{D}_y}[f(\tilde y)], and output y otherwise.

For the above to succeed, f needs to satisfy the following two conditions:

(Separation) \mathcal{D}_x and \mathcal{D}_y are well-separated under f, namely |\mathbb{E}[f(\tilde x)] - \mathbb{E}[f(\tilde y)]| \ge \epsilon for some \epsilon > 0.
(Boundedness) The expectation of f can be efficiently estimated. This is guaranteed if, for some B, |f(\tilde s)| \le B holds for every \tilde s \in \{0, 1\}^{\le n}.

Assuming the above, a standard concentration argument shows that we can distinguish x and y using O((B/\epsilon)^2) samples, and thus M(n) \lesssim n \cdot (B/\epsilon)^2.
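Here is a minimal sketch of this two-step distinguisher in Python (my own illustration; the function names and the toy estimator below are not from [DOS17, NP17]). The caller supplies the estimator f together with its expectations under \mathcal{D}_x and \mathcal{D}_y:

import random

def sample_trace(s):
    """One pass of s through the deletion channel: each bit is kept independently with probability 1/2."""
    return "".join(b for b in s if random.random() < 0.5)

def distinguish(traces, f, mean_x, mean_y):
    """Step 1: average f over the observed traces.
    Step 2: output 'x' if the average is closer to E[f] under D_x, and 'y' otherwise."""
    avg = sum(f(t) for t in traces) / len(traces)
    return "x" if abs(avg - mean_x) <= abs(avg - mean_y) else "y"

# Toy example: f(trace) = number of ones. Under D_s its expectation is (number of ones in s) / 2,
# so this particular f separates strings of different Hamming weight (but of course not all pairs).
x, y = "1101", "1001"
traces = [sample_trace(x) for _ in range(2000)]
print(distinguish(traces, lambda t: t.count("1"), mean_x=1.5, mean_y=1.0))   # prints 'x' with high probability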

In principle, a near-optimal choice of f would be to set f(\tilde s) to be the indicator of \mathcal{D}_x(\tilde s) \ge \mathcal{D}_y(\tilde s), which gives boundedness B = 1 and separation \epsilon = d_{\textrm{TV}}(\mathcal{D}_x, \mathcal{D}_y) \ge \delta_n. However, as argued above, this optimal choice of f can still be hard to analyze. Instead, we focus on choosing a simpler function f, which potentially gives a suboptimal ratio B/\epsilon but admits a simpler analysis.

Sketch of the \exp(O(n^{1/3})) Upper Bound. The main results of [DOS17] and [NP17] follow from choosing f to be a linear function. Given a trace \tilde s = \tilde s_0 \tilde s_1 \tilde s_2 \cdots, we consider the following polynomial whose coefficients are the bits of \tilde s:

\tilde S(z) = \sum_{k}\tilde s_k z^k = \tilde s_0 + \tilde s_1 z + \tilde s_2 z^2 + \cdots.

For some number z to be determined later, we consider the estimator f(\tilde s) = \tilde S(z) = \tilde s_0 + \tilde s_1 z + \tilde s_2 z^2 + \cdots, which is indeed a linear function of the trace \tilde s.

At first glance, it might be unclear why we choose the coefficients to be powers of z. The following fact justifies this choice by showing that polynomials interact with the deletion channel very nicely: the expectation of \tilde S(z) is exactly half the evaluation of the polynomial S(\cdot) (the polynomial with coefficients s_0, s_1, \ldots, s_{n-1}) at a slightly different point.

Fact: For any string s \in \{0, 1\}^n,

\mathbb{E}_{\tilde s \sim \mathcal{D}_s}\left[\tilde S(z)\right] = \frac{1}{2}S\left(\frac{z+1}{2}\right).
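For completeness, here is a short derivation of the Fact (spelled out by me; it is the standard argument). Bit s_j survives the channel with probability 1/2, and conditioned on surviving, its position in the trace equals the number of surviving bits among s_0, \ldots, s_{j-1}, which is distributed as \mathrm{Bin}(j, 1/2) independently of whether bit j itself survives. Therefore

\mathbb{E}_{\tilde s \sim \mathcal{D}_s}\left[\tilde S(z)\right] = \sum_{j=0}^{n-1}s_j \cdot \frac{1}{2} \cdot \mathbb{E}\left[z^{\mathrm{Bin}(j, 1/2)}\right] = \sum_{j=0}^{n-1}s_j \cdot \frac{1}{2}\left(\frac{z+1}{2}\right)^j = \frac{1}{2}S\left(\frac{z+1}{2}\right),

where the middle step uses \mathbb{E}[z^{\mathrm{Bin}(j, 1/2)}] = \left(\frac{z+1}{2}\right)^j.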

Equivalently, we have \mathbb{E}[\tilde S(2z-1)] = \frac{1}{2}S(z). Writing X and Y for the polynomials whose coefficients are the bits of x and y respectively, we can rephrase our requirements on the choice of z as follows:

(Separation) |X(z) - Y(z)| \ge \epsilon for some \epsilon that is not too small.
(Boundedness) |\tilde S(2z - 1)| \le B for every possible trace \tilde s, for some B that is not too large.

After some thought, one finds that it is beneficial to choose z such that both |z| and |2z - 1| are close to 1, since this ensures that different bits of the string are assigned weights of similar magnitudes in the polynomials above. These two conditions hold if and only if z \approx 1.

[Figure: The overall plan for distinguishing strings x and y.]

The crucial idea in both papers [DOS17, NP17] is to consider the polynomials in the complex plane \mathbb{C} instead of on the real line. Fortunately, all the previous discussion still holds for complex numbers, with |\cdot| interpreted as modulus instead of absolute value. In [NP17], the authors chose z from a small arc of the unit circle:

A_L = \{e^{i\theta}: \theta\in[-\pi/L, \pi/L]\}.

[Figure: The arc A_L, marked by the red box and dotted lines.]

For every z \in A_L, it is easy to upper bound |\tilde S(2z - 1)|: it follows from simple calculus that |2z - 1| \le 1 + O(L^{-2}), and thus for every \tilde s:

\left|\tilde S(2z-1)\right| \le \sum_{k=0}^{n-1}\left|(2z-1)^k\right| \le n \cdot \left[1 + O(L^{-2})\right]^n = e^{O(n/L^2)}.
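Concretely, the calculus step is the following one-line computation (added here for completeness): for z = e^{i\theta} with |\theta| \le \pi/L,

|2z - 1|^2 = (2\cos\theta - 1)^2 + (2\sin\theta)^2 = 5 - 4\cos\theta \le 5 - 4\left(1 - \frac{\theta^2}{2}\right) = 1 + 2\theta^2 \le 1 + O(L^{-2}),

and hence |2z - 1| \le 1 + O(L^{-2}).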

To lower bound |X(z) - Y(z)|, note that since both x and y are binary, all the coefficients of X - Y are in \{-1, 0, 1\}. Such polynomials are known as Littlewood polynomials. The separation condition is then reduced to the following claim in complex analysis:

Littlewood polynomials cannot be too “flat” over a short arc around 1.

This is indeed the case:

Lemma 1. ([Borwein and Erdélyi, 1997]) For every nonzero Littlewood polynomial p,

\max_{z \in A_L}|p(z)| \ge e^{-O(L)}.

By Lemma 1, there exists z \in A_L such that the separation condition holds with \epsilon = e^{-O(L)}. Recall that boundedness holds with B = e^{O(n/L^2)}. This shows that the sample complexity is upper bounded by (B/\epsilon)^2 = \exp(O(n/L^2 + L)), which is minimized (up to constants in the exponent) by balancing the two terms n/L^2 = L, i.e., at L = n^{1/3}. This proves the upper bound M(n) = \exp(O(n^{1/3})).

Proof of a weaker lemma. While the original proof of Lemma 1 is a bit technical, [NP17] presented a beautiful and much simpler proof of the following weaker result, in which the e^{-O(L)} lower bound is replaced by n^{-O(L)}, where n is an upper bound on the degree of the Littlewood polynomial.

Lemma 2. ([Lemma 3.1, NP17]) For every nonzero Littlewood polynomial p of degree < n,

\max_{z \in A_L}|p(z)| \ge n^{-O(L)}.

Proof. Without loss of generality, we may assume that the constant term of p(z) is 1: if instead the lowest-order term of p(z) is \pm z^m for some m, we may consider the polynomial \pm p(z)/z^m, which is still a Littlewood polynomial and has the same modulus as p at every point of the unit circle.

Let M = \max_{z \in A_L}|p(z)| be the maximum modulus of p over arc A_L, and \omega = e^{2\pi i/L} be an L-th root of unity. Consider the following polynomial:

q(z) = \prod_{j=0}^{L-1}p(\omega^j z) = p(z) \cdot p(\omega z) \cdot \cdots \cdot p(\omega^{L-1} z).

For every z on the unit circle, at least one of the points \omega^j z falls into the arc A_L, since these L points are equally spaced on the unit circle at angular distance 2\pi/L, which is exactly the angular width of A_L; for this j we have |p(\omega^j z)| \le M. (See the figure below for a proof by picture.) The moduli of the remaining L - 1 factors are trivially bounded by n, since p has at most n coefficients, each of modulus at most 1. Thus, |q(z)| \le M \cdot n^{L-1}.

On the other hand, we note that q(0) = [p(0)]^L = 1. The maximum modulus principle then implies that |q(z)| \ge |q(0)| = 1 for some z on the unit circle. Therefore, we must have M \cdot n^{L - 1} \ge 1, which implies M \ge n^{-(L - 1)} = n^{-O(L)} and completes the proof. \square

[Figure: Points involved in the definition of q(z) for L = 4. In this case, \omega^3 z lies inside the arc A_L, which is marked by the red lines.]
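As a small numerical sanity check (my own, using a grid approximation of the arc and a few random Littlewood polynomials; not part of the original post), one can verify that the maximum over A_L comfortably exceeds the n^{-(L-1)} bound from the proof:

import numpy as np

rng = np.random.default_rng(0)

def max_on_arc(coeffs, L, grid=2000):
    """Approximate max_{z in A_L} |p(z)|, where coeffs[k] is the coefficient of z^k."""
    thetas = np.linspace(-np.pi / L, np.pi / L, grid)
    z = np.exp(1j * thetas)
    return np.abs(np.polyval(coeffs[::-1], z)).max()   # np.polyval expects highest degree first

n, L = 200, 6
for _ in range(5):
    p = rng.choice([-1, 1], size=n)                    # a random Littlewood polynomial of degree n - 1
    print(max_on_arc(p, L), ">=", n ** -(L - 1))       # the proof of Lemma 2 guarantees at least n^{-(L-1)}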

Beyond linear estimators? The work of [DOS17, NP17] not only proved the \exp(O(n^{1/3})) upper bound, but also showed that this is the best sample complexity one can get from a linear estimator f. The recent work of [Cha20] goes beyond these linear estimators by taking higher moments of the trace into account. The idea turns out to be a natural one: consider the “k-grams” (i.e., length-k substrings) of the string for some k > 1. In more detail, we fix some string w \in \{0, 1\}^k, and consider the binary polynomial whose coefficients indicate the positions at which w appears as a (contiguous) substring. The key observation is that there exists a choice of w such that the resulting polynomial is sparse (in the sense that the degrees of the nonzero monomials are far apart). The technical part of [Cha20] is then devoted to proving an analogue of Lemma 1 that is specialized to these “sparse” Littlewood polynomials.
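To make the bookkeeping concrete (this tiny snippet is my own illustration of the coefficient vector only; the actual estimator in [Cha20] is considerably more involved):

def kgram_indicator(s, w):
    """Coefficient vector a with a[j] = 1 iff w occurs as a contiguous substring of s starting at position j."""
    k = len(w)
    return [1 if s[j:j + k] == w else 0 for j in range(len(s) - k + 1)]

print(kgram_indicator("1101101", "11"))   # [1, 0, 0, 1, 0, 0]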

Acknowledgments. I would like to thank Moses Charikar, Li-Yang Tan and Gregory Valiant for being on my quals committee and for helpful discussions about this problem.
