Suppose that $s$ is an unknown binary string of length $n$. We are asked to recover $s$ from its traces, where each trace $\tilde s$ is a random subsequence of $s$ obtained by deleting each bit of $s$ independently with probability $1/2$.

More formally, let $\mathcal{D}_s$ denote the distribution of traces obtained from string $s$. For example, when $s = 110$, $\mathcal{D}_s$ assigns a probability mass of $1/8$ to each element in the multiset

$\{\text{empty}, 1, 1, 0, 11, 10, 10, 110\}.$
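
As a sanity check, this multiset can be reproduced in a few lines of Python by enumerating all $2^3$ deletion patterns (a minimal sketch; the names are mine):

```python
from itertools import product

s = "110"
# Each deletion pattern keeps (1) or deletes (0) the corresponding bit of s.
traces = []
for mask in product([0, 1], repeat=len(s)):
    trace = "".join(bit for bit, keep in zip(s, mask) if keep)
    traces.append(trace if trace else "empty")

print(sorted(traces))  # ['0', '1', '1', '10', '10', '11', '110', 'empty']
```

Each of the $8$ deletion patterns occurs with probability $1/8$, giving exactly the multiset above.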

We then ask: what is the smallest number $M(n)$ such that we can recover any $s \in \{0, 1\}^n$ (say, with probability $\ge 0.99$) given $M(n)$ independent samples from $\mathcal{D}_s$?

This trace reconstruction problem was first formulated by Batu-Kannan-Khanna-McGregor in 2004, and their central motivation came from the multiple sequence alignment problem in computational biology. Trace reconstruction is also a fundamental problem related to the deletion channel in communication theory: we view the hidden string as the transmitted message, and each trace as a received message that went through a deletion channel, which drops each bit with probability $1/2$. The sample complexity $M(n)$ then tells us the number of independent copies that need to be sent for the receiver to determine the original message.

Despite being a natural problem, trace reconstruction is still far from being well-understood, even from the information-theoretic perspective (i.e., without considering the computational complexity). In 2017, De-O’Donnell-Servedio and Nazarov-Peres independently proved $M(n) \le \exp(O(n^{1/3}))$. A very recent breakthrough due to Chase further improved this bound to $\exp(\tilde O(n^{1/5}))$, which is still super-polynomial. On the other hand, the best known sample complexity lower bound is merely $\tilde\Omega(n^{3/2})$, proved by Chase in another recent work.

In this blog post, I will explain why this problem is much harder than it might appear at first glance. I will also give an overview of the work of [DOS17, NP17], which, interestingly, reduces this seemingly combinatorial problem to complex analysis.

Observation: reconstruction $\approx$ distinguishing $\approx$ TV-distance. Let us start with a natural first attempt at the problem. Define $\delta_n$ as the minimum total variation (statistical) distance between the trace distributions of two distinct length-$n$ strings:

$\delta_n = \min_{x \ne y \in \{0, 1\}^n}d_{\textrm{TV}}(\mathcal{D}_x, \mathcal{D}_y).$

It is not hard to show that $1/\delta_n$ bounds $M(n)$ from both sides, up to a polynomial factor:

$1/\delta_n \lesssim M(n) \lesssim n/\delta_n^2.$

The lower bound holds because any trace reconstruction algorithm must be able to distinguish $s = x$ from $s = y$ for every pair of distinct strings $(x, y)$, and this requires $\Omega(1 / \delta_n)$ samples for the minimizing pair $(x, y)$ in the definition of $\delta_n$. For the upper bound, we note that every pair $(x, y)$ can be distinguished with error probability $o(2^{-n})$ using $O(n/\delta_n^2)$ samples. We say that string $x$ “beats” $y$ if the distinguisher for $(x, y)$ decides that “$s = x$“. By a union bound, the correct answer $s$ “beats” every other string with high probability. We can then obtain a reconstruction algorithm by running the distinguisher for every pair of strings, and outputting the unique string that “beats” all other $2^n-1$ strings.

Thus, to determine whether the sample complexity $M(n)$ is polynomial in $n$, it suffices to determine whether $\delta_n$ scales as $1/\mathrm{poly}(n)$ or is much smaller. Unfortunately, it turns out to be highly non-trivial to bound $\delta_n$, and even bounding $d_{\textrm{TV}}(\mathcal{D}_x, \mathcal{D}_y)$ for “simple” $x$ and $y$ can be hard. Imagine that we try to reason about the distribution $\mathcal{D}_x$: the probability mass that $\mathcal{D}_x$ assigns to a string $\tilde s$ is proportional to the number of times that $\tilde s$ appears as a subsequence of $x$, but this count is already hard to express or control unless $x$ has a very simple pattern. This is roughly where this natural attempt gets stuck.
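
For small $n$, though, one can compute $\delta_n$ exactly by brute force: enumerate the trace distribution of every length-$n$ string (via all $2^n$ deletion patterns) and minimize the TV distance over all pairs. A quick sketch I wrote for illustration (exponential time, so only feasible for small $n$):

```python
from collections import Counter
from itertools import product

def trace_distribution(s):
    """Exact trace distribution D_s: each of the 2^n deletion patterns
    occurs with probability 2^(-n)."""
    dist = Counter()
    for mask in product([0, 1], repeat=len(s)):
        trace = "".join(b for b, keep in zip(s, mask) if keep)
        dist[trace] += 2.0 ** (-len(s))
    return dist

def tv_distance(p, q):
    return 0.5 * sum(abs(p[t] - q[t]) for t in set(p) | set(q))

def delta(n):
    """delta_n: minimum TV distance over all pairs of distinct length-n strings."""
    strings = ["".join(bits) for bits in product("01", repeat=n)]
    dists = {s: trace_distribution(s) for s in strings}
    return min(tv_distance(dists[x], dists[y])
               for i, x in enumerate(strings) for y in strings[i + 1:])

for n in range(2, 7):
    print(n, delta(n))
```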

Distinguishing using estimators. Now we turn to a different approach that underlies the recent breakthrough on trace reconstruction algorithms. As discussed earlier, we can focus on the problem of distinguishing the case $s = x$ from $s = y$ for fixed strings $x \ne y \in \{0, 1\}^n$. Let $f: \{0, 1\}^{\le n} \to \mathbb{R}$ be a function defined over all possible traces from a length-$n$ string. We consider the following algorithm for distinguishing $\mathcal{D}_x$ from $\mathcal{D}_y$:

Step 1. Given traces $\tilde s_1, \tilde s_2, \ldots$, compute the average of $f(\tilde s_1), f(\tilde s_2), \ldots$
Step 2. Output $x$ if the average is closer to $\mathbb{E}_{\tilde x \sim \mathcal{D}_x}[f(\tilde x)]$ than to $\mathbb{E}_{\tilde y \sim \mathcal{D}_y}[f(\tilde y)]$, and output $y$ otherwise.

For the above to succeed, $f$ needs to satisfy the following two conditions:

(Separation) $\mathcal{D}_x$ and $\mathcal{D}_y$ are well-separated under $f$, namely $|\mathbb{E}[f(\tilde x)] - \mathbb{E}[f(\tilde y)]| \ge \epsilon$ for some $\epsilon > 0$.
(Boundedness) The expectation of $f$ can be efficiently estimated from samples. This is guaranteed if, for some $B$, $|f(\tilde s)| \le B$ holds for every $\tilde s \in \{0, 1\}^{\le n}$.

Assuming the above, a standard concentration argument (e.g., Hoeffding's inequality) shows that we can distinguish $x$ and $y$ using $O((B/\epsilon)^2)$ samples, and thus $M(n) \lesssim n \cdot (B/\epsilon)^2$.
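
In code, the distinguisher from Steps 1 and 2 is just an empirical mean plus a nearest-mean rule. A minimal sketch, assuming we can sample traces and that the two expectations have been precomputed (all names here are mine):

```python
import random

def sample_trace(s):
    """One draw from D_s: delete each bit independently with probability 1/2."""
    return "".join(b for b in s if random.random() < 0.5)

def distinguish(f, mu_x, mu_y, traces):
    """Average f over the traces; output 'x' if the average is closer to
    mu_x = E[f(trace of x)] than to mu_y = E[f(trace of y)]."""
    avg = sum(f(t) for t in traces) / len(traces)
    return "x" if abs(avg - mu_x) <= abs(avg - mu_y) else "y"
```

The same code works verbatim when $f$ is complex-valued (as it will be below), since `abs` computes the modulus.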

In principle, a near-optimal choice of $f$ would be setting $f(\tilde s)$ to be the indicator of $\mathcal{D}_x(\tilde s) \ge \mathcal{D}_y(\tilde s)$, which gives boundedness $B = 1$ and separation $\epsilon = d_{\textrm{TV}}(\mathcal{D}_x, \mathcal{D}_y) \ge \delta_n$. However, as argued above, this optimal choice of $f$ can still be hard to analyze. Instead, we focus on choosing a simpler function $f$, which potentially gives a suboptimal $B/\epsilon$ but admits simple analyses.

Sketch of the $\exp(O(n^{1/3}))$ Upper Bound. The main results of [DOS17] and [NP17] follow from choosing $f$ to be a linear function. Given a trace $\tilde s = \tilde s_0 \tilde s_1 \tilde s_2 \cdots$, we consider the following polynomial with coefficients being the bits of $\tilde s$:

$\tilde S(z) = \sum_{k}\tilde s_k z^k = \tilde s_0 + \tilde s_1 z + \tilde s_2 z^2 + \cdots.$

For some number $z$ to be determined later, we consider the estimator $f(\tilde s) = \tilde S(z)$, which is indeed a linear function of the bits of $\tilde s$.

At first glance, it might be unclear why we choose the coefficients to be powers of $z$. The following fact justifies this choice by showing that polynomials interact with the deletion channel very nicely: The expectation of $\tilde S(z)$ is exactly the evaluation of the polynomial $S(\cdot)$ with coefficients $s_0, s_1, \ldots, s_{n-1}$, but at a slightly different point.

Fact: For any string $s \in \{0, 1\}^n$,

$\mathbb{E}_{\tilde s \sim \mathcal{D}_s}\left[\tilde S(z)\right] = \frac{1}{2}S\left(\frac{z+1}{2}\right).$
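
This follows from a short calculation: bit $s_j$ of the original string survives the channel with probability $1/2$, and conditioned on surviving, it lands at position $k$ of the trace exactly when $k$ of the $j$ bits preceding it survive, which happens with probability $\binom{j}{k}2^{-j}$. By linearity of expectation and the binomial theorem,

$\mathbb{E}_{\tilde s \sim \mathcal{D}_s}\left[\tilde S(z)\right] = \sum_{j=0}^{n-1}s_j\sum_{k=0}^{j}\binom{j}{k}2^{-j-1}z^k = \sum_{j=0}^{n-1}s_j \cdot \frac{1}{2}\left(\frac{z+1}{2}\right)^j = \frac{1}{2}S\left(\frac{z+1}{2}\right).$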

Equivalently, we have $\mathbb{E}[\tilde S(2z-1)] = \frac{1}{2}S(z)$, which allows us to rephrase our requirements on the choice of $z$ as follows:

(Separation) $|X(z) - Y(z)| \ge \epsilon$ for some $\epsilon$ that is not too small, where $X$ and $Y$ denote the polynomials whose coefficients are the bits of $x$ and $y$, respectively.
(Boundedness) $|\tilde S(2z - 1)| \le B$ for every possible trace $\tilde s$, for some $B$ that is not too large.

After some thought, one finds it beneficial to choose $z$ such that both $|z|$ and $|2z - 1|$ are close to $1$, since this ensures that different bits of the string are assigned weights of similar magnitude in the polynomials above. On the real line, these two conditions hold if and only if $z \approx 1$.

The crucial idea in both papers [DOS17, NP17] is to consider the polynomials in the complex plane $\mathbb{C}$ instead of on the real line. Fortunately, all the previous discussion still holds for complex numbers, with $|\cdot|$ interpreted as modulus instead of absolute value. In [NP17], the authors chose $z$ from a small arc of the unit circle:

$A_L = \{e^{i\theta}: \theta\in[-\pi/L, \pi/L]\}$.

For every $z \in A_L$, it is easy to upper bound $|\tilde S(2z - 1)|$: it follows from simple calculus that $|2z - 1| \le 1 + O(L^{-2})$, and thus for every $\tilde s$:

$\left|\tilde S(2z-1)\right| \le \sum_{k=0}^{n-1}\left|(2z-1)^k\right| \le n \cdot \left[1 + O(L^{-2})\right]^n = e^{O(n/L^2)}$.
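
The “simple calculus” here is just the law of cosines: for $z = e^{i\theta}$ with $|\theta| \le \pi/L$,

$\left|2e^{i\theta} - 1\right|^2 = 5 - 4\cos\theta \le 5 - 4\left(1 - \frac{\theta^2}{2}\right) = 1 + 2\theta^2 \le 1 + \frac{2\pi^2}{L^2},$

so indeed $|2z - 1| \le 1 + O(L^{-2})$.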

To lower bound $|X(z) - Y(z)|$, note that since both $x$ and $y$ are binary, all the coefficients of $X-Y$ are in $\{-1, 0, 1\}$. Such polynomials are known as Littlewood polynomials. The separation condition is then reduced to the following claim in complex analysis:

Littlewood polynomials cannot be too “flat” over a short arc around $1$.

This is indeed the case:

Lemma 1. ([Borwein and Erdélyi, 1997]) For every nonzero Littlewood polynomial $p$,

$\max_{z \in A_L}|p(z)| \ge e^{-O(L)}$.

By Lemma 1, there exists $z \in A_L$ such that the separation condition holds with $\epsilon = e^{-O(L)}$. Recall that boundedness holds with $B = e^{O(n/L^2)}$. This shows that the sample complexity is upper bounded by $(B/\epsilon)^2 = \exp(O(n/L^2 + L))$, which is minimized by choosing $L = n^{1/3}$. This proves the upper bound $M(n) = \exp(O(n^{1/3}))$.

Proof of a weaker lemma. While the original proof of Lemma 1 is a bit technical, [NP17] presented a beautiful and much simpler proof of the following weaker result, in which the $e^{-O(L)}$ lower bound is replaced by $n^{-O(L)}$, where $n$ is the degree of the Littlewood polynomial.

Lemma 2. ([Lemma 3.1, NP17]) For every nonzero Littlewood polynomial $p$ of degree $< n$,

$\max_{z \in A_L}|p(z)| \ge n^{-O(L)}$.

Proof. Without loss of generality, we may assume that the constant term of $p(z)$ is $1$: if the lowest-order term of $p(z)$ is $\pm z^m$ for some $m \ge 1$, we may consider the polynomial $\pm p(z) / z^m$ instead, which is again a Littlewood polynomial and has the same modulus everywhere on the unit circle.

Let $M = \max_{z \in A_L}|p(z)|$ be the maximum modulus of $p$ over arc $A_L$, and $\omega = e^{2\pi i/L}$ be an $L$-th root of unity. Consider the following polynomial:

$q(z) = \prod_{j=0}^{L-1}p(\omega^j z) = p(z) \cdot p(\omega z) \cdot \cdots \cdot p(\omega^{L-1} z)$.

For every $z$ on the unit circle, at least one of the points $\omega^j z$ falls into the arc $A_L$: the $L$ points $z, \omega z, \ldots, \omega^{L-1}z$ are equally spaced on the circle with angular gap $2\pi/L$, which is exactly the width of $A_L$. For that $j$ we have $|p(\omega^j z)| \le M$, while the moduli of the remaining $L - 1$ factors are trivially bounded by $n$ (on the unit circle, $|p(w)|$ is at most the number of nonzero coefficients of $p$). Thus, $|q(z)| \le M\cdot n^{L-1}$.

On the other hand, we note that $q(0) = [p(0)]^L = 1$. By the maximum modulus principle, $\max_{|z|=1}|q(z)| \ge |q(0)| = 1$, i.e., $|q(z)| \ge 1$ for some $z$ on the unit circle. Therefore, we must have $M \cdot n^{L - 1} \ge 1$, which implies $M \ge n^{-(L - 1)} = n^{-O(L)}$ and completes the proof. $\square$
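
Lemma 2 is also easy to sanity-check numerically. The sketch below approximates $\max_{z \in A_L}|p(z)|$ on a grid for random $\pm 1$-coefficient polynomials; the grid size and parameters are arbitrary choices of mine:

```python
import numpy as np

def max_on_arc(coeffs, L, grid=2000):
    """Approximate max |p(z)| over A_L = {e^{i*theta}: |theta| <= pi/L},
    where coeffs[k] is the coefficient of z^k."""
    thetas = np.linspace(-np.pi / L, np.pi / L, grid)
    zs = np.exp(1j * thetas)
    # np.polyval expects coefficients from the highest degree down.
    return np.abs(np.polyval(coeffs[::-1], zs)).max()

rng = np.random.default_rng(0)
n, L = 1000, 10
for _ in range(5):
    p = rng.choice([-1, 1], size=n)  # a random Littlewood polynomial
    print(max_on_arc(p, L))  # empirically far above the bound n**(-O(L))
```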

Beyond linear estimators? The work of [DOS17, NP17] not only proved the $\exp(O(n^{1/3}))$ upper bound, but also showed that this is the best sample complexity we can get from linear estimators $f$. The recent work of [Cha20] goes beyond linear estimators by taking higher moments of the trace into account. The idea turns out to be a natural one: consider the “$k$-grams” (i.e., length-$k$ substrings) of the string for some $k > 1$. In more detail, we fix some string $w \in \{0, 1\}^k$, and consider the binary polynomial whose coefficients indicate the positions at which $w$ appears as a (contiguous) substring. The key observation is that there exists a choice of $w$ for which the resulting polynomial is sparse (in the sense that the degrees of its nonzero monomials are far apart). The technical part of [Cha20] is then devoted to proving an analogue of Lemma 1 specialized to these “sparse” Littlewood polynomials.
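
To make the $k$-gram polynomial concrete, here is one way to write down its coefficient vector (my own illustration of the object, not code from [Cha20]):

```python
def kgram_coefficients(s, w):
    """Coefficient vector of the k-gram polynomial: entry j is 1 iff the
    pattern w occurs as a contiguous substring of s starting at position j."""
    k = len(w)
    return [1 if s[j:j + k] == w else 0 for j in range(len(s) - k + 1)]

print(kgram_coefficients("1101101", "110"))  # [1, 0, 0, 1, 0]
```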

Acknowledgments. I would like to thank Moses Charikar, Li-Yang Tan and Gregory Valiant for being on my quals committee and for helpful discussions about this problem.