I was excited to see “Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time” by Chakraborty, Das, Goldenberg, Koucky, and Saks in this year’s list of FOCS accepted papers. As you can guess from the title, their main result is a constant-factor approximation for edit distance in truly subquadratic time (i.e. $O(n^{2-\delta})$, for some constant $\delta$). I spent quite a bit of time thinking about this problem in the past couple of years, and wiser people than me have spent much more time for a couple of decades. Congratulations!!

Given strings $A$ and $B$ of $n$ characters each, the textbook dynamic programming algorithm finds their edit distance in time $O(n^{2})$ (if you haven’t seen this in your undergrad algorithms class, consider asking your university for a refund on your tuition). Recent complexity breakthroughs [1][2] show that under plausible assumptions like SETH, quadratic time is almost optimal for exact algorithms. This is too bad, because we like to compute edit distance between very long strings, like entire genomes of two organisms (see also this article in Quanta). There is a sequence of near-linear time algorithms with improved approximation factors [1][2][3][4], but until now the state of the art was polylogarithmic; actually for near-linear time, this is still the state of the art:

Open question 1: Is there a constant-factor approximation to edit distance that runs in near-linear time?

Update [5/2020]: Yay! A new near-linear time algorithm by Andoni and Nosatzki!

## Algorithm sketch

Here is a sketch of an algorithm. It is somewhat different from the algorithm in the paper because I wrote this post before finding the full paper online.

### Step 0: window-compatible matchings

We partition each string into $t$ windows, or consecutive substrings of length $d:=n/t$ each. We then restrict our attention to window-compatible matchings: that is, instead of looking for the globally optimum way to transform $A$ to $B$, we look for a partial matching between the $A$– and $B$-windows, and transform each $A$-window to its matching $B$-windows (unmatched $A$-windows are deleted, and unmatched $B$-windows are inserted). It turns out that restricting to window-compatible matchings is almost without loss of generality.

In order to find the optimum window-compatible matching, we can find the distances between every pair of windows, and then use a (weighted) dynamic program of size $t\times t\ll n^{2}$. The reason I call it “Step 0” is because so far we made zero progress on running time: we still have to compute the edit distance between $t^{2}$ pairs, and each computation takes time $d^{2}$, so $t^{2}d^{2}=n^{2}$ time in total.

Approximating all the pairwise distances reduces to the following problem: given threshold $\Delta$, compute the bipartite graph $G_{\Delta}$ over the windows, where two windows $A_{i}$ and $B_{j}$ share an edge if $ED (A_i, B_j) \leq \Delta$. In fact it suffices to compute an approximate $\widetilde{G}_{\Delta}$, where $A_{i}$ and $B_{j}$ may share an edge even if their edit distance is a little more than $\Delta$.

New Goal: Compute $\widetilde{G}_{\Delta}$ faster than naively computing all $t^{2}$ pairwise edit distances.

### Step 1: sparsifying $G_{\Delta}$

While there are many edges in $G_{\Delta}\setminus\widetilde{G}_{\Delta}$, say average degree $\geq t^{3/4}$: Draw a random edge $(A_{1},B_{1})$, and let $B_{j}, A_i$ be two other neighbors of $A_{1},B_{1}$, respectively. Applying the triangle inequality (twice), we have that $ED(A_{i},B_{j})\leq3\Delta$, so we can immediately add $(A_{i},B_{j})$ to $\widetilde{G}_{\Delta}$. In expectation, $A_{1},B_{1}$ have $\geq t^{3/4}$ neighbors each, so we discovered a total of $\geq t^{6/4}$ pairs; of which we expect that roughly $\geq t^{5/4}$ correspond to new edges in $\widetilde{G}_{\Delta}$. Repeat at most $t^{3/4}$ times until we discovered almost all the edges in $G_{\Delta}$. Notice that each iteration took us $td^{2}$ time (computing all the edit distances from $A_{1}$ and $B_{1}$); hence in total only $t^{7/4}d^{2}\ll t^{2}d^{2}=n^{2}$. Thus we reduced to the sparse case in truly subquadratic time.

The algorithm up to this point is actually due to a recent paper by Boroujeni et al; for the case when $G_{\Delta}$ is relatively sparse, they use Grover Search to discover all the remaining edges in quantum subquadratic time. It remains to see how to do it classically…

### Step 2: when $G_{\Delta}$ is relatively sparse

The main observation we need for this part is that if windows $A_{i}$ and $A_{j}$ are close, then in an optimum window-compatible matching they are probably not matched to $B$-windows that are very far apart. And in the rare event that they are matched to far-apart $B$-windows, the cost of inserting so many characters between $A_{i}$ and $A_{j}$ outweighs the cost of completely replacing $A_{j}$ if we had to. So once we have a candidate list of $B$-windows that $A_{i}$ might match to, it is safe to only search for good matches for $A_{j}$ around each of those candidates. But when the graph $G_{\Delta}$ is sparse, we have such a short list: the neighbors of $A_{i}$!

We have to be a bit careful: for example, it is possible that $A_{i}$ is not matched to any of its neighbors in $G_{\Delta}$. But if we sample enough $A_{i}$‘s from some interval around $A_{j}$, then either (i) at least one of them is matched to a neighbor in $G_{\Delta}$; or (ii) $G_{\Delta}$ doesn’t contribute much to reducing the edit distance for this interval, so it’s OK to miss some of those edges.

## Improving the approximation factor

On the back of my virtual envelope, I think the above ideas give a $(3+\varepsilon)$-approximation. But as far as genomes go, up to a $(3+\varepsilon)$-approximation, you’re as closely related to your dog as to a banana. So it would be great to improve the approximation factor:

Open question 2: Is there a $(1+\varepsilon)$-approximation algorithm for edit distance in truly subquadratic (or near-linear) time?

Note that only the sparsification step loses more than $(1+\varepsilon)$ in approximation. Also, none of the existing fine-grained hardness results rule out an $(1+\varepsilon)$-approximation, even in linear time!

#### 4 Comments on Approximating Edit Distance

1. A related open problem (that is arguably more relevant to practice than computing the edit distance, see, e.g., [1]) is the approximate nearest neighbor search for the edit distance. The only known way to do it (in the worst case, there are some decent heuristics) is to embed the edit distance into l_1 [2] and use locality-sensitive hashing. This gives approximation 2^{sqrt{log d log log d}}, where d is the length of strings. Can one improve it to poly(log d)? To O(1)?

Liked by 1 person

Thanks Ilya! Embarrassingly, the state-of-the-art conditional hardness result for approximate nearest neighbor with edit distance metric also goes through $l_1$ (and thus only gives hardness for a tiny $(1+\varepsilon)$-factor.

Liked by 1 person

• Indeed, the edit distance contains a copy of l_1, but we currently have *no* evidence that ANN for edit distance is harder than for l_1. As our framework [1] suggests, the heart of the matter is to understand how well shortest-path metrics of expander graphs embed into the edit distance.

More precisely, it would be great to show that the cutting modulus (in the sense of a short document [2]) of {0, 1}^d equipped with the edit distance is poly(log d). I think it should be true.

Like

3. Mohammad Hajiaghayi // July 20, 2018 at 12:56 pm // Reply

Wow! Thanks Aviad for the the nice blog post on this. In hindsight it looks our SODA’18 paper: was so close — missing only that last step.

Liked by 1 person