# Approximating Edit Distance

I was excited to see “Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time” by Chakraborty, Das, Goldenberg, Koucky, and Saks in this year’s list of FOCS accepted papers. As you can guess from the title, their main result is a constant-factor approximation for edit distance in truly subquadratic time (i.e. , for some constant ). I spent quite a bit of time thinking about this problem in the past couple of years, and wiser people than me have spent much more time for a couple of decades. Congratulations!!

Given strings and of characters each, the textbook dynamic programming algorithm finds their edit distance in time (if you haven’t seen this in your undergrad algorithms class, consider asking your university for a refund on your tuition). Recent complexity breakthroughs [1][2] show that under plausible assumptions like SETH, quadratic time is almost optimal for exact algorithms. This is too bad, because we like to compute edit distance between very long strings, like entire genomes of two organisms (see also this article in Quanta). There is a sequence of near-linear time algorithms with improved approximation factors [1][2][3][4], but until now the state of the art was polylogarithmic; ~~actually for near-linear time, this is still the state of the art:~~

: Is there a constant-factor approximation to edit distance that runs in near-linear time?**Open question 1**

**Update [5/2020]: **Yay! A new near-linear time algorithm by Andoni and Nosatzki!

## Algorithm sketch

Here is a sketch of *an *algorithm. It is somewhat different from the algorithm in the paper because I wrote this post before finding the full paper online.

### Step 0: window-compatible matchings

We partition each string into *windows*, or consecutive substrings of length each. We then restrict our attention to *window-compatible* matchings: that is, instead of looking for the globally optimum way to transform to , we look for a partial matching between the – and -windows, and transform each -window to its matching -windows (unmatched -windows are deleted, and unmatched -windows are inserted). It turns out that restricting to window-compatible matchings is almost without loss of generality.

In order to find the optimum window-compatible matching, we can find the distances between every pair of windows, and then use a (weighted) dynamic program of size . The reason I call it “Step 0” is because so far we made zero progress on running time: we still have to compute the edit distance between pairs, and each computation takes time , so time in total.

Approximating all the pairwise distances reduces to the following problem: given threshold , compute the bipartite graph over the windows, where two windows and share an edge if . In fact it suffices to compute an approximate , where and may share an edge even if their edit distance is a little more than .

**New Goal**: Compute faster than naively computing all pairwise edit distances.

### Step 1: sparsifying

While there are many edges in , say average degree : Draw a random edge , and let be two other neighbors of , respectively. Applying the **triangle inequality** (twice), we have that , so we can immediately add to . In expectation, have neighbors each, so we discovered a total of pairs; of which we expect that roughly correspond to *new* edges in . Repeat at most times until we discovered almost all the edges in . Notice that each iteration took us time (computing all the edit distances from and ); hence in total only . Thus we reduced to the sparse case in truly subquadratic time.

The algorithm up to this point is actually due to a recent paper by Boroujeni et al; for the case when is relatively sparse, they use Grover Search to discover all the remaining edges in quantum subquadratic time. It remains to see how to do it classically…

### Step 2: when is relatively sparse

The main observation we need for this part is that if windows and are close, then in an optimum window-compatible matching they are probably not matched to -windows that are very far apart. And in the rare event that they are matched to far-apart -windows, the cost of inserting so many characters between and outweighs the cost of completely replacing if we had to. So once we have a candidate list of -windows that might match to, it is safe to only search for good matches for around each of those candidates. But when the graph is sparse, we have such a short list: the neighbors of !

We have to be a bit careful: for example, it is possible that is not matched to any of its neighbors in . But if we sample enough ‘s from some interval around , then either (i) at least one of them is matched to a neighbor in ; or (ii) doesn’t contribute much to reducing the edit distance for this interval, so it’s OK to miss some of those edges.

## Improving the approximation factor

On the back of my virtual envelope, I think the above ideas give a -approximation. But as far as genomes go, up to a -approximation, you’re as closely related to your dog as to a banana. So it would be great to improve the approximation factor:

**Open question 2**: Is there a -approximation algorithm for edit distance in truly subquadratic (or near-linear) time?

Note that only the sparsification step loses more than in approximation. Also, none of the existing fine-grained hardness results rule out an -approximation, even in linear time!

A related open problem (that is arguably more relevant to practice than computing the edit distance, see, e.g., [1]) is the approximate nearest neighbor search for the edit distance. The only known way to do it (in the worst case, there are some decent heuristics) is to embed the edit distance into l_1 [2] and use locality-sensitive hashing. This gives approximation 2^{sqrt{log d log log d}}, where d is the length of strings. Can one improve it to poly(log d)? To O(1)?

[1] https://arxiv.org/abs/1702.00093

[2] http://www.cs.technion.ac.il/~rabani/Papers/OstrovskyR-JACM-revised.pdf

LikeLiked by 1 person

Thanks Ilya! Embarrassingly, the state-of-the-art conditional hardness result for approximate nearest neighbor with edit distance metric also goes through (and thus only gives hardness for a tiny -factor.

LikeLiked by 1 person

Indeed, the edit distance contains a copy of l_1, but we currently have *no* evidence that ANN for edit distance is harder than for l_1. As our framework [1] suggests, the heart of the matter is to understand how well shortest-path metrics of expander graphs embed into the edit distance.

More precisely, it would be great to show that the cutting modulus (in the sense of a short document [2]) of {0, 1}^d equipped with the edit distance is poly(log d). I think it should be true.

[1] https://ilyaraz.org/static/papers/spectral_gap.pdf

[2] https://ilyaraz.org/static/misc/cutting_modulus.pdf

LikeLike

Wow! Thanks Aviad for the the nice blog post on this. In hindsight it looks our SODA’18 paper: was so close — missing only that last step.

LikeLiked by 1 person