I was excited to see “Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time” by Chakraborty, Das, Goldenberg, Koucky, and Saks in this year’s list of FOCS accepted papers. As you can guess from the title, their main result is a constant-factor approximation for edit distance in truly subquadratic time (i.e. , for some constant ). I spent quite a bit of time thinking about this problem in the past couple of years, and wiser people than me have spent much more time for a couple of decades. Congratulations!!
Given strings and of characters each, the textbook dynamic programming algorithm finds their edit distance in time (if you haven’t seen this in your undergrad algorithms class, consider asking your university for a refund on your tuition). Recent complexity breakthroughs  show that under plausible assumptions like SETH, quadratic time is almost optimal for exact algorithms. This is too bad, because we like to compute edit distance between very long strings, like entire genomes of two organisms (see also this article in Quanta). There is a sequence of near-linear time algorithms with improved approximation factors , but until now the state of the art was polylogarithmic; actually for near-linear time, this is still the state of the art:
Open question 1: Is there a constant-factor approximation to edit distance that runs in near-linear time?
Here is a sketch of an algorithm. It is somewhat different from the algorithm in the paper because I wrote this post before finding the full paper online.
Step 0: window-compatible matchings
We partition each string into windows, or consecutive substrings of length each. We then restrict our attention to window-compatible matchings: that is, instead of looking for the globally optimum way to transform to , we look for a partial matching between the – and -windows, and transform each -window to its matching -windows (unmatched -windows are deleted, and unmatched -windows are inserted). It turns out that restricting to window-compatible matchings is almost without loss of generality.
In order to find the optimum window-compatible matching, we can find the distances between every pair of windows, and then use a (weighted) dynamic program of size . The reason I call it “Step 0” is because so far we made zero progress on running time: we still have to compute the edit distance between pairs, and each computation takes time , so time in total.
Approximating all the pairwise distances reduces to the following problem: given threshold , compute the bipartite graph over the windows, where two windows and share an edge if . In fact it suffices to compute an approximate , where and may share an edge even if their edit distance is a little more than .
New Goal: Compute faster than naively computing all pairwise edit distances.
Step 1: sparsifying
While there are many edges in , say average degree : Draw a random edge , and let be two other neighbors of , respectively. Applying the triangle inequality (twice), we have that , so we can immediately add to . In expectation, have neighbors each, so we discovered a total of pairs; of which we expect that roughly correspond to new edges in . Repeat at most times until we discovered almost all the edges in . Notice that each iteration took us time (computing all the edit distances from and ); hence in total only . Thus we reduced to the sparse case in truly subquadratic time.
The algorithm up to this point is actually due to a recent paper by Boroujeni et al; for the case when is relatively sparse, they use Grover Search to discover all the remaining edges in quantum subquadratic time. It remains to see how to do it classically…
Step 2: when is relatively sparse
The main observation we need for this part is that if windows and are close, then in an optimum window-compatible matching they are probably not matched to -windows that are very far apart. And in the rare event that they are matched to far-apart -windows, the cost of inserting so many characters between and outweighs the cost of completely replacing if we had to. So once we have a candidate list of -windows that might match to, it is safe to only search for good matches for around each of those candidates. But when the graph is sparse, we have such a short list: the neighbors of !
We have to be a bit careful: for example, it is possible that is not matched to any of its neighbors in . But if we sample enough ‘s from some interval around , then either (i) at least one of them is matched to a neighbor in ; or (ii) doesn’t contribute much to reducing the edit distance for this interval, so it’s OK to miss some of those edges.
Improving the approximation factor
On the back of my virtual envelope, I think the above ideas give a -approximation. But as far as genomes go, up to a -approximation, you’re as closely related to your dog as to a banana. So it would be great to improve the approximation factor:
Open question 2: Is there a -approximation algorithm for edit distance in truly subquadratic (or near-linear) time?
Note that only the sparsification step loses more than in approximation. Also, none of the existing fine-grained hardness results rule out an -approximation, even in linear time!