Approximating Edit Distance

I was excited to see “Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time” by Chakraborty, Das, Goldenberg, Koucky, and Saks in this year’s list of FOCS accepted papers. As you can guess from the title, their main result is a constant-factor approximation for edit distance in truly subquadratic time (i.e. O(n^{2-\delta}), for some constant \delta > 0). I spent quite a bit of time thinking about this problem in the past couple of years, and wiser people than me have spent much more time over the past couple of decades. Congratulations!!

Given strings A and B of n characters each, the textbook dynamic programming algorithm finds their edit distance in time O(n^{2}) (if you haven’t seen this in your undergrad algorithms class, consider asking your university for a refund on your tuition). Recent complexity breakthroughs [1][2] show that, under plausible assumptions like SETH, quadratic time is almost optimal for exact algorithms. This is too bad, because we like to compute edit distance between very long strings, like entire genomes of two organisms (see also this article in Quanta). There is a sequence of near-linear time algorithms with improved approximation factors [1][2][3][4], but until now the best approximation factor was polylogarithmic; in fact, for near-linear time this is still the state of the art:
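For concreteness, here is a minimal sketch of that textbook dynamic program (in Python, with the standard optimization of keeping only two rows of the table):

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook O(|a| * |b|) dynamic program for edit distance
    (insertions, deletions, substitutions), keeping only two rows."""
    m = len(b)
    prev = list(range(m + 1))  # distances from the empty prefix of a to each prefix of b
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * m   # distance from a[:i] to "" is i deletions
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1]  # last characters match: no edit needed
            else:
                # substitution, deletion from a, or insertion into a
                curr[j] = 1 + min(prev[j - 1], prev[j], curr[j - 1])
        prev = curr
    return prev[m]
```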

Open question 1: Is there a constant-factor approximation to edit distance that runs in near-linear time?

Update [5/2020]: Yay! A new near-linear time algorithm by Andoni and Nosatzki!

Algorithm sketch

Here is a sketch of an algorithm. It is somewhat different from the algorithm in the paper because I wrote this post before finding the full paper online.

Step 0: window-compatible matchings

We partition each string into t windows, i.e. consecutive substrings of length d := n/t each. We then restrict our attention to window-compatible matchings: that is, instead of looking for the globally optimum way to transform A to B, we look for a partial matching between the A-windows and B-windows, and transform each A-window to its matching B-window (unmatched A-windows are deleted, and unmatched B-windows are inserted). It turns out that restricting to window-compatible matchings is almost without loss of generality.
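In code, the partition itself is one line (a sketch; it assumes t divides n for simplicity):

```python
def windows(s: str, t: int) -> list[str]:
    """Partition s into t consecutive windows of length d = len(s) // t.
    Assumes t divides len(s), for simplicity."""
    d = len(s) // t
    return [s[k * d:(k + 1) * d] for k in range(t)]
```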

In order to find the optimum window-compatible matching, we can compute the distances between every pair of windows, and then use a (weighted) dynamic program of size t\times t\ll n^{2}. The reason I call it “Step 0” is that so far we have made zero progress on running time: we still have to compute the edit distance between t^{2} pairs of windows, and each computation takes time d^{2}, so t^{2}d^{2}=n^{2} time in total.
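Here is a sketch of that window-level dynamic program, assuming we are handed the t \times t table of (approximate) window distances; charging d per unmatched window is my simplification of the insertion/deletion cost:

```python
def window_matching_cost(dist: list[list[float]], d: int) -> float:
    """Min cost of a window-compatible matching, given dist[i][j] =
    (approximate) edit distance between A-window i and B-window j.
    Unmatched windows pay d (delete or insert all d characters)."""
    t = len(dist)
    INF = float("inf")
    M = [[INF] * (t + 1) for _ in range(t + 1)]
    M[0][0] = 0
    for i in range(t + 1):
        for j in range(t + 1):
            if i > 0:
                M[i][j] = min(M[i][j], M[i - 1][j] + d)  # delete A-window i
            if j > 0:
                M[i][j] = min(M[i][j], M[i][j - 1] + d)  # insert B-window j
            if i > 0 and j > 0:  # match A-window i to B-window j
                M[i][j] = min(M[i][j], M[i - 1][j - 1] + dist[i - 1][j - 1])
    return M[t][t]
```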

Approximating all the pairwise distances reduces to the following problem: given a threshold \Delta, compute the bipartite graph G_{\Delta} over the windows, where two windows A_{i} and B_{j} share an edge if ED(A_{i},B_{j})\leq\Delta. In fact it suffices to compute an approximate \widetilde{G}_{\Delta}, where A_{i} and B_{j} may share an edge even if their edit distance is a little more than \Delta.

New Goal: Compute \widetilde{G}_{\Delta} faster than naively computing all t^{2} pairwise edit distances.
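For reference, the naive construction we are trying to beat looks like this (reusing the edit_distance sketch from above):

```python
def threshold_graph(A_win: list[str], B_win: list[str],
                    delta: float) -> set[tuple[int, int]]:
    """Naive G_delta: all t^2 pairwise edit distance computations,
    t^2 * d^2 = n^2 time in total."""
    return {(i, j)
            for i, a in enumerate(A_win)
            for j, b in enumerate(B_win)
            if edit_distance(a, b) <= delta}
```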

Step 1: sparsifying G_{\Delta}

While there are many edges in G_{\Delta}\setminus\widetilde{G}_{\Delta}, say average degree \geq t^{3/4}: draw a random edge (A_{1},B_{1}), and let B_{j} and A_{i} be any other neighbors of A_{1} and B_{1}, respectively. Applying the triangle inequality (twice), we have that ED(A_{i},B_{j})\leq 3\Delta, so we can immediately add (A_{i},B_{j}) to \widetilde{G}_{\Delta}. In expectation, A_{1} and B_{1} have \geq t^{3/4} neighbors each, so we discover a total of \geq t^{6/4} pairs, of which we expect roughly \geq t^{5/4} to correspond to new edges in \widetilde{G}_{\Delta}. Repeat at most t^{3/4} times until we have discovered almost all the edges in G_{\Delta}. Notice that each iteration takes td^{2} time (computing all the edit distances from A_{1} and B_{1}); hence in total only t^{7/4}d^{2}\ll t^{2}d^{2}=n^{2}. Thus we reduce to the sparse case in truly subquadratic time.
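Here is a rough sketch of one round of this sparsification (the retry budget and the naming are my own; a careful implementation would sample uncovered edges more cleverly):

```python
import random

def sparsify_round(A_win: list[str], B_win: list[str],
                   delta: float, G_tilde: set[tuple[int, int]]) -> None:
    """One round of Step 1: sample an edge (A_1, B_1) of G_delta not yet
    covered by G_tilde, compute all 2t distances from A_1 and B_1
    (time O(t * d^2)), and add every pair (A_i, B_j) with A_i close to
    B_1 and B_j close to A_1; by two applications of the triangle
    inequality, all such pairs have ED(A_i, B_j) <= 3 * delta."""
    t = len(A_win)
    for _ in range(100):  # retry budget (my choice); give up if uncovered edges are rare
        i1, j1 = random.randrange(t), random.randrange(t)
        if (i1, j1) not in G_tilde and edit_distance(A_win[i1], B_win[j1]) <= delta:
            break
    else:
        return  # G_delta \ G_tilde is (probably) already sparse
    close_to_B1 = [i for i in range(t) if edit_distance(A_win[i], B_win[j1]) <= delta]
    close_to_A1 = [j for j in range(t) if edit_distance(A_win[i1], B_win[j]) <= delta]
    for i in close_to_B1:
        for j in close_to_A1:
            G_tilde.add((i, j))  # ED(A_i, B_j) <= 3 * delta
```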

The algorithm up to this point is actually due to a recent paper by Boroujeni et al.; for the case when G_{\Delta} is relatively sparse, they use Grover search to discover all the remaining edges in quantum subquadratic time. It remains to see how to do it classically…

Step 2: when G_{\Delta} is relatively sparse

The main observation we need for this part is that if windows A_{i} and A_{j} are close, then in an optimum window-compatible matching they are probably not matched to B-windows that are very far apart. And in the rare event that they are matched to far-apart B-windows, the cost of inserting so many characters between A_{i} and A_{j} outweighs the cost of completely replacing A_{j}, if we had to. So once we have a candidate list of B-windows that A_{i} might match to, it is safe to search for good matches for A_{j} only around each of those candidates. But when the graph G_{\Delta} is sparse, we have exactly such a short list: the neighbors of A_{i}!

We have to be a bit careful: for example, it is possible that A_{i} is not matched to any of its neighbors in G_{\Delta}. But if we sample enough A_{i}'s from some interval around A_{j}, then either (i) at least one of them is matched to a neighbor in G_{\Delta}; or (ii) G_{\Delta} doesn’t contribute much to reducing the edit distance for this interval, so it’s OK to miss some of those edges.
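To make the candidate idea slightly more concrete, here is a hedged sketch of the candidate-generation step; the dictionary input, the names, and the radius parameter are my simplifications, not the paper's:

```python
def candidate_B_windows(sampled_neighbors: dict[int, list[int]],
                        t: int, radius: int) -> set[int]:
    """Given, for each sampled A-window i near A_j, its list of
    G_delta-neighbors among the B-windows, return all B-window indices
    within `radius` of any such neighbor: when matching A_j, it is
    (approximately) safe to search only these candidates."""
    cands: set[int] = set()
    for nbrs in sampled_neighbors.values():
        for b in nbrs:
            cands.update(range(max(0, b - radius), min(t, b + radius + 1)))
    return cands
```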

Improving the approximation factor

On the back of my virtual envelope, I think the above ideas give a (3+\varepsilon)-approximation. But as far as genomes go, up to a (3+\varepsilon)-approximation, you’re as closely related to your dog as to a banana. So it would be great to improve the approximation factor:

Open question 2: Is there a (1+\varepsilon)-approximation algorithm for edit distance in truly subquadratic (or near-linear) time?

Note that only the sparsification step loses more than (1+\varepsilon) in approximation. Also, none of the existing fine-grained hardness results rules out a (1+\varepsilon)-approximation, even in linear time!

Comments

  1. A related open problem (that is arguably more relevant to practice than computing the edit distance; see, e.g., [1]) is approximate nearest neighbor search under edit distance. The only known way to do it (in the worst case; there are some decent heuristics) is to embed edit distance into \ell_1 [2] and use locality-sensitive hashing. This gives approximation 2^{\sqrt{\log d\log\log d}}, where d is the length of the strings. Can one improve it to poly(\log d)? To O(1)?

    [1] https://arxiv.org/abs/1702.00093
    [2] http://www.cs.technion.ac.il/~rabani/Papers/OstrovskyR-JACM-revised.pdf

  2. aviad.rubinstein // July 20, 2018 at 11:56 am

    Thanks Ilya! Embarrassingly, the state-of-the-art conditional hardness result for approximate nearest neighbor under the edit distance metric also goes through \ell_1 (and thus only gives hardness for a tiny (1+\varepsilon)-factor).

  3. Mohammad Hajiaghayi // July 20, 2018 at 12:56 pm

    Wow! Thanks Aviad for the nice blog post on this. In hindsight it looks like our SODA’18 paper was so close, missing only that last step.
