Edit Distance: Recovering Alignments using Black-Box Estimation Algorithms

By Moses Charikar, Ofir Geri, Michael P. Kim, and William Kuszmaul.

The edit distance between two strings x and y is the minimum number of insertion, deletion, or substitution operations required to transform x into y. For example, the edit distance between “track” and “trek” is 2 (remove ‘a’ or ‘c’ and perform one substitution). One of the most important applications of edit distance is in computational biology, as a tool to determine how similar two genetic sequences are.

Computing edit distance is a textbook application of dynamic programming and can be performed in time O(n^2) for strings of length n. The dynamic programming algorithm can be modified to output not just the edit distance itself, but also a corresponding sequence of edits. The quadratic runtime of the algorithm is prohibitively large for massive datasets (e.g., genomic data), and recent conditional lower bounds (by Backurs and Indyk and by Abboud, Hansen, Vassilevska Williams, and Williams) suggest that no algorithm with run time O(n^{2-\varepsilon}) exists.

The challenge of efficiently computing edit distance has motivated the development of approximation algorithms which run in close to linear time. The state-of-the-art approximation algorithm, due to Andoni, Krauthgamer, and Onak, estimates edit distance within a factor of (\log{n})^{O(1/\varepsilon)} and runs in time O(n^{1+\epsilon}). While these algorithms produce estimates of the edit distance, the equally important question of actually producing an alignment (i.e., the sequence of edits) has received far less attention. To the best of our knowledge, the current best approximation algorithm for producing an alignment has a polynomial multiplicative error.

The gap between algorithms for estimating edit distance and producing an alignment raises a natural question: Can the current estimation algorithms be used to produce an alignment, or are different techniques needed? In our recent work, we show that any estimation algorithm can be used in a black-box fashion to recover an alignment with modest loss in run time and approximation ratio. Informally, our result takes an estimation algorithm with approximation ratio \gamma(n) that runs in time T(n), and produces an algorithm which recovers a \gamma(n)^{O(\frac{1}{\epsilon})}-approximate alignment in time \tilde{O}\left(T(n) \cdot \frac{1}{\epsilon} \cdot n^{\epsilon}\right). Plugging in the result of Andoni, Krauthgamer, and Onak, we can recover a (\log{n})^{O(1/\varepsilon^2)}-approximate alignment in time O(n^{1+\epsilon}).

A high-level description of the algorithm is as follows. Let \mathcal{A} be an algorithm for edit distance estimation. Given strings x and y, we break x into m equal parts (for some m). We then examine at a low granularity the options for splitting y into m parts. Using \mathcal{A}, we approximate the distance between the i-th part of x and each of the options for the i-th part of y. Then, we use dynamic programming to find a partition of y into m parts which approximately minimizes the sum of the edit distances between parts of x and y. Finally, we recurse on each of the individual parts. The main ingredient in the analysis is showing that if we consider only a small number of options for each part in y‘s partition, we can still guarantee the existence of a partition which closely approximates the original edit distance between x and y.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: