Approximate String Matching using Within-Word Parallelism

Software Practice and Experience}, 24(4), 337-362 (April 1994)

Alden H. Wright
Computer Science
University of Montana
Missoula, MT 59812
January 8, 1993
wright@cs.umt.edu


SUMMARY:

Given a text string, a pattern string, and an integer k, the problem of approximate string matching with k differences is to find all substrings of the text string whose edit distance from the pattern string is less than k. The edit distance between two string is defined as the minimum number of differences, where a difference can be a substitution, insertion, or deletion of a single character. An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod-4 integers into a computer word. Thus, it is a parallelization of the conventional implementation that runs on ordinary processors. Since a small alphabet means that characters have short binary codes, the degree of parallelism is greatest for small alphabets and for processors with long words. For an alphabet of size 8 or smaller and a 64 bit processor, a 21-fold parallelism over the conventional algorithm can be obtained. Empirical comparisons to the basic dynamic programming algorithm, to a version of Ukkonen's algorithm, and to the algorithm of Galil and Park, are given.
KEY WORDS Approximate string matching
Approximate string searching
Edit distance
Within-word parallelism
Parallel algorithm
Parallel computation