Sunday, May 17, 2009

Levenshtein Distance Algorithm: C Implementation

I needed to implement this algorithm for the new geo-coding engine I an working on. I searched the net and got a few decent implementations. I also looked at the Java implementation from the Apache Lucene library (http://lucene.apache.org). However, none satisfied the performance I was looking for. So Working on a few of them, I have written the following piece of code.

Basically it finds the distance between two strings; that is the number of edits (inserts/deletes/updates) it takes to convert one string to the other. For more details, visit http://en.wikipedia.org/wiki/Levenshtein_distance.

Since I was writing the code on Microsoft Visual Studio, I thought I might use the new operator for memory allocation. This was because Stroustrup says that the new operator does what malloc does, but better. It didn't work for my trial though. The execution time increased by two to three times when I did this! So I stuck to the basic C code and got reasonable results. It could match 1,000,000 pairs of about 50 characters each in 20 seconds on my machine (with OX flag on VC++).

Here's the code...


#define MIN_TWO(X, Y) ((X) < (Y) ? (X) : (Y))
#define MIN_THREE(X,Y,Z) ((MIN_TWO(MIN_TWO(X,Y),Z)))

int levenshtein_distance(char *s,int sLen, char*t, int tLen, float fuzzy = 0.5f);

/****************************************/
/*Implementation of Levenshtein distance*/
/****************************************/

/*Compute levenshtein distance between s and t*/
int levenshtein_distance(char *s, int sLen,char*t, int tLen,float fuzzy) {
//Step 1
int k,i,j,n,m,cost,*d,distance;
/* Maximum distance to allow a better best case. Can be changed when
I think of something better (Or re-read the lucene source :) )
*/
int maxDist = (int)(sLen *fuzzy), temp,p1,p2,p3;
maxDist = (maxDist<1)?1:maxDist;
n = sLen;
m = tLen;
if(n!=0&&m!=0) {
d=(int*)malloc((sizeof(int))*(m+1)*(n+1));
m++;
n++;
//Step 2
for(k=0;k<n;k++)
d[k]=k;
for(k=0;k<m;k++)
d[k*n]=k;
//Step 3 and 4
for(i=1;i<n;i++) {
for(j=1;j<m;j++) {
//Step 5
cost = (s[i-1]==t[j-1])?0:1;
temp = j*n;
p1 = d[temp-n+i]+1;
p2 = d[temp+i-1]+1;
p3 = d[temp-n+i-1]+cost;
d[temp+i] = MIN_THREE(p1,p2,p3);
}
if(d[i*n+i]>maxDist) {
//distance=sLen;
free(d);
return sLen;
}
}
distance=d[n*m-1];
free(d);
return distance;
}
else
return -1; //a negative return value means that one or both strings are empty.
}

No comments: