Wednesday, June 3, 2009

Research is on!

It's been a while since I last posted anything. I got busy with my research, the one I've been meaning to start for some time now. I've finally started work on my address matching engine. I'm using an n-gram model for address recognition. One stage of the implementation is already done. I've also written a small spell checker, borrowing the concept from Xapian: a tri-gram index filters the candidate matches, and I then apply the Edit (Levenshtein) Distance to the filtered results.
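To make the spell checker idea concrete, here is a minimal sketch of the two steps: a tri-gram index narrows down the candidates, and the Levenshtein distance picks the closest one. This is not the engine's actual code, nor Xapian's. The names (trigrams, editDistance, SpellChecker), the '$' padding, and the minShared threshold are all assumptions made for illustration, and the index lives in memory rather than in QDBM.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Split a word into overlapping character tri-grams.
// The "$$" padding (an assumption) lets word boundaries contribute grams too.
std::set<std::string> trigrams(const std::string& word) {
    const std::string padded = "$$" + word + "$$";
    std::set<std::string> grams;
    for (std::size_t i = 0; i + 3 <= padded.size(); ++i)
        grams.insert(padded.substr(i, 3));
    return grams;
}

// Classic Levenshtein distance using two rolling rows of the DP table.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            const std::size_t indel = std::min(prev[j], cur[j - 1]) + 1;
            cur[j] = std::min(subst, indel);
        }
        prev.swap(cur);
    }
    return prev[b.size()];
}

// Hypothetical spell checker: tri-gram filter first, edit distance second.
class SpellChecker {
public:
    // Register a dictionary word under each of its tri-grams.
    void add(const std::string& word) {
        for (const std::string& gram : trigrams(word))
            index_[gram].insert(word);
    }

    // Return the closest dictionary word, or the input itself when no
    // word shares at least minShared tri-grams with it.
    std::string correct(const std::string& input, std::size_t minShared = 2) const {
        // Step 1: count shared tri-grams per dictionary word.
        std::map<std::string, std::size_t> shared;
        for (const std::string& gram : trigrams(input)) {
            const auto it = index_.find(gram);
            if (it == index_.end()) continue;
            for (const std::string& word : it->second) ++shared[word];
        }
        // Step 2: run the edit distance only on the filtered candidates.
        std::string best = input;
        std::size_t bestDist = static_cast<std::size_t>(-1);
        for (const auto& entry : shared) {
            if (entry.second < minShared) continue;
            const std::size_t d = editDistance(input, entry.first);
            if (d < bestDist) {
                bestDist = d;
                best = entry.first;
            }
        }
        return best;
    }

private:
    std::map<std::string, std::set<std::string> > index_;
};
```

With "street", "strand" and "avenue" in the dictionary, correct("stret") should come back as "street". The point of the filter is that the relatively expensive edit distance only runs on words that already share a few tri-grams with the input, not on the whole dictionary.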

Once the corrected keywords are obtained, I run the address string through the n-gram matcher. I am using QDBM as the data store, along with the Boost C++ libraries. Even though the engine is at an extremely preliminary stage, my boss forced me to run a match on some address strings sent by a client. Even with a limited data set and an incomplete engine, I am getting a match rate in excess of 40% at the block level. Of the 50-52% not matched, about 40% of the addresses were too ambiguous. I have not yet implemented input filtering and aliasing, so my estimate is that once those are done, the match rate should reach at least 75%.

I've also started working on a paper in which I intend to publish this work. I hope it works out and serves my ulterior motive :)
