So after a lot of labor, I have managed to churn out the first release of my address matching engine. I still have a couple of modules to add, but there's always pressure from the business side to release long before I am comfortable. So I put together all that I could as quickly as possible and released it. I'm getting a match rate of about 60-70% at block (sub-locality) level for completely free-form addresses from live data (some of which were extremely absurd).
I must say I impressed myself! Unfortunately for me, before I can perfect something, or even finish what I'm doing, something of 'business importance' comes up and I have to move on. One thing I'm sure of, though, is that I'm definitely going to finish this paper and publish it in the next 2 months. Cheers to me!
Saturday, June 20, 2009
Wednesday, June 3, 2009
Research is on!
It's been some time since I posted anything. I got busy with my research, the same research I've been meaning to start for a while now. I've finally started work on my address matching engine. I'm using the n-gram model for address recognition. One stage of implementation is already over. I've also written a small spell checker, borrowing the idea from Xapian: a tri-gram index filters the candidate matches, and then I apply the Edit (Levenshtein) Distance to the filtered results.
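For the curious, the filter-then-rank idea behind the spell checker can be sketched roughly like this. This is only a minimal illustration, not the engine's actual code: the `SpellChecker` class, the `trigrams` and `levenshtein` helpers, the `$` padding, and the `min_shared` and `max_distance` thresholds are all assumptions made up for the example, and it uses plain standard-library containers rather than QDBM, Boost, or Xapian's own API.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Split a word into overlapping character tri-grams. The word is padded with
// '$' (an arbitrary choice for this sketch) so short words still yield grams.
std::set<std::string> trigrams(const std::string& word) {
    const std::string padded = "$$" + word + "$";
    std::set<std::string> grams;
    for (std::size_t i = 0; i + 3 <= padded.size(); ++i)
        grams.insert(padded.substr(i, 3));
    return grams;
}

// Levenshtein edit distance, computed with two rolling rows.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Hypothetical in-memory spell checker: tri-gram index first, edit distance second.
class SpellChecker {
public:
    void add(const std::string& word) {
        const std::size_t id = words_.size();
        words_.push_back(word);
        for (const std::string& g : trigrams(word)) index_[g].push_back(id);
    }

    // Returns the closest dictionary word, or the input unchanged when no
    // candidate shares enough tri-grams or lies within max_distance edits.
    std::string correct(const std::string& input,
                        std::size_t min_shared = 2,
                        std::size_t max_distance = 2) const {
        // Step 1: count shared tri-grams per dictionary word (the cheap filter).
        std::map<std::size_t, std::size_t> shared;
        for (const std::string& g : trigrams(input)) {
            const auto it = index_.find(g);
            if (it == index_.end()) continue;
            for (std::size_t id : it->second) ++shared[id];
        }
        // Step 2: run the costlier edit distance only on surviving candidates.
        std::string best = input;
        std::size_t best_distance = max_distance + 1;
        for (const auto& entry : shared) {
            if (entry.second < min_shared) continue;
            const std::size_t d = levenshtein(input, words_[entry.first]);
            if (d < best_distance) {
                best_distance = d;
                best = words_[entry.first];
            }
        }
        return best;
    }

private:
    std::vector<std::string> words_;
    std::map<std::string, std::vector<std::size_t>> index_;
};
```

With a toy dictionary containing "sector" and "street", calling `correct("sectr")` only computes the edit distance against "sector", because "street" shares just one tri-gram with the input and is dropped by the filter. The point of the index is exactly that: the expensive comparison runs on a handful of candidates instead of the whole dictionary.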
Once the corrected keywords are obtained, I run the address string through the n-gram matcher. I am using QDBM as the data store, along with the Boost C++ libraries. Even though the engine is at an extremely preliminary stage, my boss forced me to run a match on some address strings sent by a client. Even with a limited data set and an incomplete engine, I am getting a match rate in excess of 40% at the block level. Of the 50-52% not matched, about 40% of the addresses were too ambiguous. I have not implemented input filtering and aliasing yet, so my estimate is that once those are done, the match rate should reach at least 75%.
I've also started working on my paper in which I intend to publish this work. Hope it works out and serves my ulterior motive :)