Sunday, October 31, 2010

Long time no see....

Its been a long time I haven't written anything. Not that I didn't feel like, but just didn't. My primary inspiration was away from me all this time and now that she's back... so am I!

Gini had been in Calcutta for the past few months and looks like Calcutta got to her! Its quite an effort getting her up and going for a run now, besides her extra flab :). For some reason Calcutta does make you lazy. Everything will happen at its own slow pace and patience becomes the way of life. I enjoy it for sometime, but then I get restless. I feel this big void in myself... I'm doing nothing! Have I become a rolling cog in the corporate machinery who likes the regular maintenance breaks but can't live without the machine!

On the other hand, a city like Bombay.. er Mumbai scares me! Everyone walks around with a purpose and they're inadvertently short on time. I guess thats what balances life!

Friday, October 2, 2009

Bangalored!

So a lot has happened over the past couple of months. I quit MapmyIndia and moved to Bangalore. I decided to move away from GIS and get back to conventional IT work. So I've joined Akamai. Its a very different and nice feeling coming from a small company to an established mid-sized company. There are trade-offs, but as long as the net is positive, I think its worth it. So here I am!

I'm liking Bangalore; well most of it. I love the weather, though my wife isn't too fond of it. I love the fact that there are so many weekend getaways all around. I hate the fact that the city shuts down at 10 and is practically dead at 11! I hate the roads and the amazing one ways set up by the traffic police. There is not a single place I can go to and come back from taking the same route!

The people are generally nice here. I managed to get a nice apartment walking distance from my office. Gini is having a ball. She gets to run around a much bigger house and go for nice walks at least three times a day. She's even got her courage back! She isn't scared of street dogs anymore!

We went to Goa over the last weekend... But lets save that for the next post :-)

Saturday, June 20, 2009

Build 0.1 :)

So after a lot of labor, I have managed to churn out the first release of my address matching engine. I still have a couple of modules to add, but there's always pressure from the business side to release much before I am comfortable. So I put together all that I could asap and released it. I'm getting a match of about 60-70% at block (sub-locality) level for totally free running addresses from live data (some of which were extremely absurd).

I must say I impressed myself! Unfortunately for me, before I perfect something, or finish what I'm doing, something of 'business importance' comes up and I have to move on. One thing I'm sure of though, is that I'm definitely going to finish this paper and publish it in the next 2 months. Cheers to me!

Wednesday, June 3, 2009

Research is on!

It's been some time that I haven't posted anything. I got busy with my research, the same I've been meaning to commence for some time now. I've finally started work on my address matching engine. I'm using the n-gram model for address recognition. One stage of implementation is already over. I've also written a small spell checker. I used the concept of Xapian here. I have a tri-gram index for filtering the matches. After this I am using the Edit (Levenshtein) Distance on the filtered results.

Once the corrected keywords are obtained, I run the address string through the n-gram matcher. I am using QDBM as the data store and Boost.org C++ libraries. Even though it is in an extremely preliminary stage, my boss forced me to run a match on some address strings sent by a client. Even on a limited data set and an incomplete engine, I am getting a match rate excess of 40% at the block level. Of the 50-52% not matched, about 40% addresses were too ambiguous. I have not implemented features for input filtering and aliasing as yet. So my estimate is that once that is done, the match rate should reach at-least 75%.

I've also started working on my paper in which I intend to publish this work. Hope it works out and serves my ulterior motive :)

Sunday, May 17, 2009

Levenshtein Distance Algorithm: C Implementation

I needed to implement this algorithm for the new geo-coding engine I an working on. I searched the net and got a few decent implementations. I also looked at the Java implementation from the Apache Lucene library (http://lucene.apache.org). However, none satisfied the performance I was looking for. So Working on a few of them, I have written the following piece of code.

Basically it finds the distance between two strings; that is the number of edits (inserts/deletes/updates) it takes to convert one string to the other. For more details, visit http://en.wikipedia.org/wiki/Levenshtein_distance.

Since I was writing the code on Microsoft Visual Studio, I thought I might use the new operator for memory allocation. This was because Stroustrup says that the new operator does what malloc does, but better. It didn't work for my trial though. The execution time increased by two to three times when I did this! So I stuck to the basic C code and got reasonable results. It could match 1,000,000 pairs of about 50 characters each in 20 seconds on my machine (with OX flag on VC++).

Here's the code...


#define MIN_TWO(X, Y) ((X) < (Y) ? (X) : (Y))
#define MIN_THREE(X,Y,Z) ((MIN_TWO(MIN_TWO(X,Y),Z)))

int levenshtein_distance(char *s,int sLen, char*t, int tLen, float fuzzy = 0.5f);

/****************************************/
/*Implementation of Levenshtein distance*/
/****************************************/

/*Compute levenshtein distance between s and t*/
int levenshtein_distance(char *s, int sLen,char*t, int tLen,float fuzzy) {
//Step 1
int k,i,j,n,m,cost,*d,distance;
/* Maximum distance to allow a better best case. Can be changed when
I think of something better (Or re-read the lucene source :) )
*/
int maxDist = (int)(sLen *fuzzy), temp,p1,p2,p3;
maxDist = (maxDist<1)?1:maxDist;
n = sLen;
m = tLen;
if(n!=0&&m!=0) {
d=(int*)malloc((sizeof(int))*(m+1)*(n+1));
m++;
n++;
//Step 2
for(k=0;k<n;k++)
d[k]=k;
for(k=0;k<m;k++)
d[k*n]=k;
//Step 3 and 4
for(i=1;i<n;i++) {
for(j=1;j<m;j++) {
//Step 5
cost = (s[i-1]==t[j-1])?0:1;
temp = j*n;
p1 = d[temp-n+i]+1;
p2 = d[temp+i-1]+1;
p3 = d[temp-n+i-1]+cost;
d[temp+i] = MIN_THREE(p1,p2,p3);
}
if(d[i*n+i]>maxDist) {
//distance=sLen;
free(d);
return sLen;
}
}
distance=d[n*m-1];
free(d);
return distance;
}
else
return -1; //a negative return value means that one or both strings are empty.
}

Saturday, April 25, 2009

To Heaven And Back



Got some free time on a long weekend. Was thinking what the wife and I could do. Anyways, I met Ashit (my childhood best friend) in the evening and he told me that he and his friends from office were going to a place called the Himalayan Trout House (HTH), in Tirthan Valley, Himachal Pradesh, near Kullu. It is a little over 500 Km from Delhi. So it was decided that the we would drive down there on Saturday early in the morning.

There was one little snag though, we can't leave Ginnie, our one year old Labrador behind. So I talked to Christopher, the person running the HTH, and asked if I could bring her along. Thankfully he said OK, though he warned me of his three dogs (much needed). So the deal was sealed and we left Delhi at about 5 in the morning. Definitely a good decision as we crossed the Delhi border in just half an hour.

This was my first 200+ Km drive in my Aveo Uva and it proved to be worth every penny I spent on the car. Our first stop was just outside Chandigarh (which we reached in less than 3 hours). We stopped at a Dhaba for tea and Ginnie enjoyed running around and getting herself dirty. The dhaba owner was friendly towards her, so we just let her be.

With a couple of more stops (mainly for Ginnie) we continued. We made the mistake of taking NH 21A instead of NH 21 to Bilaspur. Even though it is a National Highway, it is in a very bad state. Plus, it adds at least 50 Km of extra hill way. Anyways, we continued and finally reached our destination comfortably, thanks to the guidance of MapmyIndia website :), and a little help from Ashit once we were on the state highway.

We were greeted by Thumki (an eight year old Lhasa) and Ready (a two year old mountain dog) pouncing on Ginnie. Thankfully Krishwa (the four year old alpha male) was tied up. However, the two later became good friends with Ginnie.

The first day we just lazed around, and yeah, had the most delicious trouts ever! In spite of my wife being a Bengali, I have not been able to get myself to appreciate fish. The heavenly trout changed everything (I even had Macher-Jhol-Bhat with Aloo and all a couple of days back)! The next day we went trekking to a waterfall. Nice short trek and a beautiful waterfall. We were the only ones there, along with Thumki and our guide, Deepak. Ginnie had a blast jumping into the water and splashing around, and of course getting dirtier in the mud :).

After getting back, we went trout fishing. Christopher arranged for the licenses and fishing rods. We didn't go for angling as we were short on time, though I'd like to try it out some time soon.

Oh! I forgot to mention Kabir, Chirstopher's son, who is one of the cutest kids I've seen. Full of life and energy. When we were leaving the next morning, he refused to come and see us :).

After clearing up the last remains of trout from the Himalayan Trout House (of course Ashit was a big contributor to this) we were ready to leave. We had to wait till about 2 for the pouring rain to stop, for no good. Finally we decided to move or else we won't make it to office the next morning. And battling the pouring rain, the countless trucks and the dimming light, we managed to get back to Delhi at about 2 at night. A lovely weekend to a place I'd like to go again sometime.

Wednesday, March 4, 2009

Musings and Gyan!

I started work on a new geographic engine for geo-coding in C. Whatever I try to do, there will always be an overhead in Java. For example, for the routing implementation, I need a good priority queue implementation. I still have no idea how to get around the fact that only call by value is allowed. So if a change has to be made at the lower end of a heap (Say Fibonacci or Radix or a combo) what's the best way to update the structure. Effectively, the computation time goes for a toss and its best to use plain old binary heap.

So its time to go back to the basics again and write a good and portable C code. Why C and not C++... Somehow I haven't been able to convince myself on the advantages of using C++. Ok, it allows object oriented programming and all the fancy terms that come along with it... But unless if I write a good C code, and expose the APIs well, do I really need C++?

say a simple class for a hello world...

class test {
private:
int c;
public:
void setC(int &a) {
c = a;
}
int getC() {
return c;
}
};

same in C will be:

typedef struct _test {
int c;
}test;

void
setC(test *t,int *a) {
t->c = *a;
}

int getC(test *t) {
return t->c;
}

Gives a pretty good picture of what's the problem using C.

1) Access permission to the structure variables
2) The boon and bane, the ring of power, the STAR *; also known as the POINTER!

The idea is to get more people to program easily, reducing dependence on individual ability. The cost, whatever one may say, will be the performance. I can probably go on to include programming languages like Java into this, but I don't want to make things ugly.

Now, how do I extend this to the web...
1) CGI
2) Scripting languages

The good old CGI has always been there. It has its pros and cons. The advent and the growing popularity of languages like PHP and Python has given the second method a whole new light. And throw into that software like SWIG and BOOST... voila! Its a killer combo.

I tried this method for the first time for a small code we needed for image geo-tagging. So on top of Exiv2, I wrote a wrapper in C++ (Hehe...) and using Swig, I extended it to Python and PHP5. It works fabulously!

I have heard that JNI has a overhead problems etc. But if the C implementation is great, it might just offset that.

Then comes the cost of deployment.
Case 1: Shared/Dedicated hosting with no capital expenditure

A Linux server with Apache2 and mod-php5 or mod-python is the cheapest option! If Linux isn't an issue (If it is you're in the wrong blog!)

Case 2: Owned Servers

Besides the obvious fact stated above, lets go a little deeper into the enterprise architecture.

A typical Java hosting model will have a web server balancing single or multiple instances of application servers (Eg: JBoss or Glassfish) or servlet containers (Eg: Tomcat) using mod_jk or mod_proxy. So the web server (Apache2) will basically act as a load balancer and static content dispatcher. Great Setup! But investment cost is reasonable. A stand-by fail over at the web server level is quite easy. Setup requires and tuning requires some expertise, which isn't to hard to get either. Session management can be taken care of at the application server level leaving apache to communicate plain and simple using ajp.

Lets now try to see a PHP scenario.

One apache2 server with mod_php5. Can I split it here without losing the benefits of Apache modules? I don't think so. So This is the setup required. Next, site gets popular and load increases. I have to go for a load balancing solution. Either I do it using a proxy like Squid or a hardware load balancer. The latter is always a better option, but cost may be prohibitive. Session handling will need a change now. As per me, the best solution is to change PHP session handling from the local file to a Memcache instance.

This leaves us with the following setup:
1) Load balancer
2) Apache2/PHP5 servers
3) Memcache server

The PHP can be replaced easily with Python or Perl. And coming back to the start of the discussion, allows well written code in C/C++ to execute fast.


So the cost definitely gets higher from a setup point of view, compared to the first setup. But the most expensive component will be a one time cost. The former allows scaling to a good level, beyond which again the load balancer will come into play, though much later. The servers, however, will usually be much more powerful (Coming totally out of experience).

So my suggestion, have developers, have some money to through initially (from a good investment etc), go for the second setup. Makes much more sense.

So that's where I'm heading. I'll write this code with the objective being lowest possible response time. And then will think of extending it to Python, PHP and maybe even Java (JNI).