Friday, October 2, 2009

Bangalored!

So a lot has happened over the past couple of months. I quit MapmyIndia and moved to Bangalore. I decided to move away from GIS and get back to conventional IT work. So I've joined Akamai. Its a very different and nice feeling coming from a small company to an established mid-sized company. There are trade-offs, but as long as the net is positive, I think its worth it. So here I am!

I'm liking Bangalore; well most of it. I love the weather, though my wife isn't too fond of it. I love the fact that there are so many weekend getaways all around. I hate the fact that the city shuts down at 10 and is practically dead at 11! I hate the roads and the amazing one ways set up by the traffic police. There is not a single place I can go to and come back from taking the same route!

The people are generally nice here. I managed to get a nice apartment walking distance from my office. Gini is having a ball. She gets to run around a much bigger house and go for nice walks at least three times a day. She's even got her courage back! She isn't scared of street dogs anymore!

We went to Goa over the last weekend... But lets save that for the next post :-)

Saturday, June 20, 2009

Build 0.1 :)

So after a lot of labor, I have managed to churn out the first release of my address matching engine. I still have a couple of modules to add, but there's always pressure from the business side to release much before I am comfortable. So I put together all that I could asap and released it. I'm getting a match of about 60-70% at block (sub-locality) level for totally free running addresses from live data (some of which were extremely absurd).

I must say I impressed myself! Unfortunately for me, before I perfect something, or finish what I'm doing, something of 'business importance' comes up and I have to move on. One thing I'm sure of though, is that I'm definitely going to finish this paper and publish it in the next 2 months. Cheers to me!

Wednesday, June 3, 2009

Research is on!

It's been some time that I haven't posted anything. I got busy with my research, the same I've been meaning to commence for some time now. I've finally started work on my address matching engine. I'm using the n-gram model for address recognition. One stage of implementation is already over. I've also written a small spell checker. I used the concept of Xapian here. I have a tri-gram index for filtering the matches. After this I am using the Edit (Levenshtein) Distance on the filtered results.

Once the corrected keywords are obtained, I run the address string through the n-gram matcher. I am using QDBM as the data store and Boost.org C++ libraries. Even though it is in an extremely preliminary stage, my boss forced me to run a match on some address strings sent by a client. Even on a limited data set and an incomplete engine, I am getting a match rate excess of 40% at the block level. Of the 50-52% not matched, about 40% addresses were too ambiguous. I have not implemented features for input filtering and aliasing as yet. So my estimate is that once that is done, the match rate should reach at-least 75%.

I've also started working on my paper in which I intend to publish this work. Hope it works out and serves my ulterior motive :)

Sunday, May 17, 2009

Levenshtein Distance Algorithm: C Implementation

I needed to implement this algorithm for the new geo-coding engine I an working on. I searched the net and got a few decent implementations. I also looked at the Java implementation from the Apache Lucene library (http://lucene.apache.org). However, none satisfied the performance I was looking for. So Working on a few of them, I have written the following piece of code.

Basically it finds the distance between two strings; that is the number of edits (inserts/deletes/updates) it takes to convert one string to the other. For more details, visit http://en.wikipedia.org/wiki/Levenshtein_distance.

Since I was writing the code on Microsoft Visual Studio, I thought I might use the new operator for memory allocation. This was because Stroustrup says that the new operator does what malloc does, but better. It didn't work for my trial though. The execution time increased by two to three times when I did this! So I stuck to the basic C code and got reasonable results. It could match 1,000,000 pairs of about 50 characters each in 20 seconds on my machine (with OX flag on VC++).

Here's the code...


#define MIN_TWO(X, Y) ((X) < (Y) ? (X) : (Y))
#define MIN_THREE(X,Y,Z) ((MIN_TWO(MIN_TWO(X,Y),Z)))

int levenshtein_distance(char *s,int sLen, char*t, int tLen, float fuzzy = 0.5f);

/****************************************/
/*Implementation of Levenshtein distance*/
/****************************************/

/*Compute levenshtein distance between s and t*/
int levenshtein_distance(char *s, int sLen,char*t, int tLen,float fuzzy) {
//Step 1
int k,i,j,n,m,cost,*d,distance;
/* Maximum distance to allow a better best case. Can be changed when
I think of something better (Or re-read the lucene source :) )
*/
int maxDist = (int)(sLen *fuzzy), temp,p1,p2,p3;
maxDist = (maxDist<1)?1:maxDist;
n = sLen;
m = tLen;
if(n!=0&&m!=0) {
d=(int*)malloc((sizeof(int))*(m+1)*(n+1));
m++;
n++;
//Step 2
for(k=0;k<n;k++)
d[k]=k;
for(k=0;k<m;k++)
d[k*n]=k;
//Step 3 and 4
for(i=1;i<n;i++) {
for(j=1;j<m;j++) {
//Step 5
cost = (s[i-1]==t[j-1])?0:1;
temp = j*n;
p1 = d[temp-n+i]+1;
p2 = d[temp+i-1]+1;
p3 = d[temp-n+i-1]+cost;
d[temp+i] = MIN_THREE(p1,p2,p3);
}
if(d[i*n+i]>maxDist) {
//distance=sLen;
free(d);
return sLen;
}
}
distance=d[n*m-1];
free(d);
return distance;
}
else
return -1; //a negative return value means that one or both strings are empty.
}

Saturday, April 25, 2009

To Heaven And Back



Got some free time on a long weekend. Was thinking what the wife and I could do. Anyways, I met Ashit (my childhood best friend) in the evening and he told me that he and his friends from office were going to a place called the Himalayan Trout House (HTH), in Tirthan Valley, Himachal Pradesh, near Kullu. It is a little over 500 Km from Delhi. So it was decided that the we would drive down there on Saturday early in the morning.

There was one little snag though, we can't leave Ginnie, our one year old Labrador behind. So I talked to Christopher, the person running the HTH, and asked if I could bring her along. Thankfully he said OK, though he warned me of his three dogs (much needed). So the deal was sealed and we left Delhi at about 5 in the morning. Definitely a good decision as we crossed the Delhi border in just half an hour.

This was my first 200+ Km drive in my Aveo Uva and it proved to be worth every penny I spent on the car. Our first stop was just outside Chandigarh (which we reached in less than 3 hours). We stopped at a Dhaba for tea and Ginnie enjoyed running around and getting herself dirty. The dhaba owner was friendly towards her, so we just let her be.

With a couple of more stops (mainly for Ginnie) we continued. We made the mistake of taking NH 21A instead of NH 21 to Bilaspur. Even though it is a National Highway, it is in a very bad state. Plus, it adds at least 50 Km of extra hill way. Anyways, we continued and finally reached our destination comfortably, thanks to the guidance of MapmyIndia website :), and a little help from Ashit once we were on the state highway.

We were greeted by Thumki (an eight year old Lhasa) and Ready (a two year old mountain dog) pouncing on Ginnie. Thankfully Krishwa (the four year old alpha male) was tied up. However, the two later became good friends with Ginnie.

The first day we just lazed around, and yeah, had the most delicious trouts ever! In spite of my wife being a Bengali, I have not been able to get myself to appreciate fish. The heavenly trout changed everything (I even had Macher-Jhol-Bhat with Aloo and all a couple of days back)! The next day we went trekking to a waterfall. Nice short trek and a beautiful waterfall. We were the only ones there, along with Thumki and our guide, Deepak. Ginnie had a blast jumping into the water and splashing around, and of course getting dirtier in the mud :).

After getting back, we went trout fishing. Christopher arranged for the licenses and fishing rods. We didn't go for angling as we were short on time, though I'd like to try it out some time soon.

Oh! I forgot to mention Kabir, Chirstopher's son, who is one of the cutest kids I've seen. Full of life and energy. When we were leaving the next morning, he refused to come and see us :).

After clearing up the last remains of trout from the Himalayan Trout House (of course Ashit was a big contributor to this) we were ready to leave. We had to wait till about 2 for the pouring rain to stop, for no good. Finally we decided to move or else we won't make it to office the next morning. And battling the pouring rain, the countless trucks and the dimming light, we managed to get back to Delhi at about 2 at night. A lovely weekend to a place I'd like to go again sometime.

Wednesday, March 4, 2009

Musings and Gyan!

I started work on a new geographic engine for geo-coding in C. Whatever I try to do, there will always be an overhead in Java. For example, for the routing implementation, I need a good priority queue implementation. I still have no idea how to get around the fact that only call by value is allowed. So if a change has to be made at the lower end of a heap (Say Fibonacci or Radix or a combo) what's the best way to update the structure. Effectively, the computation time goes for a toss and its best to use plain old binary heap.

So its time to go back to the basics again and write a good and portable C code. Why C and not C++... Somehow I haven't been able to convince myself on the advantages of using C++. Ok, it allows object oriented programming and all the fancy terms that come along with it... But unless if I write a good C code, and expose the APIs well, do I really need C++?

say a simple class for a hello world...

class test {
private:
int c;
public:
void setC(int &a) {
c = a;
}
int getC() {
return c;
}
};

same in C will be:

typedef struct _test {
int c;
}test;

void
setC(test *t,int *a) {
t->c = *a;
}

int getC(test *t) {
return t->c;
}

Gives a pretty good picture of what's the problem using C.

1) Access permission to the structure variables
2) The boon and bane, the ring of power, the STAR *; also known as the POINTER!

The idea is to get more people to program easily, reducing dependence on individual ability. The cost, whatever one may say, will be the performance. I can probably go on to include programming languages like Java into this, but I don't want to make things ugly.

Now, how do I extend this to the web...
1) CGI
2) Scripting languages

The good old CGI has always been there. It has its pros and cons. The advent and the growing popularity of languages like PHP and Python has given the second method a whole new light. And throw into that software like SWIG and BOOST... voila! Its a killer combo.

I tried this method for the first time for a small code we needed for image geo-tagging. So on top of Exiv2, I wrote a wrapper in C++ (Hehe...) and using Swig, I extended it to Python and PHP5. It works fabulously!

I have heard that JNI has a overhead problems etc. But if the C implementation is great, it might just offset that.

Then comes the cost of deployment.
Case 1: Shared/Dedicated hosting with no capital expenditure

A Linux server with Apache2 and mod-php5 or mod-python is the cheapest option! If Linux isn't an issue (If it is you're in the wrong blog!)

Case 2: Owned Servers

Besides the obvious fact stated above, lets go a little deeper into the enterprise architecture.

A typical Java hosting model will have a web server balancing single or multiple instances of application servers (Eg: JBoss or Glassfish) or servlet containers (Eg: Tomcat) using mod_jk or mod_proxy. So the web server (Apache2) will basically act as a load balancer and static content dispatcher. Great Setup! But investment cost is reasonable. A stand-by fail over at the web server level is quite easy. Setup requires and tuning requires some expertise, which isn't to hard to get either. Session management can be taken care of at the application server level leaving apache to communicate plain and simple using ajp.

Lets now try to see a PHP scenario.

One apache2 server with mod_php5. Can I split it here without losing the benefits of Apache modules? I don't think so. So This is the setup required. Next, site gets popular and load increases. I have to go for a load balancing solution. Either I do it using a proxy like Squid or a hardware load balancer. The latter is always a better option, but cost may be prohibitive. Session handling will need a change now. As per me, the best solution is to change PHP session handling from the local file to a Memcache instance.

This leaves us with the following setup:
1) Load balancer
2) Apache2/PHP5 servers
3) Memcache server

The PHP can be replaced easily with Python or Perl. And coming back to the start of the discussion, allows well written code in C/C++ to execute fast.


So the cost definitely gets higher from a setup point of view, compared to the first setup. But the most expensive component will be a one time cost. The former allows scaling to a good level, beyond which again the load balancer will come into play, though much later. The servers, however, will usually be much more powerful (Coming totally out of experience).

So my suggestion, have developers, have some money to through initially (from a good investment etc), go for the second setup. Makes much more sense.

So that's where I'm heading. I'll write this code with the objective being lowest possible response time. And then will think of extending it to Python, PHP and maybe even Java (JNI).

Saturday, February 28, 2009

Retrospective

I was going through Slashdot and saw an article (http://ask.slashdot.org/article.pl?sid=09/02/28/037256). It starts with the author saying that He's been using GNU-Linux for the last 10 years. I realized that I have been using it for more than that... about 13 years now!

I've seen it come up from a difficult to install, geeky (the list goes on) to what it is today; a competitive desktop OS! I've gone through so many distros, spoilt so many working systems and played around with some FOSS. I got particularly interested in GAIM (now Pidgin) through a senior and friend at college.

So where am I finally when it comes to it. I'm a big supporter and consumer of FOSS, but my contribution has been minimal. Recently I reported a couple of bugs in the Geos library while building it on the Sun Studio tool chain. So basically that's the nature of the contribution I've made.

I do feel that the amount I'm taking from this unlimited pool, can not match the few drops I'm putting in. But I guess a lot of people are pouring in buckets one after another. That's what allows people like me to get away and still be at peace.

There was a poem I composed long time back... I'll post it some other time. It ended something like this:

Then a flash ripped me apart,
As I tried to run,
With guilt still within my heart...

Lets see how far the Icarus flies!

Friday, February 20, 2009

Way ahead

So I've finally decided what I'm going to do without looking back. I'm going to apply for MS/PhD for next year fall. I've started research work I want to get out before I apply. Its basically in the two areas I have been working most: Geo-coding addresses and Path finding.

Geo-coding for a country like India is beyond the realms of mathematical complexities. Yet, its something that must be done. I'm taking a different approach this time around. I'm working on getting parameters based on both textual match and geographic proximity to find a match. This approach, I believe, can work for a much wider array of input types.

The other, path finding, is a problem that has been worked on a lot by a lot of well renowned people. My idea is to develop an optimal parallel algorithm for a transportation (road) network.

Good luck to me!!!

Thursday, February 19, 2009

Bad Times

I was chatting with a friend from college who works in a mid-sized product development company. In fact, there was a time when I was thinking of joining the same. So she tells me that the CEO made an appearance in a Maruti 800 instead of his Accord and announced that they were in a really bad financial position. Basically they may not be able to pay the employees.

About time I got more serious about what I want to do! Always been in this dilemma, and it keeps coming back. Go ahead with academics and research or stay in the corporate world. I am and know can continue to do well in the corporate world. But not sure whether that's what I want. At times it seems like bliss. But, isn't ignorance bliss?

Guess I'll finally put some serious thought into it and make the final call... Till then, ciao!

Wednesday, February 18, 2009

Trying to clean up

I wrote about the mess I had gotten myself into. Almost out of it now. But one big learning out of this is however big a brand may be, go ahead only after root level testing.

Thought of a couple of ways to make the direction search faster and use lesser memory. Going with the good old C file approach. Think it will work. On paper looks great... Reducing the graph size by about 6 times. Speed figures I will get once I analyze the algorithm. Got to look for a parallel algorithm for routing. I was thinking that if a PND with a 500 Mhz processor was able to calculate routes so fast, I should be able to do much better. Lets see how it goes!

Monday, February 16, 2009

In a mess

The first time I took such a bad decision... I convinced the top to buy me a couple of Sun UltraSparc T5120 machines for our maps portal . All was well and rosy. I couldn't stop blabbering about these monstrous machines. Post setup and deployment, it goes into the testing stage and WHOA! Search times have gone low by at-least 10 times!

Traced it down to the fact that Lucene (one of the components I use in the search engine) does not run well (?) on UltraSparc platform! I split the index and used the ParallelSearcher on it. Result was still the same. No idea how I'm gonna go about it. For now, I've put this part on an AMD Opteron machine and its doing well. But I have no answer, except a sorry to give to my management. In times like this, a mistake as big as this might just cost me a lot more than it would have otherwise.

Monday, January 19, 2009

Bugged!

I was in Chennai on Friday for some work. Had a taste of the PSU employee's total lack of another person's time, energy and money. I flew down from Delhi after fixing everything up only to find that the persons concerned were on leave. Anyways, I ended the day by cursing them on the flight back home.

If this is how some of the biggest PSUs run in India, it is not surprising that it is difficult to have a public-private partnership! To think of it, if a small player or a startup tries to do this, they will be mightily #@$%*( !

Tuesday, January 13, 2009

Bitten by Bytes

Long time I haven't written anything. In fact, I forgot about the blog completely. All I've been trying to do is build the Geos library on Solaris platform. All I get is errors. It is suggested that I use Sun-Studio compiler, but that's what is causing the trouble. Lets see if something works out. Guess this will be my first 'Contribution' to the Open Source Community after a long time of working on OSS. I just hope that I don't face the same trouble with Proj or PostGIS!