Wednesday, March 4, 2009

Musings and Gyan!

I started work on a new geographic engine for geo-coding in C. Whatever I try to do, there will always be an overhead in Java. For example, the routing implementation needs a good priority queue. I still have no idea how to get around the fact that Java only allows call by value. So if a change has to be made at the lower end of a heap (say Fibonacci, radix, or a combination), what's the best way to update the structure? Effectively, the computation time goes for a toss, and it's best to use a plain old binary heap.
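To make the heap-update problem concrete, here is a minimal sketch in C of a binary min-heap that can lower the key of an element sitting deep in the heap. It is only an illustration, not the engine's actual code: the names (`min_heap`, `heap_push_or_decrease`, `heap_pop_min`), the fixed capacity `HEAP_CAP` and the use of small integer node ids are all my assumptions. The trick is the `pos` array, which remembers where each node currently sits so it can be found and moved in place.

```c
#include <assert.h>

#define HEAP_CAP 64   /* hypothetical fixed capacity; node ids are 0..HEAP_CAP-1 */

typedef struct {
    int    node[HEAP_CAP];  /* heap slot -> node id */
    double key[HEAP_CAP];   /* node id -> current key (e.g. tentative distance) */
    int    pos[HEAP_CAP];   /* node id -> heap slot, or -1 when not queued */
    int    size;
} min_heap;

void heap_init(min_heap *h) {
    h->size = 0;
    for (int i = 0; i < HEAP_CAP; i++) h->pos[i] = -1;
}

/* Swap two heap slots and keep the position index in sync. */
static void heap_swap(min_heap *h, int a, int b) {
    int na = h->node[a], nb = h->node[b];
    h->node[a] = nb; h->pos[nb] = a;
    h->node[b] = na; h->pos[na] = b;
}

static void sift_up(min_heap *h, int i) {
    while (i > 0) {
        int parent = (i - 1) / 2;
        if (h->key[h->node[parent]] <= h->key[h->node[i]]) break;
        heap_swap(h, parent, i);
        i = parent;
    }
}

static void sift_down(min_heap *h, int i) {
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->size && h->key[h->node[l]] < h->key[h->node[m]]) m = l;
        if (r < h->size && h->key[h->node[r]] < h->key[h->node[m]]) m = r;
        if (m == i) break;
        heap_swap(h, i, m);
        i = m;
    }
}

/* Insert a node, or lower its key in place if it is already queued. */
void heap_push_or_decrease(min_heap *h, int node, double key) {
    assert(node >= 0 && node < HEAP_CAP);
    if (h->pos[node] < 0) {              /* not queued yet: append and sift up */
        int slot = h->size++;
        h->key[node] = key;
        h->node[slot] = node;
        h->pos[node] = slot;
        sift_up(h, slot);
    } else if (key < h->key[node]) {     /* queued: update where it sits */
        h->key[node] = key;
        sift_up(h, h->pos[node]);
    }
}

/* Remove and return the node with the smallest key, or -1 if empty. */
int heap_pop_min(min_heap *h) {
    if (h->size == 0) return -1;
    int top = h->node[0];
    heap_swap(h, 0, h->size - 1);
    h->size--;
    h->pos[top] = -1;
    sift_down(h, 0);
    return top;
}
```

In a routing loop, a relaxed edge would call `heap_push_or_decrease(&h, v, new_dist)` and the main loop would call `heap_pop_min(&h)` until it returns -1. Everything lives in three flat arrays, so there is no per-element allocation.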

So it's time to go back to the basics again and write good, portable C code. Why C and not C++? Somehow I haven't been able to convince myself of the advantages of using C++. OK, it allows object-oriented programming and all the fancy terms that come along with it. But if I write good C code and expose the APIs well, do I really need C++?

Say, a simple class as a hello world:

class test {
private:
    int c;
public:
    void setC(int &a) {
        c = a;
    }
    int getC() {
        return c;
    }
};

The same in C will be:

typedef struct _test {
    int c;
} test;

void setC(test *t, int *a) {
    t->c = *a;
}

int getC(test *t) {
    return t->c;
}

This gives a pretty good picture of the problems with using C:

1) Access permission to the structure variables
2) The boon and bane, the ring of power, the STAR *; also known as the POINTER!
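The first point can be softened in plain C with an opaque type: the header only declares the struct, and the layout stays private to one source file. Below is a minimal sketch of that idea, shown as a single file with comments marking where the header/source split would go. The `test_` prefixed names are my own for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* ---- would live in test.h: callers only see an incomplete type ---- */
typedef struct test test;

test *test_new(void);
void  test_free(test *t);
void  test_set_c(test *t, int a);
int   test_get_c(const test *t);

/* ---- would live in test.c: the layout is visible only here ---- */
struct test {
    int c;   /* effectively "private": code outside test.c cannot name it */
};

test *test_new(void) {
    return calloc(1, sizeof(test));   /* zero-initialised; NULL on failure */
}

void test_free(test *t) {
    free(t);
}

void test_set_c(test *t, int a) {
    assert(t != NULL);
    t->c = a;
}

int test_get_c(const test *t) {
    assert(t != NULL);
    return t->c;
}
```

With the split in place, code that includes only the header cannot write `t->c` at all; the compiler rejects it because the struct is incomplete. The price is a heap allocation per object and, of course, the pointer.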

The idea behind C++ is to get more people to program easily, reducing dependence on individual ability. The cost, whatever one may say, will be performance. I could probably go on to include languages like Java in this, but I don't want to make things ugly.

Now, how do I extend this to the web...
1) CGI
2) Scripting languages

Good old CGI has always been there, with its pros and cons. The advent and growing popularity of languages like PHP and Python have cast the second method in a whole new light. Throw software like SWIG and Boost into the mix and voila! It's a killer combo.
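For the first option, a CGI program in C boils down to reading the request from environment variables and printing a header block, a blank line and a body. Here is a rough sketch of that, assuming the server passes the query in the standard `QUERY_STRING` variable. The function names are hypothetical, and the body simply echoes the query back as plain text instead of doing any geo-coding.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Format a complete CGI response (header, blank line, body) into buf.
   Returns the number of bytes written, or -1 if buf is too small. */
int cgi_format_response(char *buf, size_t cap, const char *query) {
    assert(buf != NULL && cap > 0);
    if (query == NULL) query = "";   /* QUERY_STRING may be unset */
    int n = snprintf(buf, cap,
                     "Content-Type: text/plain\r\n"
                     "\r\n"
                     "query=%s\n", query);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* What the web server would effectively run for each request. */
int cgi_handle_request(FILE *out) {
    char buf[1024];
    if (cgi_format_response(buf, sizeof buf, getenv("QUERY_STRING")) < 0)
        return 1;                    /* response did not fit */
    fputs(buf, out);
    return 0;
}
```

The program's `main` would just `return cgi_handle_request(stdout);`. Every request starts a fresh process, which is the main con of CGI, but nothing sits between the web server and the C code.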

I tried the scripting route for the first time on a small piece of code we needed for image geo-tagging. On top of Exiv2, I wrote a wrapper in C++ (hehe...) and, using SWIG, extended it to Python and PHP5. It works fabulously!

I have heard that JNI has overhead problems. But if the C implementation is fast enough, it might just offset that.

Then comes the cost of deployment.
Case 1: Shared/Dedicated hosting with no capital expenditure

A Linux server with Apache2 and mod_php5 or mod_python is the cheapest option, provided Linux isn't an issue (if it is, you're on the wrong blog!).

Case 2: Owned Servers

Besides the obvious fact stated above, let's go a little deeper into the enterprise architecture.

A typical Java hosting model will have a web server balancing one or more instances of application servers (e.g. JBoss or Glassfish) or servlet containers (e.g. Tomcat) using mod_jk or mod_proxy. So the web server (Apache2) basically acts as a load balancer and static content dispatcher. It's a great setup, and the investment cost is reasonable. A stand-by failover at the web server level is quite easy. Setup and tuning require some expertise, which isn't too hard to get either. Session management can be taken care of at the application server level, leaving Apache to communicate plain and simple using AJP.

Let's now look at a PHP scenario.

Start with one Apache2 server with mod_php5. Can I split it here without losing the benefits of Apache modules? I don't think so, so this is the minimum setup. Next, the site gets popular and load increases, and I have to go for a load balancing solution, either a proxy like Squid or a hardware load balancer. The latter is always the better option, but the cost may be prohibitive. Session handling will need a change now: with more than one web server, sessions can no longer live in local files. In my opinion, the best solution is to move PHP session storage from local files to a Memcached instance.

This leaves us with the following setup:
1) Load balancer
2) Apache2/PHP5 servers
3) Memcached server

PHP can easily be replaced with Python or Perl. And, coming back to the start of the discussion, this setup lets well-written C/C++ code do the heavy lifting and execute fast.


So the cost definitely gets higher from a setup point of view, compared to the first (Java) setup. But the most expensive component, the load balancer, is a one-time cost. The Java setup scales to a good level before a load balancer has to come into play, though that point arrives much later. Its servers, however, will usually have to be much more powerful (speaking purely from experience).

So my suggestion: if you have developers and some money to throw in initially (from a good investment etc.), go for the second setup. It makes much more sense.

So that's where I'm heading. I'll write this code with the objective of the lowest possible response time. Then I'll think about extending it to Python, PHP and maybe even Java (JNI).