The SQL killer

Monday, June 30th, 2008

There are things I like about SQL. It works, it is predictable, and it has been around forever (which makes it stable). But there are things that I don’t like about it. It is too structured, it takes a lot of work to make it efficient when you have a lot of data, and you often need to jump through hoops to make what seem like small changes while keeping your queries efficient.

SQL servers are not good at weeding through a large set of data to regroup or reaggregate it. They have two major limitations. The first is that they depend heavily on indexes to speed up reads. The unfortunate side effects of indexes are the performance cost to writes and the large amount of space they take. Indexes are also rigid: your queries and indexes need to match. A single query that doesn’t map well to any of your preconceived indexes could cause a costly table scan.
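To make the index-matching point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The posts table, its columns, and the index name are made up for illustration, and SQLite is only standing in for a SQL server in general; the exact wording of the plan differs between engines and versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's description of how it would run a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical table, with an index that anticipates lookups by user only.
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, topic TEXT, body TEXT)"
)
conn.execute("CREATE INDEX idx_posts_user ON posts (user_id)")

# This query matches the index, so the engine can search it directly.
indexed_plan = query_plan(conn, "SELECT * FROM posts WHERE user_id = ?", (42,))

# This query filters on a column no index covers, so the whole table is scanned.
unindexed_plan = query_plan(conn, "SELECT * FROM posts WHERE topic = ?", ("sleep",))

print(indexed_plan)
print(unindexed_plan)
```

The first plan should mention the index, while the second falls back to a scan of the table. Fixing the second query means adding another index, which brings back the write and storage costs described above.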

What makes the index problem harder to solve is SQL’s second limitation: it is, by architecture, hard to distribute. SQL assumes that all the data exists in one place (the table or index), and breaking up the processing is unnatural to it. You can, of course, take on the brunt of the work yourself and cluster your data, but that is not usually an easy task, and developers choose to do it only when they really have to.

I don’t think relational databases can go further than they already have. There needs to be a different solution that achieves at least two goals. First, the data processing should be distributable, so that you are not limited by the size of your biggest machine. Second, you should be able to query and/or aggregate data along any dimension. Only such a system would be able to mirror the kind of applications that we are building today. Distributed, large, more fluid, and multidimensional are some of the words that describe today’s web apps.

That is why I was ecstatic when I saw the CouchDB Project. It is a whole new way of looking at databases. It combines the distributable characteristics of map/reduce-style processing with the flexibility of key/value pair hashes and the lazy processing of data (what they call views), and ends up with a SQL killer.
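Here is a rough sketch of the map/reduce idea in plain Python. It is not CouchDB's API (CouchDB defines its views as JavaScript functions); every function name and document below is invented for illustration. Each "node" builds a partial view over its own schemaless documents, and the partial results are then merged, so no single machine needs to hold all the data.

```python
from collections import defaultdict


def map_doc(doc):
    """Emit (key, value) pairs for one document: one count per tag."""
    # Documents are free-form dicts; ones without tags simply emit nothing.
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_values(values):
    """Combine all values emitted for a single key."""
    return sum(values)


def build_partial_view(docs):
    """Run map and reduce over the documents held by one node."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(vals) for key, vals in grouped.items()}


def merge_views(partials):
    """Re-reduce the per-node results into a single view."""
    grouped = defaultdict(list)
    for partial in partials:
        for key, value in partial.items():
            grouped[key].append(value)
    return {key: reduce_values(vals) for key, vals in grouped.items()}


# Two hypothetical nodes, each holding part of the data.
node_a = [
    {"title": "Index woes", "tags": ["sql", "indexes"]},
    {"title": "Untagged note", "author": "sam"},
]
node_b = [
    {"title": "First look", "tags": ["sql", "couchdb"]},
    {"title": "Views", "tags": ["couchdb"]},
]

view = merge_views([build_partial_view(node_a), build_partial_view(node_b)])
print(view)
```

Aggregating along a different dimension (by author, say) only means writing a different map_doc, with no schema change. The merge step works here because summing is associative; a reduce function without that property could not be split across nodes this simply.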

I haven’t read or experimented enough yet, so you are better off reading the Introduction and Technical Overview on their site. But if this project weren’t in pre-alpha, I would have used it immediately at MedHelp. The code is on my machine, and it sounds like I will be having fun with this thing tonight.