
Rich Atkinson

Rich Atkinson's Personal Blog

BigTable and Why it Changes Everything

Background

For the last couple of weeks I’ve been playing with Google App Engine.

In case you’ve been living in a cave for the last month, App Engine is a mostly complete, sandboxed Python 2.5 environment with a WSGI web server and a very interesting Datastore API.

Production Like Dev Environment

App Engine comes in two flavours. The first is a development server, which contains everything you need to run App Engine code on your desktop (Mac, Linux, or Windows): the datastore, a web server, and a rudimentary web interface for the datastore. The second is the production environment on Google’s infrastructure, which is where your application ends up when you deploy it.

Simple to Deploy

The development server comes with a ‘deploy’ feature that lets you deploy, with pretty much one click, your whole application to Google’s infrastructure.

What’s so cool about it?

Firstly, ‘No Assembly Required’. That means no fiddling with database parameters, creating databases, users, permissions, no installing updates, patches, libraries, no configuring web servers, firewalls. Nothing. This, in itself, is a HUGE time saver.

Secondly, it’s Python, and it’s WSGI. That means you can use Django, web.py, Pylons, or just about any Python framework with it. If yours doesn’t work out of the box, then it shouldn’t be hard to modify to fit.

However, in keeping with “no assembly required”, App Engine comes with its own minimalist webapp framework.

Now webapp is actually very nice to use. It uses the elegant and simple WebOb library, and because it doesn’t impose any module or directory convention, you can easily adopt whatever structure you like.
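To show how little WSGI itself asks for, here is a minimal sketch that uses only the Python standard library. It is not the webapp framework or anything specific to App Engine; the names `application` and `call_app` are made up for this illustration.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A WSGI app is just a callable taking environ and start_response."""
    body = ("Hello from " + environ.get("PATH_INFO", "/")).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    # The return value is an iterable of byte strings.
    return [body]


def call_app(app, path):
    """Hypothetical helper: call a WSGI app in-process, without a server."""
    environ = {}
    setup_testing_defaults(environ)  # fill in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_app(application, "/pets")
```

Any framework that can produce a callable of this shape can, in principle, sit behind a WSGI server, which is why porting one to a new host is usually a matter of adapting the edges rather than rewriting the application.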

All of this is cool, but it’s not what’s really cool.

BigTable

BigTable and the Datastore API are awesome.

A BigTable is a sparse, distributed, persistent multi-dimensional sorted map.

Ostensibly, a BigTable is an alternative to a relational database. Google engineers designed BigTable out of necessity, as a mechanism for storing, indexing, and retrieving petabytes of arbitrary data across multiple distributed data centers.

We all use BigTables every day through Google search, Gmail, Google Maps, Google Earth, etc.
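The “sparse, multi-dimensional sorted map” part of that definition is easier to picture with a toy. The sketch below is a single-process, in-memory illustration only: it ignores the “distributed” and “persistent” parts entirely, and `ToySortedMap` and its methods are hypothetical names, not anything from Google’s implementation. It assumes cells are addressed by row, column and timestamp.

```python
import bisect


class ToySortedMap(object):
    """Toy map from (row, column, timestamp) to value, kept sorted by key."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key tuple -> value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so the newest sorts first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value of a cell, or None if the cell is empty."""
        i = bisect.bisect_left(self._keys, (row, column))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


table = ToySortedMap()
table.put("com.example/index", "contents:html", 1, "<html>v1</html>")
table.put("com.example/index", "contents:html", 2, "<html>v2</html>")
table.put("com.example/about", "anchor:home", 1, "About us")

newest = table.get("com.example/index", "contents:html")
scanned_rows = [r[0] for r in table.scan_rows("com.example/", "com.example0")]
```

“Sparse” here means a row only stores the columns it actually has: the two rows above share no columns, and nothing is wasted on empty cells. “Sorted” is what makes the range scan cheap, because neighbouring row keys sit next to each other.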

Datastore API

The App Engine Datastore API exposes BigTable to our Python applications as an object persistence API. It does this so simply that you can almost miss how beautiful it is.

Take a look at this model definition:

from google.appengine.ext import db


class Pet(db.Model):
    # Each class attribute declares one persistent property of the entity.
    name = db.StringProperty(required=True)
    type = db.StringProperty(required=True, choices=set(["cat", "dog", "bird"]))
    birthdate = db.DateProperty()

That’s it. No Hibernate mapping files, no database connection properties. Just a class definition.

If you write some code to query the datastore on a property, then the indexes are defined and created automatically for you at deployment time.
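As a rough mental model of why a query needs an index at all, think of an index as a sorted list of (property value, entity key) pairs that can be scanned in one contiguous run. The sketch below is plain Python, not the App Engine API, and says nothing about how Google actually stores its indexes; `PropertyIndex` and the sample pets are invented for illustration.

```python
import bisect


class PropertyIndex(object):
    """Toy index: sorted (property value, entity key) pairs for one property."""

    def __init__(self):
        self._entries = []

    def add(self, value, entity_key):
        bisect.insort(self._entries, (value, entity_key))

    def equal_to(self, value):
        """Return keys of entities whose property equals value."""
        i = bisect.bisect_left(self._entries, (value,))
        keys = []
        # Matching entries are adjacent, so the scan stops at the first miss.
        while i < len(self._entries) and self._entries[i][0] == value:
            keys.append(self._entries[i][1])
            i += 1
        return keys


# Hypothetical stored entities, keyed by an entity key.
pets = {
    "pet1": {"name": "Fluffy", "type": "cat"},
    "pet2": {"name": "Rex", "type": "dog"},
    "pet3": {"name": "Tom", "type": "cat"},
}

type_index = PropertyIndex()
for key, pet in pets.items():
    type_index.add(pet["type"], key)

cats = [pets[key]["name"] for key in type_index.equal_to("cat")]
```

The point of the toy is that the work done by the query depends on the number of matches, not on the number of pets stored.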

Expando: Dynamic persistent objects

App Engine comes with dynamic persistent models called expando models.

These really begin to demonstrate the flexibility of a BigTable over a relational database. A model that subclasses the Expando superclass may have properties added dynamically at run time.

For example:

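from google.appengine.ext import db


class Person(db.Expando):
    # Fixed properties, declared up front as on db.Model.
    first_name = db.StringProperty()
    last_name = db.StringProperty()
    hobbies = db.StringListProperty()


p = Person(first_name="Albert", last_name="Johnson")
p.hobbies = ["chess", "travel"]

# Dynamic properties: never declared on the class, created on assignment.
p.chess_elo_rating = 1350
p.travel_countries_visited = ["Spain", "Italy", "USA", "Brazil"]
p.travel_trip_count = 13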

Map, Reduce

Google applications operate on a scale in excess of what most people consider high volume. To manage this, Google engineers developed a programming model called MapReduce.

Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model…

As I understand it, MapReduce allows Google to use BigTable while taking advantage of the virtually unlimited parallelism provided by their underlying hardware infrastructure, which is famously a vast array of cheap commodity PCs.
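The map and reduce functions described above are simple enough to sketch with the classic word count. This is a sequential, single-machine stand-in written for illustration; it is not Google’s MapReduce, and `map_words`, `reduce_counts` and `run_mapreduce` are hypothetical names.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))


documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog"),
    ("doc3", "the fox"),
]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
```

Each call to `map_words` looks at one document only, and each call to `reduce_counts` looks at one key only. Nothing is shared between calls, which is the property that would let a real framework spread them across many machines.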

What does this mean for programming?

Putting MapReduce aside for a moment, I believe the App Engine Datastore API provides a clear indication of the future of data persistence.

In Enterprise IT today, some of the biggest challenges and complexities arise from trying to build fast, reliable, distributed, fault-tolerant, persistent, indexed storage out of relational databases, and from making that storage available to application developers in a secure and simple way.

I believe that by redesigning persistent object storage from the bottom up, Google have blown away the multiple square peg / round hole problems of objects, databases, servers and data centers.

In providing the AppEngine Datastore API, Google have shown us that we can take full advantage of BigTables in our current, everyday, mortal programming ways.

It’s like we’ve been using rocket engines and Google’s just shown us the warp drive!

BigTable and Why it Changes Everything first appeared on jetfar.com.

Written by Rich Atkinson

July 14, 2008 at 1:25 pm

Posted in Web Tech

12 Responses


  1. The object-database mapping using just a class definition – that was done in django first (which has been patched to work with google apps). (or maybe previously somewhere else – but definitely not a new thing).

    My chief concern about using google apps BigTable is lock-in. Still, if you consider this the only warp engine in town, and you have some site that would be unusably slow if moved to a normal relational db, I can see your point.

    carbonninja

    July 14, 2008 at 2:20 pm

  2. @Carbonninja

    What I believe Google provide here that we haven’t seen before is an object model that can be changed anytime, even at run time, without altering any tables. Even Django-evolution (http://code.google.com/p/django-evolution/) can’t offer that.

    And yes, the class definition I think is based on Django’s model API. But what’s so cool here is we don’t need to set up a database, or manage any permissions.

    I agree with you about the lock-in though. I’ll still be using Postgres and mod_wsgi until there is a good open source implementation of Bigtable that I can use in Django.

    Thanks for commenting :-)

    Rich Atkinson

    July 14, 2008 at 2:27 pm

  3. Check out HBase for an open source bigtable. It’s part of the Apache Lucene project. They also have Hadoop, a map/reduce impl, and Nutch, a search engine written on top of it.

    It’s written in Java, but is callable from Python very easily. I use it from TurboGears, but it should be equally usable from Django.

    Dan

    July 14, 2008 at 4:09 pm

  4. Hadoop has an open-source table that works like BigTable, if I understand correctly…

    Ben

    July 14, 2008 at 5:53 pm

  5. You only compare BigTable to RDBMSs, but from your description, it looks much more like a match for an OODB. How does it compare to other OODBs?

    ken

    July 14, 2008 at 6:40 pm

  6. I looked at BigTable. I’m sure it’s fine for most web apps, but the query facilities are very very very limited. BigTable surely isn’t supposed to support data warehousing / analytics scenarios.

    fauigerzigerk

    July 14, 2008 at 8:21 pm

  7. @Dan
    Thanks! I didn’t realise Hbase was ready for mass use yet. I’ll definitely look into it. Cheers.

    @Ben
    Hadoop is related, but slightly different. It’s a Java implementation of mapreduce.

    @Ken
    You are correct, conceptually this is closer to other types of OODB, such as Cache or Zope.

    However I’ve never been quite so captivated by any previous OODB implementation.

    I expect that it’s the zero configuration aspect that caught my imagination. I’d like to see a lot more of this in the future.

    @fauigerzigerk
    Limited querying capability is probably the most common complaint about the Datastore API. I’m no expert in analytics or warehousing, but I suspect that this model is just as powerful as an RDBMS and only requires a different approach to fully exploit its potential.

    If you look at what Google do with maps, earth, search… I expect it’s implementing your algorithms using the map/reduce principles that makes all the difference. It would be interesting to have a look at some specific examples of perceived limitations, and how they could be approached using map/reduce.

    Thanks all for your comments.

    - Rich

    Rich Atkinson

    July 14, 2008 at 10:21 pm

  8. Well now I’m really suspicious! I’d never heard of Cache, and I’ve only heard bad things about ZODB. While that’s not necessarily bad, it’s not a good sign when a new product only looks good compared to the worst of the old bunch. I wonder how it compares to OODBs people actually like, like db4o, AllegroGraph, or even Versant.

    ken

    July 15, 2008 at 5:12 pm

  9. Hi Ken.
    It would be unfair of me to compare it against any of the above as I don’t have experience with them.

    I’d love to hear from anyone who does?

    Cheers
    - Rich

    Rich Atkinson

    July 15, 2008 at 11:09 pm

  10. Sure, bigtable is a different way to look at RDBMSs, but I wouldn’t yet go as far as to say it changes everything. It is one more option. But a substantial portion of the current enterprise application portfolio is unlikely to be heading the bigtable way, since many of those applications may not be comfortable with the limited transactional capabilities of Big Table.

    Dhananjay Nene

    July 16, 2008 at 7:33 pm

  11. [...] BigTable and Why it Changes Everything [...]

  12. I think you over emphasized the title…

    Rich: I don’t think I did. Google have redefined application deployment with app engine. Not only through their simple one line deploy tool (easy to do with fabric, capistrano, ant, maven, etc) but also through the zero configuration database.

    Just define your models and off you go. Have you seen that before?

    Pedram

    November 15, 2008 at 9:39 am

