
BigTable and Why it Changes Everything

Background

For the last couple of weeks I’ve been playing with Google App Engine.

In case you’ve been living in a cave for the last month, App Engine is a mostly complete, sandboxed Python 2.5 environment with a WSGI web server and a very interesting Datastore API.

Production-Like Dev Environment

App Engine comes in two flavours: a development server on your desktop and Google’s production infrastructure. The development server contains everything you need to run App Engine code locally (Mac, Linux, or Windows): the datastore, a web server, and a rudimentary web interface for the datastore.

Simple to Deploy

The development server comes with a ‘deploy’ feature that pushes your whole application to Google’s infrastructure with pretty much one click.

What’s so cool about it?

Firstly, ‘No Assembly Required’. That means no fiddling with database parameters, creating databases, users, permissions, no installing updates, patches, libraries, no configuring web servers, firewalls. Nothing. This, in itself, is a HUGE time saver.

Secondly, it’s Python, and it’s WSGI. That means you can use Django, web.py, Pylons, or just about any Python framework with it. If yours doesn’t work out of the box, then it shouldn’t be hard to modify to fit.
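To make the WSGI point concrete, here is a minimal sketch of what a WSGI application looks like, using only the standard library. It is written for a current Python 3 rather than App Engine’s 2.5 runtime, and the names `application` and `call_app` are placeholders of mine, not anything App Engine requires. Every framework listed above ultimately hands the server a callable of roughly this shape.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A WSGI app: takes the request environ, returns an iterable of bytes."""
    path = environ.get("PATH_INFO", "/")
    body = ("Hello from " + path).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI app in-process, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys WSGI requires
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        # Record what the app reports so the caller can inspect it.
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Calling `call_app(application, "/pets")` returns the status line and the response body, which is all a WSGI server needs from your code.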

However, in keeping with “no assembly required”, App Engine comes with its own minimalist webapp framework.

Now webapp is actually very nice to use. It uses the elegant and simple WebOb library, and because it doesn’t impose any module or directory convention, you can easily adopt whatever structure you like.

All of this is cool, but it’s not what’s really cool.

BigTable

BigTable and the Datastore API are awesome.

A BigTable is a sparse, distributed, persistent multi-dimensional sorted map.
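That definition is dense, so here is a toy, in-memory sketch of just the “sparse, multi-dimensional sorted map” part. It assumes the indexing scheme described in Google’s BigTable paper, where each value is addressed by a row key, a column key, and a timestamp. The class `ToySortedMap` and its method names are hypothetical and are not Google’s API. The distributed and persistent parts are left out entirely.

```python
import bisect


class ToySortedMap:
    """In-memory toy: (row, column, timestamp) -> value, kept sorted by key."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, timestamp) tuples
        self._values = {}  # key tuple -> value

    def put(self, row, column, timestamp, value):
        key = (row, column, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get_latest(self, row, column):
        """Return the value with the highest timestamp, or None if absent."""
        # Find the position just past every (row, column, *) key.
        hi = bisect.bisect_right(self._keys, (row, column, float("inf")))
        if hi == 0:
            return None
        key = self._keys[hi - 1]
        if key[:2] != (row, column):
            return None  # sparse: this cell was never written
        return self._values[key]

    def scan_rows(self, start_row, end_row):
        """Yield (key, value) for rows in [start_row, end_row), in key order."""
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (end_row,))
        for key in self._keys[lo:hi]:
            yield key, self._values[key]
```

Because the keys are sorted, a range of neighbouring rows can be read with one scan, and because the map is sparse, a row only pays for the columns it actually contains.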

Ostensibly, a BigTable is an alternative to a relational database. Google engineers designed BigTable out of necessity as a mechanism to facilitate the storage, indexing, and retrieval of petabytes of arbitrary data across multiple distributed data centers.

We all use BigTables every day through Google search, Gmail, Google Maps, Google Earth, etc.

Datastore API

The App Engine Datastore API exposes BigTable to our Python applications as an object persistence API. It does this so simply that you can almost miss how beautiful it is.

Take a look at this model definition:

from google.appengine.ext import db


class Pet(db.Model):
    # Each class attribute declares a typed, persistent property.
    name = db.StringProperty(required=True)
    type = db.StringProperty(required=True, choices=set(["cat", "dog", "bird"]))
    birthdate = db.DateProperty()

That’s it. No Hibernate mapping files, no database connection properties. Just a class definition.

If you write code that queries the datastore on a property, the indexes it needs are defined and created for you automatically at deployment time.

Expando: Dynamic persistent objects

App Engine comes with dynamic persistent models called expando models.

These really begin to demonstrate the flexibility of a BigTable over a relational database. A model that subclasses the Expando superclass may have properties added dynamically at run time.

For example:

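from google.appengine.ext import db


class Person(db.Expando):
    # Fixed properties, declared up front as in a regular model.
    first_name = db.StringProperty()
    last_name = db.StringProperty()
    hobbies = db.StringListProperty()


p = Person(first_name="Albert", last_name="Johnson")
p.hobbies = ["chess", "travel"]
# Dynamic properties: not declared on the class, added at run time.
p.chess_elo_rating = 1350
p.travel_countries_visited = ["Spain", "Italy", "USA", "Brazil"]
p.travel_trip_count = 13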

MapReduce

Google applications operate on a scale in excess of what most people consider high volume. To manage this, Google engineers developed a programming model called MapReduce.

Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model…
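As a rough illustration of that description, here is the classic word-count example as a single-process sketch. The function names `map_fn`, `reduce_fn`, and `run_mapreduce` are my own, and nothing here reflects Google’s actual implementation, which distributes the same three steps across many machines.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: one (key, value) pair in, many intermediate pairs out."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share a key."""
    return word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Single-process driver: map, group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))
```

Each call to `map_fn` and each call to `reduce_fn` depends only on its own arguments, which is what lets a real system run them in parallel. The grouping step in the middle is where the framework has to move data between machines.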

As I understand it, MapReduce allows Google to use BigTable while taking advantage of the virtually unlimited parallelism provided by their underlying hardware infrastructure, which is famously a vast array of cheap commodity PCs.

What does this mean for programming?

Putting MapReduce aside for a moment, I believe the App Engine Datastore API provides a clear indication of the future of data persistence.

In enterprise IT today, some of the biggest challenges and complexities arise from trying to build fast, reliable, distributed, fault-tolerant, persistent, indexed storage on top of relational databases, and from making it available to application developers in a secure and simple way.

I believe that by redesigning persistent object storage from the bottom up, Google have blown away the many square peg / round hole problems of objects, databases, servers and data centers.

In providing the App Engine Datastore API, Google have shown us that we can take full advantage of BigTables in our current, everyday, mortal programming ways.

It’s like we’ve been using rocket engines and Google’s just shown us the warp drive!

BigTable and Why it Changes Everything first appeared on jetfar.com.
