Thursday, 3 July 2008

Hadoop beats terabyte sort record

Hadoop has beaten the record for the terabyte sort benchmark, bringing it from 297 seconds to 209. Owen O'Malley wrote the MapReduce program (which by the way has a clever partitioner to ensure the reducer outputs are globally sorted and not just sorted per output partition, which is what the default sort does), and then ran it on 910 nodes on Yahoo!'s cluster. There are more details in Owen's blog post (and there's a link to the benchmark page which has a PDF explaining his program). You can also look at the code in trunk.

Well done Owen and well done Hadoop!

Friday, 20 June 2008

Hadoop Query Languages

If you want a high-level query language for drilling into your huge Hadoop dataset, then you've got some choice:
  • Pig, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files.
  • Jaql, from IBM and soon to be open sourced, is a declarative query language for JSON data.
  • Hive, from Facebook and soon to become a Hadoop contrib module, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.
All three projects have different strengths, but there is plenty of scope for collaboration and cross-pollination, particularly in the query language. For example, at the Hadoop Summit in March, Joydeep Sen Sarma of Facebook said that they would be receptive to users who wanted to use Pig Latin or Jaql in Hive. And Kevin Beyer of IBM Research said that Pig and Jaql are converging, and they've had discussions with the Pig team about how to bring them even closer together.

Meanwhile, to learn more I recommend Pig Latin: A Not-So-Foreign Language for Data Processing (by Chris Olston et al), and the slides and videos from the Hadoop Summit.

(And I haven't even included Cascading, from Chris K. Wensel, which, while not a query language per se, is an abstraction built on MapReduce for building data processing flows in Java or Groovy using a plumbing metaphor with constructs such as taps, pipes, and flows. Well worth a look too.)

Friday, 13 June 2008

"The Next Big Thing"

James Hamilton on The Next Big Thing:
Storing blobs in the sky is fine but pretty reproducible by any competitor. Storing structured data as well as blobs is considerably more interesting but what has even more lasting business value is the storing data in the cloud AND providing a programming platform for multi-thousand node data analysis. Almost every reasonable business on the planet has a complex set of dimensions that need to be optimized.

I think we're only beginning to see interesting data processing being done in the cloud - there's much more to come.

Friday, 30 May 2008

Bluetooth Castle

Today I visited Raglan Castle in Monmouthshire with my family. Cadw, the government body that manages the castle, were running a trial to deliver audio files to visitors' mobile phones using Bluetooth. As I walked through the entrance I simply made my phone discoverable, waited a few seconds for the MP3 to download, then started listening to a guided tour of the castle. It's a great use of the technology: it just worked.

The talk only lasted a few minutes, so we had plenty of time afterwards to run around the ruins.

A couple of technical questions that sprang to mind:
  1. How would you set up a server to push files over Bluetooth? (There are loads of ways you could use this - maps of the local area at transport hubs, sharing the schedule at conferences, random photo of the day at home, etc.)
  2. Can you make audio files navigable? That is, make it easy to go to the part of audio file that is about a given exhibit by typing in the exhibit's number? (This problem reminds me of Cliff Schmidt's talk about the Talking Book Device at ApacheCon EU 2008.)

Tuesday, 22 April 2008

Portable Cloud Computing

Last July I asked "Why are there no Amazon S3/EC2 competitors?", lamenting the lack of competition in the utility or cloud computing market and the implications for disaster recovery. Closely tied to disaster recover is portability -- the ability to switch between different utility computing providers as easily as I switch electricity suppliers. (OK, that can be a pain, at least in the UK, but it doesn't require rewiring my house.)

It's useful to compare Amazon Web Services with Google's recently launched App Engine in these terms. In some sense they compete, but they are strikingly different offerings. Amazon's services are bottom up: "here's a CPU, now install your own software". Google's is top down: "write some code to these APIs and we'll scale it for you". There's no way I can take my EC2 services and run them on App Engine. But I can do the reverse -- sort of -- thanks to AppDrop.

And that's the thing. What I would love is a utility service from different providers that I can switch between. That's one of the meanings of "utility", after all. (Moving lots of data between providers may make this difficult or expensive to do in practice -- "data inertia" -- but that's not a reason not to have the capability.)

There are at least two ways this can happen. One's the AppDrop approach -- emulate Google's API and provide an alternative place to run applications, in this case it's EC2.

However, there's another way: build "standard, non-proprietary cloud APIs with open-source implementations", as Doug Cutting says on his blog post Cloud: commodity or proprietary? In this world, applications use a common API, and host with whichever provider they fancy. Bottom up offerings like Amazon's facilitate this approach: the underlying platforms may differ, but it's easy to run your application on the provided platform -- for example, by building an Amazon AMI. Google's top down approach is not so amenable, application developers are locked-in to the APIs Google provide. (Of course, Google may open this platform up more over time, but it remains to be seen if they will open it up to the extent of being able to run arbitrary executables.)

As Doug notes, Hadoop is providing a lot of the building blocks for building cloud services: filesystem (HDFS), database (HBase), computation (MapReduce), coordination (ZooKeeper). And here, perhaps, is where the two approaches may meet -- AppDrop could be backed by HBase (or indeed Hypertable), or HBase (and Hypertable) could have an open API which your application could use directly.

Rails or Django on HBase, anyone?

Monday, 14 April 2008

Hadoop at ApacheCon Europe

On Friday in Amsterdam there was a lot of Hadoop on the menu at ApacheCon. I kicked it off at 9am with A Tour of Apache Hadoop, Owen O'Malley followed with Programming with Hadoop’s Map/Reduce, and Allen Wittenauer finished off after lunch with Deploying Grid Services using Apache Hadoop. Find the slides on the Hadoop Presentations page of the wiki. I've also embedded mine below.

I only saw half of Allen's talk as I had to catch my plane, but I was there long enough to see his interesting choice of HDFS users... :)



Also at ApacheCon I enjoyed meeting the Mahout people (Grant, Karl, Isabel and Erik), seeing Cliff Schmidt's keynote, and generally meeting interesting people.

Sunday, 30 March 2008

Turn off the lights when you're not using them, please

One of the things that struck me about this week's new Amazon EC2 features was the pricing model for Elastic IP addresses:
$0.01 per hour when not mapped to a running instance
The idea is to encourage people to stop hogging public IP addresses, which are a limited resource, when they don't need them.

I think one way of viewing EC2 - and the other Amazon utility services - is as a way of putting very fine-grained costs on various computing operations. So will such a pricing model drive us to minimise the computing resources we use to solve a particular problem? My hope is that making computing costs more transparent will at least make us think about what we're using more, in the way metered electricity makes (some of) us think twice about leaving the lights on. Perhaps we'll even start talking about optimizing for monetary cost or energy usage rather than purely raw speed?