Friday, 1 May 2009

"Hadoop: The Definitive Guide" Coming Soon

After a busy couple of months I've finished the writing for "Hadoop: The Definitive Guide". It's now going through the production process at O'Reilly.

You can pre-order it on Amazon and O'Reilly. You can also get the Rough Cuts version from O'Reilly to read today, although it hasn't yet been refreshed with my latest draft (I hope that will happen in the next few days).

Here's the final chapter listing. Readers of earlier drafts will notice that the number of chapters has grown: this is because the elephantine MapReduce chapter has been split into three (chapters 6, 7, and 8) to make things more digestible.

  1. Meet Hadoop
  2. MapReduce
  3. The Hadoop Distributed Filesystem
  4. Hadoop I/O
  5. Developing a MapReduce Application
  6. How MapReduce Works
  7. MapReduce Types and Formats
  8. MapReduce Features
  9. Setting Up a Hadoop Cluster
  10. Administering Hadoop
  11. Pig
  12. HBase
  13. ZooKeeper
  14. Case Studies
The writing's done but I still have to package up the example code. I'll be doing this soon, and it will appear on the book's website.

Tuesday, 27 January 2009

Draft Pig Chapter

A couple of quick updates on the Hadoop book I'm writing. The Pig chapter is now available on Safari. It still has a few holes, but I'd love to hear feedback on it.

Also included is a Hadoop case study from Last.fm. Thanks to Adrian Woodhead and Marc de Palol for writing it.

Thursday, 20 November 2008

Hadoop Developer Zeitgeist

The Cloudera team have just released a website which has a few reports on various Hadoop development metrics. I like the Most Watched Open Jira Issues, as it gives a good summary of what Hadoop Core developers are thinking about.

Personally, I can't wait for the new MapReduce API (HADOOP-1230), which is currently the third most watched issue.

Thursday, 16 October 2008

Cloudera

I'm pleased to announce that I've joined Cloudera, a new startup providing support for Hadoop. Amr Awadallah (who's one of the founders) has got more details in his blog post.

Tuesday, 16 September 2008

Hadoop: The Definitive Guide

Update: Fixed feedback link.

The Rough Cut of Hadoop: The Definitive Guide is now up on O'Reilly's site. There are a few chapters available already, at various stages of completion. Remember, it's still pretty rough. I'd love to hear any suggestions for improvements that you may have though. You can submit feedback from Safari where the book is hosted. As the Rough Cuts FAQ explains, I'd like feedback on missing topics, if something is not understandable, and technical mistakes.

Now I just need to go and write the rest of it...

Thursday, 4 September 2008

Hosting Large Public Datasets on Amazon S3

Update: I just thought of a quick and dirty way of doing this: just store your content on an extra large EC2 instance (holds up to 1690GB) and make the image public. Anyone can access it using their EC2 account, you just get charged for hosting the image.

There's a great deal of interest in large, publicly available datasets (see, for example, this thread from theinfo.org), but for very large datasets it is still expensive to provide the bandwidth to distribute them. Imagine if you could get your hands on the data from a large web crawl, the kind of thing that the Internet Archive produces. I'm sure people would discover some interesting things from it.

Amazon S3 is an obvious choice for storing data for public consumption, but while the cost for storage may be reasonable, the cost for transfer can be crippling since the cost is not under the control of the data provider, being incurred for each transfer (which is initiated by the user).

For example, consider a 1TB dataset. With storage running at $0.15 per GB per month this works out at around $150 per month to host. With transfer costs costing $0.18 per GB, this dataset costs around $180 for each transfer out of Amazon! It's not surprising large datasets are not publicly hosted on S3.

However, transferring data between S3 and EC2 is free, so could we limit transfers from S3 so they are only possible to EC2? You (or anyone else) could run an analysis on EC2 (using Hadoop, say) and only pay for the EC2 time. Or you could transfer it out of EC2 at your own expense. S3 doesn't support this option directly, but it is possible to emulate it with a bit of code.

The idea (suggested by Doug Cutting) is to make objects private on S3 to restrict access generally, then run a proxy on EC2 that is authorized to access the objects. The proxy only accepts connections from within EC2: any client that is outside Amazon's cloud is firewalled out. This combination ensures only EC2 instances can access the S3 objects, thus removing any bandwidth costs.

Implementation

I've written such a proxy. It's a Java servlet that uses the JetS3t library to add the correct Amazon S3 Authorization HTTP header to gain access to the owner's objects on S3. If the proxy is running on the EC2 instance with hostname ec2-67-202-43-67.compute-1.amazonaws.com, for example, then a request for
http://ec2-67-202-43-67.compute-1.amazonaws.com/bucket/object
is proxied to the protected object at
http://s3.amazonaws.com/bucket/object
To ensure that only clients on EC2 can get access to the proxy I set up an EC2 security group (which limits access to port 80):
ec2-add-group ec2-private-subnet -d "Group for all Amazon EC2 instances."
ec2-authorize ec2-private-subnet -p 80 -s 10.0.0.0/8
Then by launching the proxy in this group, only machines on EC2 can connect. (Initially, I thought I had to add public IP addresses to the group -- which, incidentally, I found in this forum posting -- but this is not necessary as the public DNS name of an EC2 instance resolves to the private IP address within EC2.) The AWS credentials to gain access to the S3 objects are passed in the user data, along with the hostname of S3:
ec2-run-instances -k gsg-keypair -g ec2-private-subnet \
-d "<aws_access_key> <aws_secret_key> s3.amazonaws.com" ami-fffd1996
This AMI (ID ami-fffd1996) is publicly available, so anyone can use it by using the commands shown here. (The code is available here, under an Apache 2.0 license, but you don't need this if you only intend to run or use a proxy.)

Demo

Here's a resource on S3 that is protected: http://s3.amazonaws.com/tiling/private.txt. When you try to retrieve it you get an authorization error:
% curl http://s3.amazonaws.com/tiling/private.txt
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>57E370CDDD9FE044</RequestId>
<HostId>dA+9II1dYAjPE5aNsnRxhVoQ5qy3KCa6frkLg3SyTwzP3i2SQNCU534/v8NXXEnN</HostId>
</Error>
With a proxy running, I still can't retrieve the resource via the proxy from outside EC2. It just times out due to the firewall rule:
% curl http://ec2-67-202-56-11.compute-1.amazonaws.com/tiling/private.txt
curl: (7) couldn't connect to host
But it does works from an EC2 machine (any EC2 machine):
% curl http://ec2-67-202-56-11.compute-1.amazonaws.com/tiling/private.txt
secret

Conclusion

By running a proxy on EC2, at 10 cents per hour (small instance) - or $72 a month, you can allow folks using EC2 to access your data on S3 for free. While running the proxy is not free, it is a fixed cost that might be acceptable to some organizations, particularly those that have an interest in making data publicly available (but can't stomach large bandwidth costs).

A few questions:
  • Is this useful?
  • Is there a better way of doing it?
  • Can we have this built into S3 (please, Amazon)?

Saturday, 23 August 2008

Elastic Hadoop Clusters with Amazon's Elastic Block Store

I gave a talk on Tuesday at the first Hadoop User Group UK about Hadoop and Amazon Web services - how and why you can run Hadoop with AWS. I mentioned how integrating Hadoop with Amazon's "Persistent local storage", which Werner Vogels had pre-announced in April, would be a great feature to have to enable truly elastic Hadoop clusters that you could stop and start on demand.

Well, the very next day Amazon launched this service, called Elastic Block Store (EBS). So in this post I thought I'd sketch out how an elastic Hadoop might work.

A bit of background. Currently there are three main ways to use Hadoop with AWS:

1. MapReduce with S3 source and sink


In this set up, the data resides on S3, and the MapReduce daemons run on a temporary EC2 cluster for the duration of the job run. This works, and is especially convenient if you've already store your data on S3, but you don't get any data locality. Data locality is what enables the magic of MapReduce to work efficiently - the computation is scheduled to run on the machine where the data is stored, so you get huge savings in not having to ship terabytes of data around the network. EC2 does not share nodes with S3 storage, in fact they are often in different data centres, so performance is nowhere near as good as a regular Hadoop cluster where the data in stored in HDFS (see 3. below).

It's not all doom and gloom, as the bandwidth between EC2 and S3 is actually pretty good, as Rightscale found when they did some measurements.

2. MapReduce from S3 with HDFS staging


Data is stored on S3 but copied to a temporary HDFS cluster running on EC2. This is just a variation of the previous set-up, which is good if you want to run several jobs against the same input data. You save by only copying the data across the network once, but you pay a little more due to HDFS replication.

The bottleneck is still copying the data out of S3. (Copying the results back into S3 isn't usually as bad as the output is often an order or two of magnitude smaller than the input.)

3. HDFS on Amazon EC2


Of course, you could just run a Hadoop cluster on EC2 and store your data there (and not in S3). In this scenario, you are committed to running your EC2 cluster long term, which can prove expensive, although the locality is excellent.

These three scenarios demonstrate that you pay for locality. However, there is a gulf between S3 and local disks that EBS fills nicely. EBS does not have the bandwidth of local disks, but it's significantly better than S3. Rightscale again:

The bottom line though is that performance exceeds what we’ve seen for filesystems striped across the four local drives of x-large instances.

Implementing Elastic Hadoop


The main departure from the current Hadoop on EC2 approach is the need to maintain a map from storage volume to node type: i.e. we need to remember which volume is a master volume (storing the namenode's data) and which is a worker volume (storing the datanode's data). It would be nice if you could just start up EC2 instances for all the volumes, and have them figure out which is which, but this might not work as the master needs to be started first so its address can be given to the workers in their user data. (This choreography problem could be solved by introducing ZooKeeper, but that's another story.) So for a first cut, we could simply keep two files (locally, or on S3 or even SimpleDB) called master-volumes, and worker-volumes, which simply list the volume IDs for each node type, one per line.

Assume there is one master running the namenode and jobtracker, and n worker nodes each running a datanode and tasktracker.

To create a new cluster
  1. Create n + 1 volumes.
  2. Create the master-volumes file and write the first volume ID into it.
  3. Create the worker-volumes file and write the remaining volume IDs to it.
  4. Follow the steps for starting a cluster.
To start a cluster
  1. Start the master instance passing it the master-volumes as user data. On startup the instance attaches to the volume it was told to. It then formats the namenode if it isn't already formatted, then starts the namenode, secondary namenode and jobtracker.
  2. Start n worker instances passing it the worker-volumes as user data. On startup each instance attaches to the volume on line i, where i is the ami-launch-index of the instance. Each instance then starts a datanode and tasktacker.
  3. If any worker instances failed to start then launch them again.
To shutdown a cluster
  1. Shutdown the Hadoop cluster daemons.
  2. Detach the EBS volumes.
  3. Shutdown the EC2 instances.
To grow a cluster
  1. Create m new volumes, where m is the size to grow by.
  2. Append the m new volume IDs to the worker-volumes file.
  3. Start m worker instances passing it the worker-volumes as user data. On startup each instance attaches to the volume on line n + i, where i is the ami-launch-index of the instance. Each instance then starts a datanode and tasktacker.
Future enhancements: attach multiple volumes for performance/storage growth on the namenode, or resilience on the namenode; integrate the secondary namenode backup facility with EBS snapshots to S3; provide tools for managing the worker-volumes file (for example, integrating with datanode decommissioning).

Building this would be a great project to work on - I hope someone does it!