Problems worthy of attack prove their worth by hitting back. —Piet Hein

Thursday, 10 May 2012

Volcanoes!

I've just finished reading "Super Volcano: The Ticking Time Bomb Beneath Yellowstone National Park" by Greg Breining. Despite the hyperbolic title, it's a really good introduction to the subject. Actually, the title is entirely appropriate, since the previous Yellowstone eruption around 600,000 years ago was one thousand times as powerful as the 1980 Mount St. Helens eruption. And it's likely to erupt again, but no one knows when.

We've been on a bit of a volcano tour recently. First we visited Lassen Volcanic National Park in October (climbing the Cinder Cone was a highlight), and we stopped in at the Mount St. Helens visitor center on our way to Seattle last month. Yesterday we ventured into the Yellowstone caldera (the bit that blew out in the last eruption).

Before reading the book I hadn't appreciated how recent our understanding of Yellowstone's geology is. It was only in the 1960s that scientists combined new empirical data about the ages of different rock formations in the park with the then emerging theory of plate tectonics. One of the scientists was Robert Christiansen of the U.S. Geological Survey, who, with Richard Blank, collected samples from all over Yellowstone and pieced together the puzzle of how Yellowstone formed. (He also wrote the definitive account of Yellowstone's geology in 2001.)

They realized that the series of calderas between Oregon and Wyoming was produced by eruptions of what is now known as the Yellowstone hotspot over the last 16 million years. The continental plate is moving southwest over the hotspot, which makes the newer volcanoes appear in the northeast.

This diagram from Wikipedia summarizes it nicely:


Saturday, 4 June 2011

What's new in Apache Whirr 0.5.0-incubating

Apache Whirr 0.5.0-incubating is now available. Whirr is a library and command line interface for running distributed services like Apache Hadoop in the cloud. Note that Whirr is currently undergoing Incubation at the Apache Software Foundation, which means that, in particular, the project has yet to be fully endorsed by the ASF. Please read the full disclaimer.

In this release the Whirr development team have added many new features while still making the core more solid. This post covers some of the more important changes. The full list can be found in the release notes.

Improving the new user experience

Orchestrating multiple services on cloud instances is a challenge to make simple, and Whirr has sometimes been a little fiddly to get running. SSH settings, in particular, have been a common sticking point with new users. The new Whirr in 5 Minutes guide walks through the minimum number of commands you need to type to get a simple 3-node ZooKeeper cluster running in a few minutes. From there you can move on to the Quick Start Guide and the Configuration Guide.

The sample configurations in the recipes directory in the distribution contain useful settings for running the services on a variety of cloud providers. Users are always encouraged to share their working configurations with the community.

New services

Elastic Search and Voldemort have been added to the roster of services that come with Whirr. This brings the total to six, alongside Apache Cassandra, Apache Hadoop, Apache HBase, and Apache ZooKeeper.

API improvements

Whirr is still a young project, so it is not surprising that its API is rapidly evolving. In WHIRR-245, the demarcation between the user API (for users who control Whirr clusters from Java) and the service API (for developers writing new Whirr services) was clarified. The user API can be found in the org.apache.whirr package, whereas the service API is in org.apache.whirr.service.

You can find out more about writing Whirr services in this presentation (PDF).

The firewall API that service writers use to open ports for services was simplified and made more powerful in WHIRR-275.

Overriding scripts

This feature was actually introduced in Whirr 0.4.0-incubating, but it's useful enough to mention here. In older versions of Whirr, if you wanted to make a modification to the scripts that run on cloud instances - to tweak some settings, for instance - you would have to upload your modifications (as well as all the other scripts) to a publicly available web server (Amazon S3 was a common choice), then point Whirr at the new location. Not particularly difficult, but a big enough barrier to discourage users from trying it.

The new approach is to push scripts to nodes from the launching machine, so you can just edit them locally before launch. Full instructions are covered in the FAQ.

Running scripts on nodes

In 0.5.0 the scripts that run on cloud instances have been broken up to be more fine-grained, so many services have individual start and stop scripts (WHIRR-266). Combined with the ability to run scripts on sets of nodes in the cluster (by ID or role), users now have more control of the cluster once it has launched (WHIRR-173). Try running whirr run-script at the command line to use this feature. There's a contrib script to run the Yahoo! Cloud Serving Benchmark (YCSB) against an HBase cluster, which takes advantage of the run-script command (WHIRR-287).

Also useful is WHIRR-291, which allows you to launch "blank" nodes with no services running on them (in a "noop" role), and then, with whirr run-script, run arbitrary scripts on them to bring them into the state you want.

Custom service builds

Developers who work on services supported in Whirr will find the ability to push a custom build to a cluster very useful for testing (WHIRR-220). For example, if you are working on a ZooKeeper feature, you can build a ZooKeeper tarball with your new feature, then launch a cluster that uses this tarball by specifying whirr.zookeeper.tarball.url as a local file:// URL pointing to your tarball. Whirr will push the tarball to a temporary blob store container, then each node will download from there.

I used a variation of this feature to try out a nightly Hadoop 0.22 build on a small Whirr cluster. In this case the tarball URL is not a local file, so Whirr doesn't copy the tarball to a blob store since it is already accessible from the cloud.

Service improvements

Whirr is only able to exist because of the powerful abstraction that jclouds provides for interacting with cloud providers. A great example of this power is the API that jclouds provides for discovering the hardware capabilities of an instance running on any provider. WHIRR-282 took advantage of the jclouds API to find the number of cores on a node to dynamically configure the number of slots in a Hadoop cluster. Previously, you had to set this manually for each cluster to take full advantage of larger image sizes.

This is just the beginning - there is more work to use memory capabilities to set configuration (WHIRR-229), and to use hardware capabilities generally in services other than Hadoop.

Cluster state storage

In previous releases of Whirr, information about launched instances was stored in a file on the machine that launched the cluster (~/.whirr/<cluster-name>/instances). With WHIRR-288, it's now possible to store this information in a blob store instead (such as Amazon S3, although any jclouds-supported blob store can be used), which is useful if you want to control clusters from multiple machines.

Bring Your Own Nodes

Or just BYON, for short. Many users have requested the ability to deploy to privately owned hardware - and jclouds added this feature in 1.0-beta-9. Whirr now has preliminary support for BYON clusters. In a nutshell, you write a YAML file enumerating the nodes to deploy to - their addresses, access credentials, etc. - then Whirr will start services on them. The nodes just need to have a base OS like CentOS or Ubuntu installed. You can find an example BYON configuration in the recipes directory of the download.

BYON is also useful for testing locally by using VMware or VirtualBox to host target nodes.

A hummingbird


Last, but not least, Whirr finally has a logo! Many thanks to Alison Wong, who designed it and donated it to the ASF.

Credits

I would like to thank everyone who helped with the 0.5.0-incubating release. We have a growing community, and we welcome feedback and help from new users and developers. If you'd like to get involved you can start by downloading the new release and joining us on the mailing lists.

What's next?

It's difficult to make firm predictions about the contents of the next release since Whirr is an open source project with many open issues, but the general themes include:
  • Adding more services. In tandem, we want to make it easier to write new services by pushing common patterns into the core (WHIRR-326 is one example of this).
  • Improving existing services by making them more flexible, better configured, and easier to manage.
  • Adding more cloud providers. The latest release of jclouds supports 30 providers, and we need help testing more of them with Whirr.
  • Implementing services using other configuration management tools, rather than bash scripting. Andrei Savu is working on using Puppet to write new services (WHIRR-255).
  • Supporting elastic clusters, so new nodes can be added to running clusters (WHIRR-214).

Monday, 9 May 2011

Do Donors Choose Local Schools?

DonorsChoose.org is a site where people donate money to school projects. For example, a teacher in Iowa might create a project request for some beanbags to create a reading area for her pupils. Then, via the website, donors can give as much or as little as they like to the project, and once the target is reached DonorsChoose purchase and deliver the beanbags to the school.

DonorsChoose are running a contest. They have opened up their data, and are challenging developers to "make discoveries and build apps that improve education in America".

I thought I'd do a little hack to answer the question "Do donors tend to choose their local schools?"

I wrote a short Python program to calculate the distance between each donor's address (where it was provided) and the address of the school for the project they were donating to. Then, using R, I plotted the following histogram:



It's striking that many donors are local. In fact, in my analysis, one in four donors live within four miles of the school they are donating to, and the median distance is 128 miles. However, there is a long tail reaching to over 5000 miles!

If we use a logarithmic scale for the y-axis (count), then a couple of features jump out. This plot is a scatter plot where counts are bucketed by integer distance.



There is a small peak at around 2500 miles, which is puzzling until you realize that this is the approximate distance between the East Coast and West Coast of the USA, where the majority of the population is located. I'm guessing that this bump corresponds to people who donate to schools of friends and relatives on the other coast.

The other noticeable feature is the significant drop-off after 2500 miles. The small number of donations beyond this distance are those where the donor or school is located in the non-contiguous states (Alaska and Hawaii), which have only a small fraction of the total population.

How I produced the images

I wrote a Python program to parse the CSV data from DonorsChoose. It reads two data files - the projects file and the donations file. The files are joined by the project ID field, which means we can access the school ZIP code (from the projects file), and the partial ZIP code of the donor (from the donations file). The donor's ZIP code is optional (and was actually only present in 46% of donations, so the results are restricted to this subset of donations). Also, for privacy reasons, only the first 3 digits of the donor's ZIP code are provided by DonorsChoose. This makes the distance measurements less accurate, particularly for local donors.

In the case of the partial ZIP code matching the school ZIP code, I set the distance to zero, on the assumption that the donor lives close to the school. This assumption will tend to overcount the zero distance case, and undercount small distances.

If the partial ZIP code did not match the school ZIP code, I chose a ZIP code with that prefix at random and calculated the distance between that ZIP code and the school's ZIP code. For this calculation I used Kevin T. Ryan's Python code at ActiveState, which I modified slightly to support partial ZIP codes.

The program buckets the distances by integer value and writes the counts to a file. I then used R to plot the distributions shown above.
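The steps described above can be sketched roughly as follows. This is not the code from my repository, nor Kevin T. Ryan's: it is a minimal illustration that assumes a hypothetical zip_coords dictionary mapping 5-digit ZIP codes to (latitude, longitude) pairs, and it uses the haversine formula for the great-circle distance. The function names and the coordinates in the usage note are illustrative only.

```python
import math
import random
from collections import Counter

EARTH_RADIUS_MILES = 3958.8  # mean radius of the Earth


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def donor_distance(donor_prefix, school_zip, zip_coords, rng=random):
    """Distance from a 3-digit donor ZIP prefix to a school ZIP.

    zip_coords is a hypothetical lookup table {zip: (lat, lon)}.
    Returns None when either location cannot be resolved.
    """
    if school_zip not in zip_coords:
        return None
    # Matching prefix: assume the donor lives close to the school.
    if school_zip.startswith(donor_prefix):
        return 0.0
    # Otherwise pick a random ZIP code that shares the donor's prefix.
    candidates = sorted(z for z in zip_coords if z.startswith(donor_prefix))
    if not candidates:
        return None
    donor_zip = rng.choice(candidates)
    return haversine_miles(*zip_coords[donor_zip], *zip_coords[school_zip])


def bucket_counts(distances):
    """Count donations per whole-mile bucket, skipping unresolved ones."""
    return Counter(int(d) for d in distances if d is not None)
```

As a sanity check, calling haversine_miles with approximate coordinates for New York (40.71, -74.01) and San Francisco (37.77, -122.42) gives a distance in the region of 2500 to 2600 miles, which is where the coast-to-coast bump appears in the plot.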

I've put all my code into a GitHub repository.

This hack just scratches the surface of the dataset, and I look forward to seeing some of the cool things that others do in this contest. The closing date is June 30, 2011.

Saturday, 16 April 2011

Whirr in 5 Minutes

A couple of days ago I wrote down a sequence of command lines to install Apache Whirr (an incubator project for running distributed systems on various cloud providers) and run a service from scratch. You just need Java, SSH, and some cloud credentials (Amazon EC2 in this case). I've reproduced the commands here:

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
curl -O http://www.apache.org/dist/incubator/whirr/whirr-0.4.0-incubating/whirr-0.4.0-incubating.tar.gz
tar zxf whirr-0.4.0-incubating.tar.gz; cd whirr-0.4.0-incubating
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr

At this point you should have a 3-node ZooKeeper cluster running, which is easily checked with:

echo "ruok" | nc $(awk '{print $3}' ~/.whirr/zookeeper/instances | head -1) 2181; echo

You can shut down the cluster with the following command:

bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties

There are recipes for more services in the Whirr download package, and more detailed instructions in the Quick Start Guide.

Sunday, 28 November 2010

My favourite talk at Devoxx 2010

I went to Devoxx in Antwerp for the first time this year, and really enjoyed it. I didn't go to that many talks, but the quality seemed very high. My favourite talk was "Performance Anxiety" by Josh Bloch, because he's a great speaker and because he presented a single important idea so well.

The idea was this: determining the performance of programs should be treated as an empirical science. We should give up any hope (if any existed) that predicting a program's performance will become easier in the future, since every layer in the deep stack of a modern computer is becoming more complex. Increased complexity is actually the price we must pay for increased performance. And increased complexity leads, almost inevitably, to reduced predictability.

As an experimental demonstration, Josh ran a microbenchmark to sort an array of integers. (The demo actually failed to show what he wanted to show, but he assured us it had worked earlier... It's somehow reassuring when live demos don't work for Java demigods either.) Each invocation of the benchmark did a number of runs, and the timings of the runs converged on a stable value. However, between benchmark invocations, the stable values that they converged on varied by up to 20%.

The reason is subtle: the HotSpot compiler produces different compile plans on different runs, and these have different performance profiles. (This is explained in Cliff Click's 2009 JavaOne presentation, "The Art of (Java) Benchmarking".) They all converge on stable values, but different stable values for different runs. The fact that HotSpot is non-deterministic may not be particularly surprising, but Josh said that the same behaviour has been shown in C code and even assembler, since non-determinism exists at lower levels of the stack too.

The practical upshot is that we need to change how we iteratively benchmark code. No longer is it permissible to run a benchmark, make a change, run the benchmark again, see that the execution time was faster (even across a number of runs in one VM) and legitimately conclude that it was due to the change we made. We have to reach for statistical tools that tell us the improved execution time was significant after we have run enough VMs.

How many VMs? The short answer is "30"; the longer answer is in "Statistically Rigorous Java Performance Evaluation" by Andy Georges, Dries Buytaert, and Lieven Eeckhout.
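To make the statistical idea concrete, here is a minimal sketch in Python (not code from the talk or the paper). It assumes you have already collected one steady-state timing per VM invocation, and it computes an approximate 95% confidence interval for the mean using the normal approximation, which is reasonable with around 30 invocations; with fewer you would want a Student's t quantile instead of 1.96. The function names are my own.

```python
import math
import statistics


def confidence_interval(vm_means, z=1.96):
    """Approximate 95% confidence interval for the mean timing.

    vm_means holds one steady-state timing per VM invocation.
    Uses the normal approximation, so it assumes roughly 30 or
    more invocations.
    """
    n = len(vm_means)
    if n < 2:
        raise ValueError("need at least two VM invocations")
    mean = statistics.mean(vm_means)
    half_width = z * statistics.stdev(vm_means) / math.sqrt(n)
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if the two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

If the intervals for the "before" and "after" timings do not overlap, the change probably made a real difference. If they do overlap, the comparison is inconclusive and you need more invocations or a more careful test.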

Thankfully there is a Java framework called Caliper which can help you run microbenchmarks and which even plots the error bars for you. This stuff needs to see wider adoption in the industry.