Showing posts with label Hadoop. Show all posts
Showing posts with label Hadoop. Show all posts

Thursday, 3 July 2008

Hadoop beats terabyte sort record

Hadoop has beaten the record for the terabyte sort benchmark, bringing it from 297 seconds to 209. Owen O'Malley wrote the MapReduce program (which by the way has a clever partitioner to ensure the reducer outputs are globally sorted and not just sorted per output partition, which is what the default sort does), and then ran it on 910 nodes on Yahoo!'s cluster. There are more details in Owen's blog post (and there's a link to the benchmark page which has a PDF explaining his program). You can also look at the code in trunk.

Well done Owen and well done Hadoop!

Friday, 20 June 2008

Hadoop Query Languages

If you want a high-level query language for drilling into your huge Hadoop dataset, then you've got some choice:
  • Pig, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files.
  • Jaql, from IBM and soon to be open sourced, is a declarative query language for JSON data.
  • Hive, from Facebook and soon to become a Hadoop contrib module, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.
All three projects have different strengths, but there is plenty of scope for collaboration and cross-pollination, particularly in the query language. For example, at the Hadoop Summit in March, Joydeep Sen Sarma of Facebook said that they would be receptive to users who wanted to use Pig Latin or Jaql in Hive. And Kevin Beyer of IBM Research said that Pig and Jaql are converging, and they've had discussions with the Pig team about how to bring them even closer together.

Meanwhile, to learn more I recommend Pig Latin: A Not-So-Foreign Language for Data Processing (by Chris Olston et al), and the slides and videos from the Hadoop Summit.

(And I haven't even included Cascading, from Chris K. Wensel, which, while not a query language per se, is an abstraction built on MapReduce for building data processing flows in Java or Groovy using a plumbing metaphor with constructs such as taps, pipes, and flows. Well worth a look too.)

Friday, 13 June 2008

"The Next Big Thing"

James Hamilton on The Next Big Thing:
Storing blobs in the sky is fine but pretty reproducible by any competitor. Storing structured data as well as blobs is considerably more interesting but what has even more lasting business value is the storing data in the cloud AND providing a programming platform for multi-thousand node data analysis. Almost every reasonable business on the planet has a complex set of dimensions that need to be optimized.

I think we're only beginning to see interesting data processing being done in the cloud - there's much more to come.

Tuesday, 22 April 2008

Portable Cloud Computing

Last July I asked "Why are there no Amazon S3/EC2 competitors?", lamenting the lack of competition in the utility or cloud computing market and the implications for disaster recovery. Closely tied to disaster recover is portability -- the ability to switch between different utility computing providers as easily as I switch electricity suppliers. (OK, that can be a pain, at least in the UK, but it doesn't require rewiring my house.)

It's useful to compare Amazon Web Services with Google's recently launched App Engine in these terms. In some sense they compete, but they are strikingly different offerings. Amazon's services are bottom up: "here's a CPU, now install your own software". Google's is top down: "write some code to these APIs and we'll scale it for you". There's no way I can take my EC2 services and run them on App Engine. But I can do the reverse -- sort of -- thanks to AppDrop.

And that's the thing. What I would love is a utility service from different providers that I can switch between. That's one of the meanings of "utility", after all. (Moving lots of data between providers may make this difficult or expensive to do in practice -- "data inertia" -- but that's not a reason not to have the capability.)

There are at least two ways this can happen. One's the AppDrop approach -- emulate Google's API and provide an alternative place to run applications, in this case it's EC2.

However, there's another way: build "standard, non-proprietary cloud APIs with open-source implementations", as Doug Cutting says on his blog post Cloud: commodity or proprietary? In this world, applications use a common API, and host with whichever provider they fancy. Bottom up offerings like Amazon's facilitate this approach: the underlying platforms may differ, but it's easy to run your application on the provided platform -- for example, by building an Amazon AMI. Google's top down approach is not so amenable, application developers are locked-in to the APIs Google provide. (Of course, Google may open this platform up more over time, but it remains to be seen if they will open it up to the extent of being able to run arbitrary executables.)

As Doug notes, Hadoop is providing a lot of the building blocks for building cloud services: filesystem (HDFS), database (HBase), computation (MapReduce), coordination (ZooKeeper). And here, perhaps, is where the two approaches may meet -- AppDrop could be backed by HBase (or indeed Hypertable), or HBase (and Hypertable) could have an open API which your application could use directly.

Rails or Django on HBase, anyone?

Monday, 14 April 2008

Hadoop at ApacheCon Europe

On Friday in Amsterdam there was a lot of Hadoop on the menu at ApacheCon. I kicked it off at 9am with A Tour of Apache Hadoop, Owen O'Malley followed with Programming with Hadoop’s Map/Reduce, and Allen Wittenauer finished off after lunch with Deploying Grid Services using Apache Hadoop. Find the slides on the Hadoop Presentations page of the wiki. I've also embedded mine below.

I only saw half of Allen's talk as I had to catch my plane, but I was there long enough to see his interesting choice of HDFS users... :)



Also at ApacheCon I enjoyed meeting the Mahout people (Grant, Karl, Isabel and Erik), seeing Cliff Schmidt's keynote, and generally meeting interesting people.

Saturday, 22 March 2008

Learning MapReduce

Update: I've posted my answers to the exercises. Let me know if you find any mistakes. Also: Tamara Petroff has posted a write up of the session.

On Wednesday [19 March], I ran a session at SPA 2008 entitled "Understanding MapReduce with Hadoop". SPA is a very hands-on conference, with many sessions having a methodological slant, so I wanted to get people who had never encountered MapReduce before actually writing MapReduce programs. I only had 75 minutes, so I decided against getting people coding on their laptops. (In hindsight this was a good decision, as I went to several other sessions where we struggled to get the software installed.) Instead, we wrote MapReduce programs on paper, using a simplified notation.

It seemed to work. For the first half hour, I gave as minimal an introduction to MapReduce as I could, then the whole group spent the next half hour working in pairs to express the solutions to a number of exercises as MapReduce programs. We spent the last 15 minutes comparing notes and discussing some of the solutions to the problems.

There were six exercises, presented in rough order of difficulty, and I'm pleased to say that every pair managed to solve at least one. Here's some of the feedback I got:
  • Some struggled to know what input data formats to use. Perhaps I glossed over this too much - I didn't want people to worry about precisely how the data was encoded - but I could have emphasised more that you can have the data presented to your map function in any way that's convenient.
  • While most people understood the notation I used for writing the map and reduce functions, it did cause some confusion. For example, someone wanted to see the example code again so they could understand what was going on. And another person said it took a while to realise that they could do arbitrary processing as a part of the map and reduce functions. It would be interesting to do the session again but using Java notation.
  • It was quite common for people to try to do complex things in their map and reduce functions - they felt bad if they just used an identity function, because it was somehow a waste. And on a related note, chaining map reduce jobs together wasn't obvious to many. But once pointed out, folks had an "aha!" moment and were quick to exploit it.
  • The fact that you typically get multiple reduce outputs prompted questions from some - "but how do you combine them into a single answer?". Talking about chained MapReduce helped here again.
  • Everyone agreed that it wasn't much like functional programming.
You can find the slides on the Hadoop wiki. They include the six exercises, which I've reproduced below, in rough order of difficulty. (I'll post my answers next week.)
  1. Find the [number of] hits by 5 minute timeslot for a website given its access logs.
  2. Find the pages with over 1 million hits in day for a website given its access logs.
  3. Find the pages that link to each page in a collection of webpages.
  4. Calculate the proportion of lines that match a given regular expression for a collection of documents.
  5. Sort tabular data by a primary and secondary column.
  6. Find the most popular pages for a website given its access logs.
Is this a good list of exercises? Do you have any exercises that you've found useful for learning MapReduce?

Finally, thanks to Robert Chatley for being a guinea pig for the exercises, and for helping out on the day with participants' questions during the session.

Tuesday, 18 March 2008

"Disks have become tapes"

MapReduce is a programming model for processing vast amounts of data. One of the reasons that it works so well is because it exploits a sweet spot of modern disk drive technology trends. In essence MapReduce works by repeatedly sorting and merging data that is streamed to and from disk at the transfer rate of the disk. Contrast this to accessing data from a relational database that operates at the seek rate of the disk (seeking is the process of moving the disk's head to a particular place on the disk to read or write data).

So why is this interesting? Well, look at the trends in seek time and transfer rate. Seek time has grown at about 5% a year, whereas transfer rate at about 20% [1]. Seek time is growing more slowly than transfer rate - so it pays to use a model that operates at the transfer rate. Which is what MapReduce does. I first saw this observation in Doug Cutting's talk, with Eric Baldeschwieler, at OSCON last year, where he worked through the numbers for updating a 1 terabyte database using the two paradigms B-Tree (seek-limited) and Sort/Merge (transfer-limited). (See the slides and video for more detail.)

The general point was well summed up by Jim Gray in an interview in ACM Queue from 2003:
... programmers have to start thinking of the disk as a sequential device rather than a random access device.
Or the more pithy: "Disks have become tapes." (Quoted by David DeWitt.)

But even the growth of transfer rate is dwarfed by another measure of disk drives - capacity, which is growing at about 50% a year. David DeWitt argues that since the effective transfer rate of drives is falling we need database systems that work with this trend - such as column-store databases and wider use of compression (since this effectively increases the transfer rate of a disk). Of existing databases he says:
Already we see transaction processing systems running on farms of mostly empty disk drives to obtain enough seeks/second to satisfy their transaction processing rates.
But this applies to transfer rate too (or if it doesn't yet, it will). Replace "seeks" with "transfers" and "transaction processing" with "MapReduce" and I think over time we'll start seeing Hadoop installations that choose to use large numbers of smaller capacity disks to maximize their processing rates.

[1] See Trends in Disk Technology by Michael D. Dahlin for changes between 1987-1994. For the period since then these figures still hold - as it's relatively easy to check using manufacturer's data sheets, although with seek time it's harder to tell since the definitions seem to change from year to year and from manufacturer to manufacturer. Still, 5% is generous.

Sunday, 2 March 2008

MapReduce without the Reduce

There's a class of MapReduce applications that use Hadoop just for its distributed processing capabilities. Telltale signs are:

1. Little or no input data of note. (Certainly not large files stored in HDFS.)
2. Map tasks are therefore not limited by their ability to consume input, but by their ability to run the task, which depending on the application may be CPU-bound or IO-bound.
3. Little or map output.
4. No reducers (set by conf.setNumReduceTasks(0)).

This seems to work well - indeed the CopyFiles program in Hadoop (aka distcp) follows this pattern to efficiently copy files between distributed filesystems:

1. The input to each map task is a source file and a destination.
2. The map task is limited by its ability to copy the source to the destination (IO-bound).
3. The map output is used as a convenience to record files that were skipped.
4. There are no reducers.

Combined with Streaming this is a neat way to distribute your processing in any language. You do need a Hadoop cluster, it is true, but CPU-intensive jobs would happily co-exist with more traditional MapReduce jobs, which are typically fairly light on CPU usage.

Friday, 1 February 2008

Apache Incubator Proposal for Thrift

There's a proposal for Thrift to go into the Apache Incubator. This seems to me to be a good move - there's increasing interest in Thrift - just look at the number of language bindings that have been contributed: Cocoa/Objective C, C++, C#, Erlang, Haskell, Java, OCaml, Perl, PHP, Python, Ruby, and Squeak at the last count. It's even fairly painless to compile on Mac OS X now, although it'd be nice to have a Java version of the compiler.

Also, there are some nice synergies with other Apache projects - it is already being used in HBase, and there are moves to make it easier to use in Hadoop Core as a serialization format (so MapReduce jobs can consume and produce Thrift-formatted data).

If the proposal is accepted it will be interesting to see what happens to Hadoop's own language-neutral record serialization package, Record I/O. The momentum is certainly with Thrift and discussions on the mailing list suggest that stuff will eventually be ported to use Thrift.

Wednesday, 30 January 2008

Hadoop and Log File Analysis

I've always thought that Hadoop is a great fit for analyzing log files (I even wrote an article about it). The big win is that you can write ad hoc MapReduce queries against huge datasets and get results in minutes or hours. So I was interested to read Stu Hood's recent post about using Hadoop to analyze email log data:
Here at Mailtrust, Rackspace’s mail division, we are taking advantage of Hadoop to help us wrangle several hundred gigabytes of email log data that our mail servers generate each day. We’ve built a great tool for our support team that lets them search mail logs in order to troubleshoot problems for customers. Until recently, this log search and storage system was centered around a traditional relational database, which worked fine until the exponential growth in the volume of our dataset overcame what a single machine could cope with. The new logging backend we’ve developed based on Hadoop gives us virtually unlimited scalability.
The best bit was when they wrote a MapReduce query to find the geographic distribution of their users.
This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.
It's great when a technology has the ability to make such a positive contribution to your business. In Doug Cutting's words, it is "transformative".

Can we take this further? It seems to me that there is a gap in the market for an open source web traffic analysis tool. Think Google Analytics where you can write your own queries. I wonder who's going to build such a thing?

Wednesday, 16 January 2008

Hadoop is now an Apache Top Level Project

Doug Cutting just reported on the Hadoop lists that the Apache board voted this morning (US time) to make Hadoop a TLP. Until now it has been a Lucene subproject, which made sense when Hadoop was broken out from the Nutch codebase two years ago. Since then Hadoop has grown dramatically. This change will make it possible for a number of associated projects - such as HBase and Pig - to become Hadoop subprojects in their own right. (There are more details in the original discussion.)

Thanks for looking after us Lucene - it's been a great time so far and we'll keep in touch!

Sunday, 13 January 2008

MapReduce, Map Reduce, Map/Reduce or Map-Reduce?

Although I've seen the other variants (and used some of them myself), Google call it "MapReduce", so that seems like the right thing to call it to me, since they invented it. The usage figures seem to back up this conclusion. "MapReduce" (no space) has 87,000 Google hits, while "Map Reduce" (with space) has only 50,200, and the latter includes "Map/Reduce" and "Map-Reduce" variants, since Google (and search engines in general) ignore punctuation.

In this age of case sensitivity and camel case one has to watch out for these things. In fact, I've only just realised that the Hadoop database is called "HBase", not "Hbase". The curse of wiki names. And the logo doesn't help either - it's all caps!

Monday, 7 January 2008

Casual Large Scale Data Processing

I think Greg Linden hits the nail on the head when he says of MapReduce at Google:
What is so remarkable about this is how casual it makes large scale data processing. Anyone at Google can write a MapReduce program that uses hundreds or thousands of machines from their cluster. Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time.
I think this is a good way of looking at what Hadoop is trying to achieve - to make large scale data processing as natural as small scale data processing. Hadoop can provide infrastructure to do this, but there is also a need for open datasets that are MapReducable. Imagine if I could run a MapReduce program against Google's web crawl database, Amazon's sales data or Facebook's social graph.

Sunday, 14 October 2007

Moving to Wales

Next week we are moving to the Brecon Beacons in Wales. This is a big change for us: I've been living in London for the past 11 years and Eliane for the last 17. But in some ways it's not so big as we both grew up in small towns. We're renting a barn on a hill, and Eliane is already planning the veg patch.

Work-wise I shall be leaving Kizoom at the end of November. I am one of the three original members (with Nick Knowles and Mike Storey) of the team that started Kizoom in September 1999, in Nick's library (Islington people have libraries and garages).

After that I shall be working freelance and specializing in Hadoop. The plan is to work from home as much as possible, but I'm realistic about the need to travel to make this happen. So - if you need help with a Hadoop-related project, please get in touch.

Monday, 6 August 2007

Tim O'Reilly on Hadoop

Tim O'Reilly has a written a nice piece about Hadoop, following its starring role at OSCON. I was tickled pink to see that he linked to my article about running Hadoop on Amazon EC2 to do log analysis!

Also, from the comments: Google has some courseware on MapReduce and Hadoop. Everyone seems to love Hadoop at the moment.

Saturday, 28 July 2007

Hadoop at OSCON 2007

I wasn't there, but Hadoop had two airings at OSCON this week. Doug Cutting was a part of the Open Source Radar Executive Briefing with Tim O'Reilly to talk about scaling.

He also gave a talk, with Eric Baldeschwieler, entitled "Meet Hadoop" where he gave a great exposition of the problem that Hadoop is designed to solve. In short: disk seeks are expensive, so databases built using sort-merge, which is limited by transfer speed not seek speed, scale better than traditional B-tree databases, which are limited by seek speed. More details and examples on the slides.

Eric's half of the talk gave some interesting tidbits about how Hadoop is being used at Yahoo! For example, they are running Hadoop on about 10,000 machines, and the biggest cluster is 1600 machines! With these kind of numbers I can see how Nigel Daley came to coin Nigel's Law:

In a large enough cluster, there are NO corner cases

Thursday, 19 July 2007

Articles and blogs

Coinciding nicely with Jakob Nielson's admonition Write Articles, Not Blog Postings (via Steve Loughran) here's my latest article, Running Hadoop MapReduce on Amazon EC2 and Amazon S3, published on Amazon Web Services Developer Connection.

It's almost a year and a half since I wrote my last article and one of the reasons is because I've blogged more. Not much, but more.

Jakob's got a point about being directed to short, old, irrelevant blog postings when you're looking for something. He's not saying don't blog, but rather try to write stuff that will be long-lived. Definitely something to aim for for tech writers - but not at the cost of not writing anything (so it's OK to sprinkle blogs with lighter weight stuff).

Sunday, 15 July 2007

Hadoop Development Steadily Rising

Judging by this graph showing posts to the dev list (on Gmane) the rate is currently at about 50 posts a day. This has roughly doubled since the beginning of the year. Some of the increase is down to the momentum behind Hbase (which provides Bigtable-like capabilities on top of Hadoop), but I think it is also down to general growth - more people seem to be participating in development than ever before. This is great! Obviously the risk is that at such a rate of development Hadoop becomes unstable, but so far it looks like it's under control - the (informal) feedback I've had tells me the 0.13.0 release is one of the most stable we've done.