Hadoop has beaten the record for the terabyte sort benchmark, bringing it from 297 seconds to 209. Owen O'Malley wrote the MapReduce program (which by the way has a clever partitioner to ensure the reducer outputs are globally sorted and not just sorted per output partition, which is what the default sort does), and then ran it on 910 nodes on Yahoo!'s cluster. There are more details in Owen's blog post (and there's a link to the benchmark page which has a PDF explaining his program). You can also look at the code in trunk.
Well done Owen and well done Hadoop!
Showing posts with label MapReduce. Show all posts
Showing posts with label MapReduce. Show all posts
Thursday, 3 July 2008
Friday, 20 June 2008
Hadoop Query Languages
If you want a high-level query language for drilling into your huge Hadoop dataset, then you've got some choice:
Meanwhile, to learn more I recommend Pig Latin: A Not-So-Foreign Language for Data Processing (by Chris Olston et al), and the slides and videos from the Hadoop Summit.
(And I haven't even included Cascading, from Chris K. Wensel, which, while not a query language per se, is an abstraction built on MapReduce for building data processing flows in Java or Groovy using a plumbing metaphor with constructs such as taps, pipes, and flows. Well worth a look too.)
- Pig, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files.
- Jaql, from IBM and soon to be open sourced, is a declarative query language for JSON data.
- Hive, from Facebook and soon to become a Hadoop contrib module, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.
Meanwhile, to learn more I recommend Pig Latin: A Not-So-Foreign Language for Data Processing (by Chris Olston et al), and the slides and videos from the Hadoop Summit.
(And I haven't even included Cascading, from Chris K. Wensel, which, while not a query language per se, is an abstraction built on MapReduce for building data processing flows in Java or Groovy using a plumbing metaphor with constructs such as taps, pipes, and flows. Well worth a look too.)
Labels:
Hadoop,
MapReduce,
Query Languages
Saturday, 22 March 2008
Learning MapReduce
Update: I've posted my answers to the exercises. Let me know if you find any mistakes. Also: Tamara Petroff has posted a write up of the session.
On Wednesday [19 March], I ran a session at SPA 2008 entitled "Understanding MapReduce with Hadoop". SPA is a very hands-on conference, with many sessions having a methodological slant, so I wanted to get people who had never encountered MapReduce before actually writing MapReduce programs. I only had 75 minutes, so I decided against getting people coding on their laptops. (In hindsight this was a good decision, as I went to several other sessions where we struggled to get the software installed.) Instead, we wrote MapReduce programs on paper, using a simplified notation.
It seemed to work. For the first half hour, I gave as minimal an introduction to MapReduce as I could, then the whole group spent the next half hour working in pairs to express the solutions to a number of exercises as MapReduce programs. We spent the last 15 minutes comparing notes and discussing some of the solutions to the problems.
There were six exercises, presented in rough order of difficulty, and I'm pleased to say that every pair managed to solve at least one. Here's some of the feedback I got:
Finally, thanks to Robert Chatley for being a guinea pig for the exercises, and for helping out on the day with participants' questions during the session.
On Wednesday [19 March], I ran a session at SPA 2008 entitled "Understanding MapReduce with Hadoop". SPA is a very hands-on conference, with many sessions having a methodological slant, so I wanted to get people who had never encountered MapReduce before actually writing MapReduce programs. I only had 75 minutes, so I decided against getting people coding on their laptops. (In hindsight this was a good decision, as I went to several other sessions where we struggled to get the software installed.) Instead, we wrote MapReduce programs on paper, using a simplified notation.
It seemed to work. For the first half hour, I gave as minimal an introduction to MapReduce as I could, then the whole group spent the next half hour working in pairs to express the solutions to a number of exercises as MapReduce programs. We spent the last 15 minutes comparing notes and discussing some of the solutions to the problems.
There were six exercises, presented in rough order of difficulty, and I'm pleased to say that every pair managed to solve at least one. Here's some of the feedback I got:
- Some struggled to know what input data formats to use. Perhaps I glossed over this too much - I didn't want people to worry about precisely how the data was encoded - but I could have emphasised more that you can have the data presented to your map function in any way that's convenient.
- While most people understood the notation I used for writing the map and reduce functions, it did cause some confusion. For example, someone wanted to see the example code again so they could understand what was going on. And another person said it took a while to realise that they could do arbitrary processing as a part of the map and reduce functions. It would be interesting to do the session again but using Java notation.
- It was quite common for people to try to do complex things in their map and reduce functions - they felt bad if they just used an identity function, because it was somehow a waste. And on a related note, chaining map reduce jobs together wasn't obvious to many. But once pointed out, folks had an "aha!" moment and were quick to exploit it.
- The fact that you typically get multiple reduce outputs prompted questions from some - "but how do you combine them into a single answer?". Talking about chained MapReduce helped here again.
- Everyone agreed that it wasn't much like functional programming.
- Find the [number of] hits by 5 minute timeslot for a website given its access logs.
- Find the pages with over 1 million hits in day for a website given its access logs.
- Find the pages that link to each page in a collection of webpages.
- Calculate the proportion of lines that match a given regular expression for a collection of documents.
- Sort tabular data by a primary and secondary column.
- Find the most popular pages for a website given its access logs.
Finally, thanks to Robert Chatley for being a guinea pig for the exercises, and for helping out on the day with participants' questions during the session.
Tuesday, 18 March 2008
"Disks have become tapes"
MapReduce is a programming model for processing vast amounts of data. One of the reasons that it works so well is because it exploits a sweet spot of modern disk drive technology trends. In essence MapReduce works by repeatedly sorting and merging data that is streamed to and from disk at the transfer rate of the disk. Contrast this to accessing data from a relational database that operates at the seek rate of the disk (seeking is the process of moving the disk's head to a particular place on the disk to read or write data).
So why is this interesting? Well, look at the trends in seek time and transfer rate. Seek time has grown at about 5% a year, whereas transfer rate at about 20% [1]. Seek time is growing more slowly than transfer rate - so it pays to use a model that operates at the transfer rate. Which is what MapReduce does. I first saw this observation in Doug Cutting's talk, with Eric Baldeschwieler, at OSCON last year, where he worked through the numbers for updating a 1 terabyte database using the two paradigms B-Tree (seek-limited) and Sort/Merge (transfer-limited). (See the slides and video for more detail.)
The general point was well summed up by Jim Gray in an interview in ACM Queue from 2003:
But even the growth of transfer rate is dwarfed by another measure of disk drives - capacity, which is growing at about 50% a year. David DeWitt argues that since the effective transfer rate of drives is falling we need database systems that work with this trend - such as column-store databases and wider use of compression (since this effectively increases the transfer rate of a disk). Of existing databases he says:
[1] See Trends in Disk Technology by Michael D. Dahlin for changes between 1987-1994. For the period since then these figures still hold - as it's relatively easy to check using manufacturer's data sheets, although with seek time it's harder to tell since the definitions seem to change from year to year and from manufacturer to manufacturer. Still, 5% is generous.
So why is this interesting? Well, look at the trends in seek time and transfer rate. Seek time has grown at about 5% a year, whereas transfer rate at about 20% [1]. Seek time is growing more slowly than transfer rate - so it pays to use a model that operates at the transfer rate. Which is what MapReduce does. I first saw this observation in Doug Cutting's talk, with Eric Baldeschwieler, at OSCON last year, where he worked through the numbers for updating a 1 terabyte database using the two paradigms B-Tree (seek-limited) and Sort/Merge (transfer-limited). (See the slides and video for more detail.)
The general point was well summed up by Jim Gray in an interview in ACM Queue from 2003:
... programmers have to start thinking of the disk as a sequential device rather than a random access device.Or the more pithy: "Disks have become tapes." (Quoted by David DeWitt.)
But even the growth of transfer rate is dwarfed by another measure of disk drives - capacity, which is growing at about 50% a year. David DeWitt argues that since the effective transfer rate of drives is falling we need database systems that work with this trend - such as column-store databases and wider use of compression (since this effectively increases the transfer rate of a disk). Of existing databases he says:
Already we see transaction processing systems running on farms of mostly empty disk drives to obtain enough seeks/second to satisfy their transaction processing rates.But this applies to transfer rate too (or if it doesn't yet, it will). Replace "seeks" with "transfers" and "transaction processing" with "MapReduce" and I think over time we'll start seeing Hadoop installations that choose to use large numbers of smaller capacity disks to maximize their processing rates.
[1] See Trends in Disk Technology by Michael D. Dahlin for changes between 1987-1994. For the period since then these figures still hold - as it's relatively easy to check using manufacturer's data sheets, although with seek time it's harder to tell since the definitions seem to change from year to year and from manufacturer to manufacturer. Still, 5% is generous.
Sunday, 2 March 2008
MapReduce without the Reduce
There's a class of MapReduce applications that use Hadoop just for its distributed processing capabilities. Telltale signs are:
1. Little or no input data of note. (Certainly not large files stored in HDFS.)
2. Map tasks are therefore not limited by their ability to consume input, but by their ability to run the task, which depending on the application may be CPU-bound or IO-bound.
3. Little or map output.
4. No reducers (set by
This seems to work well - indeed the CopyFiles program in Hadoop (aka
1. The input to each map task is a source file and a destination.
2. The map task is limited by its ability to copy the source to the destination (IO-bound).
3. The map output is used as a convenience to record files that were skipped.
4. There are no reducers.
Combined with Streaming this is a neat way to distribute your processing in any language. You do need a Hadoop cluster, it is true, but CPU-intensive jobs would happily co-exist with more traditional MapReduce jobs, which are typically fairly light on CPU usage.
1. Little or no input data of note. (Certainly not large files stored in HDFS.)
2. Map tasks are therefore not limited by their ability to consume input, but by their ability to run the task, which depending on the application may be CPU-bound or IO-bound.
3. Little or map output.
4. No reducers (set by
conf.setNumReduceTasks(0)).This seems to work well - indeed the CopyFiles program in Hadoop (aka
distcp) follows this pattern to efficiently copy files between distributed filesystems:1. The input to each map task is a source file and a destination.
2. The map task is limited by its ability to copy the source to the destination (IO-bound).
3. The map output is used as a convenience to record files that were skipped.
4. There are no reducers.
Combined with Streaming this is a neat way to distribute your processing in any language. You do need a Hadoop cluster, it is true, but CPU-intensive jobs would happily co-exist with more traditional MapReduce jobs, which are typically fairly light on CPU usage.
Wednesday, 30 January 2008
Hadoop and Log File Analysis
I've always thought that Hadoop is a great fit for analyzing log files (I even wrote an article about it). The big win is that you can write ad hoc MapReduce queries against huge datasets and get results in minutes or hours. So I was interested to read Stu Hood's recent post about using Hadoop to analyze email log data:
Can we take this further? It seems to me that there is a gap in the market for an open source web traffic analysis tool. Think Google Analytics where you can write your own queries. I wonder who's going to build such a thing?
Here at Mailtrust, Rackspace’s mail division, we are taking advantage of Hadoop to help us wrangle several hundred gigabytes of email log data that our mail servers generate each day. We’ve built a great tool for our support team that lets them search mail logs in order to troubleshoot problems for customers. Until recently, this log search and storage system was centered around a traditional relational database, which worked fine until the exponential growth in the volume of our dataset overcame what a single machine could cope with. The new logging backend we’ve developed based on Hadoop gives us virtually unlimited scalability.The best bit was when they wrote a MapReduce query to find the geographic distribution of their users.
This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.It's great when a technology has the ability to make such a positive contribution to your business. In Doug Cutting's words, it is "transformative".
Can we take this further? It seems to me that there is a gap in the market for an open source web traffic analysis tool. Think Google Analytics where you can write your own queries. I wonder who's going to build such a thing?
Sunday, 13 January 2008
MapReduce, Map Reduce, Map/Reduce or Map-Reduce?
Although I've seen the other variants (and used some of them myself), Google call it "MapReduce", so that seems like the right thing to call it to me, since they invented it. The usage figures seem to back up this conclusion. "MapReduce" (no space) has 87,000 Google hits, while "Map Reduce" (with space) has only 50,200, and the latter includes "Map/Reduce" and "Map-Reduce" variants, since Google (and search engines in general) ignore punctuation.
In this age of case sensitivity and camel case one has to watch out for these things. In fact, I've only just realised that the Hadoop database is called "HBase", not "Hbase". The curse of wiki names. And the logo doesn't help either - it's all caps!
In this age of case sensitivity and camel case one has to watch out for these things. In fact, I've only just realised that the Hadoop database is called "HBase", not "Hbase". The curse of wiki names. And the logo doesn't help either - it's all caps!
Monday, 7 January 2008
Casual Large Scale Data Processing
I think Greg Linden hits the nail on the head when he says of MapReduce at Google:
What is so remarkable about this is how casual it makes large scale data processing. Anyone at Google can write a MapReduce program that uses hundreds or thousands of machines from their cluster. Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time.I think this is a good way of looking at what Hadoop is trying to achieve - to make large scale data processing as natural as small scale data processing. Hadoop can provide infrastructure to do this, but there is also a need for open datasets that are MapReducable. Imagine if I could run a MapReduce program against Google's web crawl database, Amazon's sales data or Facebook's social graph.
Subscribe to:
Posts (Atom)
I am a consultant software engineer, specializing in