At work we use Mesos to manage a dynamic pool of resources for our Spark jobs and while I’ve also dabbled with Hadoop in the past, I’ve found my understanding of the mechanics and initial design concerns of all of the above a bit lacking. Rather than parrotting the standard marketing lines, I wanted to understand the fundamental reasons and motivation behind treating a pool of machines (virtual or otherwise) as an anonymized pile of aggregate CPUs and RAM for many different types of work. A non-exhaustive list of things I want to understand are:
- How are larger jobs broken down and parallelized?
- How are parts of jobs made to coexist in an environment of constantly changing demands?
- How do we ensure full utilization of extant resources and do we need to in a cloud environment as opposed to simply scaling the resources?
- While I understand the basics of being concerned with “data locality” (get the computation as close as possible to the data), what are some practical ways to achieve this in a general way, applicable to many applications?
In hopes of answering these (and other) questions, I started to compile this reading list of papers that seemed like they would help answer them. An earlier form of this list was posted as a gist here.
At present I have completed the Google File System, MapReduce, and Microsoft Dryad papers, all of which I have thoroughly enjoyed reading and they have already contributed to a better understanding of how I should be approaching Spark.
This list was built primarily by crawling through the references sections of a few different papers. Some of these have come by way of suggestions from Omar and Michael as noted below. In pursuit of a reasonable reading and study plan for myself, I tried to lay the material out in something of a logical path from early successes to more recent ideas and implementations.
If you feel I’ve left out anything particularly awesome or am incorrect in any assumptions/descriptions with respect to the later papers, I’d love to hear about it and will update my list accordingly.
A brief list of papers and ideas that I thought would give me something of a foundation in what the main problems, concerns, and early solutions were.
- Google File System – most of the later material references this paper and it seems like this enabled a lot of other ideas later on. The distribution of chunks of a file around a cluster enables a lot of interesting data locality stuff.
- Google MapReduce – a pretty lucid and well thought out divide-and-conquer approach to parallelizing a huge job. The functional programming part of me loved the LISP shout-out/inspiration.
- Dryad for a more fine-grained/raw approach to programming against a cluster.
- Quincy is, from what I understand so far, a more detailed look at an alternative scheduler for Dryad.
- Delay Scheduling for a detailed look at a different Hadoop scheduler. Note the involvement of Matei Zaharia who later worked on both the Mesos and Spark projects.
Just the two big ones here, unfortunately there don’t seem to be any papers out there on Google’s Borg project.
- Mesos as mentioned above is what I currently use to manage Spark resources.
- Omega is (was?) a project whose aim is/was to fix shortcomings in Google’s existing cluster scheduling/management system (assuming Borg).
More Programming Models
- DryadLINQ is a higher-level programming model built on top of Dryad. Composition FTW.
- Pregel a graph processing paper from Google. References Dryad as well but I’m not sure how close it is as I haven’t read it yet and it seems to be less about dataflow.
- MapReduce Online for an iterative MapReduce.
- Distributed GraphLab sounds like another graph-like computation model but one that embraces asynchrony and parallelism a bit more. Possibly belongs in the next category but I’m not sure yet.
- MillWheel as recommended by Omar talks about stream processing but does appear to tackle the problem with a form of computation graph that looks like a dataflow approach.
- Naiad is relatively recent from Microsoft Research (2013) and intriguingly mentions “cyclic dataflow” (also via Omar).
- Derflow is about “deterministic dataflow” with Erlang. Sounds too good to pass up.
- Condor is referenced both by MapReduce and Dryad as an inspiration for cluster management approaches so I’ve added it to get a bit more perspective at the end.
- The Datacenter as a Computer (thanks to Michael for the reminder) is a bit over 100 pages but seems to cover a lot more of Google’s general approach to clusters and more.