Distributed systems for fun and profit. Introduction I wanted a text that would bring together the ideas behind many of the more recent distributed systems - systems such as Amazon's Dynamo, Google's BigTable and MapReduce, Apache's Hadoop and so on.
In this text I've tried to provide a more accessible introduction to distributed systems. To me, that means two things: introducing the key concepts that you will need in order to have a good time reading more serious texts, and providing a narrative that covers things in enough detail that you get a gist of what's going on without getting stuck on details. It's 2013, you've got the Internet, and you can selectively read more about the topics you find most interesting. Readings in distributed systems. This post is a work in progress.
Inspired by a recent purchase of the Red Book, which provides a curated list of important papers around database systems, I’ve decided to begin assembling a list of important papers in distributed systems. Similar to the Red Book, I’ve broken each group of papers out into a series of categories, each highlighting a progression of related ideas over time focused in a specific area of research within the field.
Keeping the tradition of the Red Book, I’ve included both papers which resulted in very successful systems and/or techniques, as well as papers which introduced a concept which was either immediately dismissed or proven incorrect. This emphasizes the progression of ideas which lead to the development of these systems. Readings in distributed systems. (194) What is the best literature on the design of database platforms? Why? Queues. Distributed Systems Reading List. Introduction I often argue that the toughest thing about distributed systems is changing the way you think.
The below is a collection of material I've found useful for motivating these changes. Thought Provokers Ramblings that make you think about the way you design. Not everything can be solved with big servers, databases and transactions. Amazon Somewhat about the technology but more interesting is the culture and organization they've created to work with it. Distributed Systems Reading List. Introduction I often argue that the toughest thing about distributed systems is changing the way you think.
The below is a collection of material I've found useful for motivating these changes. Thought Provokers Ramblings that make you think about the way you design. Not everything can be solved with big servers, databases and transactions. Amazon Somewhat about the technology but more interesting is the culture and organization they've created to work with it. Distributed Caches. The Dempsy Real-Time BigData Framework. Welcome to the Dempsy project - a framework for the easy implementation of stream-based real-time fully-distributed elastic analytics processing applications.
Features If you're already familiar with real-time stream based BigData engines, the following list of features will distinguish Dempsy from the others: Fine grained "actor model": Dempsy provides for the fine grained distribution and lifecycle management of (potentially) millions of "actors" (message processors in Dempsy parlance) across a large cluster of machines allowing developers to write code that concentrates on handling individual data points in a stream without any concern for concurrency. Inversion of control programming paradigm: Dempsy allows developers to construct these large-scale processing applications decoupled from all infrastructure concerns providing a means for simple and testable POJO implementations. Documentation See the Dempsy User Guide. This Api documentation is currently for the released 0.7.9 revision. Model-based monitoring with CFEngine - CFEngine - Distributed Configuration Management.
"A model is a lie that helps you to see the truth.
" (Howard Skipper) "There is nothing more practical than a good theory. " (Kurt Lewin) The past year has seen a plethora, one might even say an entire movement, of talks and blog posts under the heading "Monitoring Sucks". Plenty of valid criticisms have been made about the state of the art in monitoring. Mesos: Dynamic Resource Sharing for Clusters. Process Perfection. Well over a year ago, in a conversation with Alexis Richardson, I came up with a catchy acronym to articulate an idea that I had been kicking around as a simple way to respond to all of the Sturm und Drang in the press and the blogosphere about "lock-in", "data portability" and reliability of cloud computing providers.
I said -- "You know what, mate, done properly, it would be like a RAID setup -- it would be an array of cloud providers. Umm, yeah, it would be RAIC! 'Redundant Array of Independent Cloud providers'". Alexis, as I recall, burst out laughing, and said something like "You better trademark that, Mark. That's great. " A few weeks later, I sat down, and wrote a blog post to try to describe the idea in some detail. Despite all that, the term has gotten some traction. (51) Distributed Systems: What are the best resources for learning about distributed file systems.
(51) What are some good resources for learning about distributed computing? Why. Distributed-systems-readings. CAP Twelve Years Later: How the "Rules" Have Changed. This article first appeared in Computer magazine and is brought to you by InfoQ & IEEE Computer Society.
The CAP theorem asserts that any networked shared-data system can have only two of three desirable properties. However, by explicitly handling partitions, designers can optimize consistency and availability, thereby achieving some trade-off of all three. In the decade since its introduction, designers and researchers have used (and sometimes abused) the CAP theorem as a reason to explore a wide variety of novel distributed systems. The NoSQL movement also has applied it as an argument against traditional databases.
Design and Implementation of a Real-Time Cloud Analytics Platform. In Memory Data Grid Technologies. After winning a CSC Leading Edge Forum (LEF) research grant, I (Paul Colmer) wanted to publish some of the highlights of my research to share with the wider technology community.
What is an In Memory Data Grid? It is not an in-memory relational database, a NOSQL database or a relational database. It is a different breed of software datastore. In summary an IMDG is an ‘off the shelf’ software product that exhibits the following characteristics: The data model is distributed across many servers in a single location or across multiple locations. All servers can be active in each site.All data is stored in the RAM of the servers.Servers can be added or removed non-disruptively, to increase the amount of RAM available.The data model is non-relational and is object-based. There are also hardware appliances that exhibit all these characteristics. IndexTank is now open source! We are proud to announce that the technology behind IndexTank has just been released as open-source software under the Apache 2.0 License!
We promised to do this when LinkedIn acquired IndexTank, so here we go: indextank-engine: Indexing engine indextank-service: API, BackOffice, Storefront, and Nebulizer We know that many of our users and other interested parties have been patiently waiting for this release. We want to thank you for your patience, for your kind emails, and for your continued support. We are looking forward to seeing IndexTank thrive as an open-source project. What's IndexTank? What, you had never heard of IndexTank until now? IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text.
Most of you won’t need the third part. Try it out! Head over to GitHub and give the IndexTank Engine and IndexTank Service code a try. Thanks again and happy holidays! The IndexTank dev team. Paper: Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS. Teams from Princeton and CMU are working together to solve one of the most difficult problems in the repertoire: scalable geo-distributed data stores. Major companies like Google and Facebook have been working on multiple datacenter database functionality for some time, but there's still a general lack of available systems that work for complex data scenarios. The ideas in this paper--Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS--are different. It's not another eventually consistent system, or a traditional transaction oriented system, or a replication based system, or a system that punts on the issue.
It's something new, a causally consistent system that achieves ALPS system properties. ALPS sounds great, but we want more, we want consistency guarantees as well. Intrigued? In a talk on COPS, Wyatt Lloyd, defines consistency as a restriction on the ordering and timing of operations. The driver for causal consistency is low latency. Lock-free extensible hash tables back by split-ordered lists; a summary. Save this one for a crazy Friday night! : programming. Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison. DIDO_White_Paper_110727.
Engineering: Tools for Debugging Distributed Systems. Golden Orb. Realtime Hadoop usage at Facebook: The Complete Story. Neil’s Research Blog. Advanced Computer Science Courses : Paper Trail. Below I’ve collected some links to advanced computer science courses on-line. I’m concentrating on courses with good lecture notes, rather than video lectures, and I’m applying a rather arbitrary filter for quality (otherwise this becomes a directory with less semantic utility).
This is the good stuff! But only a subset of it – any recommendations for good courses are gratefully received. I’m mainly interested in systems, data-structures and mathematics, so reserve the right to choose topics at will. Courses are organised by broad topic. Systems Graduate level operating systems courses don’t typically have notes – they all come with long reading lists taken from SOSP and other places. Cornell CS 614 – Advanced Course in Computer Systems – Ken Birman teaches this course. Databases. Zuse-Institut Berlin: Publikationen. 35+ Use Cases for Choosing Your Next NoSQL Database. We've asked What The Heck Are You Actually Using NoSQL For?. We've asked 101 Questions To Ask When Considering A NoSQL Database. We've even had a webinar What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications. Now we get to the point of considering use cases and which systems might be appropriate for those use cases.
What are your options? First, let's cover what are the various data models. Document Databases.