Distributed Systems Reading List
Welcome to the Dempsy project - a framework for the easy implementation of stream-based real-time fully-distributed elastic analytics processing applications. Features If you're already familiar with real-time stream based BigData engines, the following list of features will distinguish Dempsy from the others: Fine grained "actor model": Dempsy provides for the fine grained distribution and lifecycle management of (potentially) millions of "actors" (message processors in Dempsy parlance) across a large cluster of machines allowing developers to write code that concentrates on handling individual data points in a stream without any concern for concurrency. Inversion of control programming paradigm: Dempsy allows developers to construct these large-scale processing applications decoupled from all infrastructure concerns providing a means for simple and testable POJO implementations. The Dempsy Real-Time BigData Framework
Model-based monitoring with CFEngine - CFEngine - Distributed Configuration Management "A model is a lie that helps you to see the truth." (Howard Skipper) "There is nothing more practical than a good theory." (Kurt Lewin) The past year has seen a plethora, one might even say an entire movement, of talks and blog posts under the heading "Monitoring Sucks".
Mesos: Dynamic Resource Sharing for Clusters
Process Perfection Well over a year ago, in a conversation with Alexis Richardson, I came up with a catchy acronym to articulate an idea that I had been kicking around as a simple way to respond to all of the Sturm und Drang in the press and the blogosphere about "lock-in", "data portability" and reliability of cloud computing providers. I said -- "You know what, mate, done properly, it would be like a RAID setup -- it would be an array of cloud providers. Umm, yeah, it would be RAIC! 'Redundant Array of Independent Cloud providers'". Alexis, as I recall, burst out laughing, and said something like "You better trademark that, Mark. That's great."
Distributed Systems: What are the best resources for learning about distributed file systems I highly recommend reading NFS Illustrated. Broadly speaking, distributed filesystems have frontends (which deal with naming, namespace, file semantics, file access protocols, locking protocols, serialization formats, resource discovery, authentication, caching, how protocol implementation interacts with the client OS, etc.) and backends (specifically, on-disk formats, write allocation, filesystem consistency, disk interaction, read and write performance, and the like). This book is a somewhat gentle introduction to the frontend aspects and later on also talks about other, more truly distributed filesystems. Distributed is a rather overloaded term in the context of distributed filesystems, so it can mean multiple things: 1.
I'll assume that you mean distributed computing and not distributed databases. If that's the case, you're going to use map-reduce in some form, most likely Hadoop. Don't start by reading a bunch of books and papers that you probably won't understand until you've done some map-reducing yourself. What are some good resources for learning about distributed computing? Why
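For readers who have not yet "done some map-reducing", here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the job. It is not Hadoop code and makes no claim about Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit one (word, 1) pair per word in a single input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    A real framework does this across machines; here it is an in-memory dict.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents (hypothetical driver)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

The map and reduce functions touch only one record or one key at a time, which is what would let a framework run them in parallel on many machines.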
Systems Graduate level operating systems courses don't typically have notes - they all come with long reading lists taken from SOSP and other places. In this way, systems research is a bit more like a humanities subject: it's vital to read the primary sources. Distributed-systems-readings
This article first appeared in Computer magazine and is brought to you by InfoQ & IEEE Computer Society. The CAP theorem asserts that any networked shared-data system can have only two of three desirable properties. However, by explicitly handling partitions, designers can optimize consistency and availability, thereby achieving some trade-off of all three. In the decade since its introduction, designers and researchers have used (and sometimes abused) the CAP theorem as a reason to explore a wide variety of novel distributed systems. The NoSQL movement also has applied it as an argument against traditional databases.
Design and Implementation of a Real-Time Cloud Analytics Platform
In Memory Data Grid Technologies After winning a CSC Leading Edge Forum (LEF) research grant, I (Paul Colmer) wanted to publish some of the highlights of my research to share with the wider technology community. What is an In Memory Data Grid? It is not an in-memory relational database, a NOSQL database or a relational database. It is a different breed of software datastore. In summary an IMDG is an ‘off the shelf’ software product that exhibits the following characteristics:
IndexTank is now open source! We are proud to announce that the technology behind IndexTank has just been released as open-source software under the Apache 2.0 License! We promised to do this when LinkedIn acquired IndexTank, so here we go: indextank-engine (indexing engine) and indextank-service (API, BackOffice, Storefront, and Nebulizer). We know that many of our users and other interested parties have been patiently waiting for this release. We want to thank you for your patience, for your kind emails, and for your continued support.
Teams from Princeton and CMU are working together to solve one of the most difficult problems in the repertoire: scalable geo-distributed data stores. Major companies like Google and Facebook have been working on multiple datacenter database functionality for some time, but there's still a general lack of available systems that work for complex data scenarios. The ideas in this paper--Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS--are different. It's not another eventually consistent system, or a traditional transaction oriented system, or a replication based system, or a system that punts on the issue. It's something new, a causally consistent system that achieves ALPS system properties. Paper: Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
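To make "causally consistent" concrete, here is a toy sketch of one common way to enforce it: a replica holds back a replicated write until every write it depends on is already visible locally. This is an illustration under simplifying assumptions (integer versions, last-writer-wins per key, a hypothetical `Replica` class) and is not the COPS implementation described in the paper.

```python
class Replica:
    """Toy replica that applies a write only after its causal dependencies."""

    def __init__(self):
        self.store = {}    # key -> (version, value) currently visible
        self.pending = []  # replicated writes still waiting on dependencies

    def _satisfied(self, deps):
        # deps maps key -> minimum version that must already be visible here.
        return all(self.store.get(k, (0, None))[0] >= v for k, v in deps.items())

    def receive(self, key, version, value, deps):
        """Accept a replicated write together with its dependency set."""
        self.pending.append((key, version, value, deps))
        self._drain()

    def _drain(self):
        # Keep applying pending writes until no further progress is possible.
        progressed = True
        while progressed:
            progressed = False
            for write in list(self.pending):
                key, version, value, deps = write
                if self._satisfied(deps):
                    # Assumption: the higher version wins for a given key.
                    if version > self.store.get(key, (0, None))[0]:
                        self.store[key] = (version, value)
                    self.pending.remove(write)
                    progressed = True

    def get(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None


if __name__ == "__main__":
    r = Replica()
    # The comment depends on the photo, but arrives first.
    r.receive("comment", 1, "nice photo", {"photo": 1})
    print(r.get("comment"))  # still hidden: the photo is not visible yet
    r.receive("photo", 1, "beach.jpg", {})
    print(r.get("comment"))  # now visible, after the photo
```

A reader at this replica can never see the comment without the photo it refers to, which is the guarantee that plain eventual consistency does not give.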
That's not a trie, there are only two levels. Oh, I thought that thing was meant to be somehow recursive, my bad. A trie can have limited depth, too, though, see e.g. Haskell's IntMap, where the maximum depth is sizeof(int), and sorting by the lsb's led to the firing of some pattern recognition neurons. As growing is hopscotch's weak point, linear hashing with rather large hopscotch buckets would be a thing worth investigating, I think. Eliminating the worst case might very well be worth one extra cache miss. Lock-free extensible hash tables back by split-ordered lists; a summary. Save this one for a crazy Friday night! : programming
Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison
We've previously written about the importance of internal tooling for creating a culture of empowering engineers and building a leveraged business. Our first example was adding bash completion to a curl wrapper script. Today I'd like to describe some of the internal tooling we use to make ourselves more productive in the distributed service oriented architecture that we maintain in our production environment. The three things I'll be talking about are distributed tracing, profiling across a large group of machines and building a REPL environment for working with your code on an ad hoc basis. Engineering: Tools for Debugging Distributed Systems
Realtime Hadoop usage at Facebook: The Complete Story
Everything is Data | Neil’s Research Blog Botnets are used for various nefarious ends; one popular use is sending spam email by creating and then using accounts on free webmail providers like Hotmail and Google Mail. In the past, CAPTCHAs have been used to try to prevent this, but they are increasingly ineffective. Hence, the BotGraph paper proposes an algorithm for detecting bot-created accounts by analyzing user access behavior.
Below I’ve collected some links to advanced computer science courses on-line. I’m concentrating on courses with good lecture notes, rather than video lectures, and I’m applying a rather arbitrary filter for quality (otherwise this becomes a directory with less semantic utility). This is the good stuff! But only a subset of it – any recommendations for good courses are gratefully received. I’m mainly interested in systems, data-structures and mathematics, so reserve the right to choose topics at will. Courses are organised by broad topic. Advanced Computer Science Courses : Paper Trail
Zuse-Institut Berlin: Publikationen
We've asked What The Heck Are You Actually Using NoSQL For?. We've asked 101 Questions To Ask When Considering A NoSQL Database. We've even had a webinar What Should I Do?