Storm, distributed and fault-tolerant realtime computation

Building An Open Source, Distributed Google Clone Disclosure: the author of this article, Emre Sokullu, joined Hakia as a Search Evangelist in March 2007. The following article in no way represents Hakia's views; it is Emre's personal opinion only. Google is like a young mammoth: already very strong, but still growing. Healthy quarterly results and rising expectations in the online advertising space are the biggest factors keeping Google's pace on NASDAQ. But now let's think outside the box and try to imagine a Google-killer scenario. You may know that I am obsessed with open source (e.g. my projects openhuman and simplekde), so my proposal is open source based; I'll call it Google@Home. First, let me define what my concept of Google@Home is. Comparison to Wikiasari The distributed nature of the engine is what makes it different from Wikipedia co-founder Jimmy Wales' Wikiasari project, which is an open source, wiki-inspired search engine. Why an open source search engine? Who would create an open source Google clone?

The Scala Programming Language

Kafka Prior releases: 0.7.x, 0.8.0. 1. Getting Started 1.1 Introduction Kafka is a distributed, partitioned, replicated commit log service. What does all that mean? First let's review some basic messaging terminology: Kafka maintains feeds of messages in categories called topics. Communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. Topics and Logs Let's first dive into the high-level abstraction Kafka provides: the topic. A topic is a category or feed name to which messages are published. Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log. The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time. In fact, the only metadata retained on a per-consumer basis is the position of the consumer in the log, called the "offset". The partitions in the log serve several purposes. Distribution Producers Consumers Guarantees
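To make the topic/partition/offset model above concrete, here is a minimal, single-process Ruby sketch of one partition's commit log. It is not a real Kafka client; the `PartitionLog` class and its methods are illustrative names invented for this example. The key idea it demonstrates is that producers append and consumers track their own offset, which is the only per-consumer state involved.

```ruby
# Illustrative sketch only -- an in-memory stand-in for one Kafka
# partition's commit log, not the Kafka protocol or client API.
class PartitionLog
  def initialize
    @messages = []
  end

  # Producers append; the message's position in the log is its offset.
  def append(message)
    @messages << message
    @messages.size - 1 # offset of the message just appended
  end

  # Consumers read from an offset they track themselves; the "broker"
  # keeps no per-consumer state beyond what the consumer remembers.
  def read_from(offset, max = 10)
    @messages[offset, max] || []
  end
end

log = PartitionLog.new
log.append("event-1")
log.append("event-2")
log.append("event-3")

consumer_offset = 0
batch = log.read_from(consumer_offset, 2)
consumer_offset += batch.size
# batch => ["event-1", "event-2"]; consumer_offset => 2
```

Because the log is append-only and immutable, a consumer can re-read old messages simply by rewinding its offset, which is also why Kafka can retain messages whether or not they have been consumed.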

Ruby Multithreading Traditional programs have a single thread of execution: the statements or instructions that comprise the program are executed sequentially until the program terminates. A multithreaded program has more than one thread of execution. Within each thread, statements are executed sequentially, but the threads themselves may be executed in parallel—on a multicore CPU, for example. Ruby makes it easy to write multithreaded programs with the Thread class. Creating Ruby Threads: To start a new thread, just associate a block with a call to Thread.new.

# Thread #1 is running here
Thread.new {
  # Thread #2 runs this code
}
# Thread #1 runs this code

Thread Lifecycle: New threads are created with Thread.new. There is no need to start a thread after creating it; it begins running automatically when CPU resources become available. Threads and Exceptions: Thread Variables: Thread Priorities:
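The worked example referenced in the original text appears to have been lost in extraction. As a stand-in, here is a small, self-contained multithreaded Ruby program: several threads do work concurrently, a Mutex protects the shared results array, and #join waits for each thread to finish.

```ruby
# Spawn three threads; each computes a square and appends it to a
# shared array. The Mutex serializes access to the shared state.
results = []
mutex = Mutex.new

threads = (1..3).map do |i|
  Thread.new(i) do |n|
    value = n * n
    mutex.synchronize { results << value } # protect shared state
  end
end

threads.each(&:join)      # wait for all threads to complete
puts results.sort.inspect # => [1, 4, 9]
```

Passing `i` into `Thread.new(i)` gives each thread its own copy of the loop variable, a common idiom that avoids all threads closing over the same variable.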

Social Networking 3.0: From Self-expression to Group Action My favorite social networking site is one that makes $10B in revenue per year, has no infrastructure costs, no salesforce, and no management team. Can you guess which one it is? I can't tell you. It's invite-only. You'd know if you knew. Our site is different from others in that it's owned entirely by its users. That's right: this open source community makes nearly $10B in revenue per year, with room to grow to $50B. Advertisers spend about $2000 per person on advertising per year. So how do we build our own ad network? The first part is easier. How do we determine who sees what ad? No data centers, no sales force, no infrastructure costs, and $10B in revenue. That's the beauty of my new social network. What do we do with all the money?

Spark, an alternative for fast data analytics Spark is an open source cluster computing environment similar to Hadoop, but it has some useful differences that make it superior for certain workloads—namely, Spark provides in-memory distributed datasets that optimize iterative workloads in addition to interactive queries. Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which can manipulate distributed datasets as if they were local collections. Although Spark was created to support iterative jobs on distributed datasets, it's actually complementary to Hadoop and can run side by side over the Hadoop file system. Spark cluster computing architecture Although Spark has similarities to Hadoop, it represents a new cluster computing framework with useful differences. Spark also introduces an abstraction called resilient distributed datasets (RDDs). Figure 1. Spark programming model Brief introduction to Scala Scala illustrated
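The RDD abstraction mentioned above rests on two ideas: transformations are recorded lazily as a lineage, and nothing runs until an action forces evaluation. As a rough, single-process sketch of that idea, here is a toy Ruby class; the name `ToyRDD` and its methods are invented for illustration and bear no relation to Spark's real API.

```ruby
# Toy sketch of the RDD idea, not Spark: map/filter only record an
# operation (the "lineage"); collect replays the lineage over the data.
class ToyRDD
  def initialize(data, ops = [])
    @data = data
    @ops = ops # recorded lineage of transformations
  end

  def map(&blk)
    ToyRDD.new(@data, @ops + [[:map, blk]])
  end

  def filter(&blk)
    ToyRDD.new(@data, @ops + [[:filter, blk]])
  end

  # Action: evaluate the pipeline. In real Spark this is where work
  # is scheduled across the cluster; here it just runs locally.
  def collect
    @ops.reduce(@data) do |acc, (op, blk)|
      op == :map ? acc.map(&blk) : acc.select(&blk)
    end
  end
end

rdd = ToyRDD.new((1..6).to_a)
result = rdd.map { |x| x * 2 }.filter { |x| x > 6 }.collect
# result => [8, 10, 12]
```

The lineage is also what makes real RDDs "resilient": if a partition is lost, it can be recomputed from the source data by replaying the recorded transformations.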

For fast, interactive Hadoop queries, Drill may be the answer — Cloud Computing News

grosser/parallel
