Streaming Data

HAWQ. Pivotal. Prescient Transforms 48,000+ Data Sources in Real Time with Apache NiFi. Apache Aurora. The Dawn of the Orchestration-for-All Era: Docker welcomes the creators of the Aurora project. Ambari/ at trunk · apache/ambari. Ambari User's Guide Overview Hadoop is a large-scale, distributed data storage and processing infrastructure using clusters of commodity hosts networked together.

Monitoring and managing such complex distributed systems is a non-trivial task.


Avro. Kafka. Spark. The One-Stop Shop for Big Data. The Hadoop Ecosystem: HDFS, Yarn, Hive, Pig, HBase and Growing. Hadoop is the leading open-source software framework developed for scalable, reliable and distributed computing.

With the world producing data in the zettabyte range, there is a growing need for cheap, scalable, reliable and fast computing to process and make sense of all of this data. The underlying technology for the Hadoop framework was created by Google, as there was no software on the market that fit Google's needs. Indexing the web and analysing search patterns required deep and computationally intensive analytics that would help Google improve its user behaviour algorithms. Hadoop is built for exactly that: it runs on a large number of machines that share the workload to optimise performance. Moreover, Hadoop replicates the data across the machines, ensuring that processing is not disrupted if one or more machines stop working.

Galway Data Meetup Mesos Talk - Google Slides. Mesos. Mesos/kafka. Building and Deploying Application to Apache Mesos. A Closer Look at RDDs. WebHDFS REST API. Document Conventions Introduction Authentication When security is off, the authenticated user is the username specified in the query parameter.
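As an illustration of the security-off case in the WebHDFS excerpt above, the following sketch builds a request URL that carries the username in the `user.name` query parameter. The helper name `webhdfs_url`, the host, the port and the file path are hypothetical placeholders, not values from the source; only the `/webhdfs/v1` prefix, the `op` parameter and the `user.name` parameter follow the WebHDFS URL layout.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, user=None):
    """Build a WebHDFS URL for a cluster running with security off.

    Hypothetical helper for illustration only. When `user` is given it is
    sent as the `user.name` query parameter; when it is omitted, the server
    decides whether to fall back to a default web user or return an error.
    """
    query = {}
    if user is not None:
        query["user.name"] = user
    query["op"] = op
    # quote() keeps "/" intact and percent-encodes unsafe characters.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"


# Placeholder host, port and path (assumptions, not from the source).
print(webhdfs_url("namenode.example.com", 9870, "/tmp/data.txt", "OPEN", user="alice"))
```

The resulting string can be passed to curl or any HTTP client. With security on, the `user.name` parameter is not sufficient, and a delegation token or Kerberos SPNEGO is needed instead.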


If the parameter is not set, the server may either set the authenticated user to a default web user, if there is any, or return an error response. When security is on, authentication is performed by either Hadoop delegation token or Kerberos SPNEGO. Below are examples using the curl command tool. Home. ODPi: the open ecosystem of big data. Tips and Tricks for Running Spark On Hadoop, Part 3: RDD Persistence - Altiscale. In Parts 1 and 2 of this blog series we discussed Spark execution modes and how to troubleshoot problems and exceptions in Spark applications, respectively.

In this blog, Part 3, we’re going to discuss how to increase performance through resilient distributed dataset (RDD) persistence. Spark revolves around the concept of an RDD, which is a fault-tolerant collection of elements that can be operated on in parallel. RDDs are one of the cornerstones of Spark architecture, and configuring them properly is essential for optimal Spark performance.

Here, we’ll explain various RDD persistence options and delve a bit more deeply into one commonly used option—memory-only. Storage Persistence Options for RDDs Recent Spark versions such as 1.4.1 offer the following storage persistence options for RDDs: Splice Machine - The Hadoop RDBMS. YCSB, the Open Standard for NoSQL Benchmarking, Joins Cloudera Labs. YCSB, the open standard for comparative performance evaluation of data stores, is now available to CDH users for their Apache HBase deployments via new packages from Cloudera Labs.

Many factors go into deciding which data store should be used for production applications, including basic features, data model, and the performance characteristics for a given type of workload. It’s critical to have the ability to compare multiple data stores intelligently and objectively so that you can make sound architectural decisions. The Yahoo! Cloud Serving Benchmark (YCSB), an open source framework for evaluating and comparing the performance of multiple types of data-serving systems (including NoSQL stores such as Apache HBase, Apache Cassandra, Redis, MongoDB, and Voldemort), has long been the de facto open standard for this purpose.

Tabula: Extract Tables from PDFs. Flafka: Apache Flume Meets Apache Kafka for Event Processing. The new integration between Flume and Kafka offers sub-second-latency event processing without the need for dedicated infrastructure.

In this previous post you learned some Apache Kafka basics and explored a scenario for using Kafka in an online application. This post takes you a step further and highlights the integration of Kafka with Apache Hadoop, demonstrating both a basic ingestion capability as well as how different open-source components can be easily combined to create a near-real time stream processing workflow using Kafka, Apache Flume, and Hadoop. Inside Santander’s Near Real-Time Data Ingest Architecture. Learn about the near real-time data ingest architecture for transforming and enriching data streams using Apache Flume, Apache Kafka, and RocksDB at Santander UK.

Cloudera Professional Services has been working with Santander UK to build a near real-time (NRT) transactional analytics system on Apache Hadoop. The objective is to capture, transform, enrich, count, and store a transaction within a few seconds of a card purchase taking place. The system receives the bank’s retail customer card transactions and calculates the associated trend information aggregated by account holder and over a number of dimensions and taxonomies.

Designing Fraud-Detection Architecture That Works Like Your Brain Does. To design effective fraud-detection architecture, look no further than the human brain (with some help from Spark Streaming and Apache Kafka).

At its core, fraud detection is about detecting whether people are behaving “as they should,” otherwise known as catching anomalies in a stream of events. This goal is reflected in diverse applications such as detecting credit-card fraud, flagging patients who are doctor shopping to obtain a supply of prescription drugs, or identifying bullies in online gaming communities. To understand how to design an effective fraud-detection architecture, one needs to examine how the human brain learns to detect anomalies and react to them. As it turns out, our brains have multiple systems for analyzing information. Genome Bioinformatics: FAQ. The multiple alignment format stores a series of multiple alignments in a format that is easy to parse and relatively easy to read.
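To make the "easy to parse" claim about the multiple alignment format concrete, here is a minimal sketch of a reader. It assumes the UCSC layout in which an `a` line opens an alignment block, each `s` line carries `src start size strand srcSize text`, a blank line closes a block, and `#` lines are comments or headers. The names `MafSeq` and `parse_maf` and the sample data are hypothetical and invented for illustration.

```python
from typing import NamedTuple


class MafSeq(NamedTuple):
    """One 's' line of a MAF alignment block (hypothetical container)."""
    src: str
    start: int
    size: int
    strand: str
    src_size: int
    text: str


def parse_maf(lines):
    """Return a list of blocks; each block is a list of MafSeq rows."""
    blocks, current = [], None
    for raw in lines:
        line = raw.strip()
        if not line:
            current = None  # a blank line ends the current block
            continue
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.split()
        if fields[0] == "a":
            current = []  # an 'a' line starts a new alignment block
            blocks.append(current)
        elif fields[0] == "s" and current is not None:
            src, start, size, strand, src_size, text = fields[1:7]
            current.append(
                MafSeq(src, int(start), int(size), strand, int(src_size), text)
            )
    return blocks


# Invented sample data, not taken from any real genome.
SAMPLE = """##maf version=1
a score=10.0
s speciesA.chr1 100 4 + 1000 ACG-T
s speciesB.chr2 200 5 - 2000 ACGTT

a score=5.0
s speciesA.chr1 300 3 + 1000 GGA
"""

blocks = parse_maf(SAMPLE.splitlines())
print(len(blocks), [len(b) for b in blocks])
```

Other line types that a real file may contain are simply skipped by this sketch.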

This format stores multiple alignments at the DNA level between entire genomes. Previously used formats are suitable for multiple alignments of single proteins or regions of DNA without rearrangements, but would require considerable extension to cope with genomic issues such as forward and reverse strand directions, multiple pieces to the alignment, and so forth. General Structure The .maf format is line-oriented. Securely Explore Your Data. Polyglot Processing. The story begins with Neal Ford's 2006 post on polyglot programming.

People started thinking about when to use what kind of programming language: weakly-typed script languages vs. strongly-typed compiled ones or functional vs. object-oriented languages, etc.: Applications of the future will take advantage of the polyglot nature of the language world. … We should embrace this idea. … It's all about choosing the right tool for the job and leveraging it correctly.

Then, in 2011, Martin Fowler came along and coined the term polyglot persistence, prompting people to ask whether a relational database is truly the best fit for storing their data for every kind of workload or use case. The Apache Spark Stack. Lambda Architecture » λ Lambda Architecture. Making Sense of it All Building a well-designed, reliable and functional big data application that caters to a variety of end-user latency requirements can be an extremely challenging proposition.

It can be daunting enough just to keep up with the rapid pace of technology innovation happening in this space, let alone build applications that work for the problem at hand. “Start slow and build one application at a time” is perhaps the most common advice given to beginners today. However, there are certain high-level architectural constructs that can help you mentally visualize how different types of applications fit into the big data architecture and how some of these technologies are transforming the existing enterprise software landscape.

Lambda Architecture Lambda Architecture is a useful framework for thinking about the design of big data applications. Overview of the Lambda Architecture. Onurakpolat/awesome-bigdata. Hadoop 2 hive cloud palo alto. Spark for Data Science: A Case Study. I’m a pretty heavy Unix user and I tend to prefer doing things the Unix Way™, which is to say, composing many small command line oriented utilities. With composability comes power and with specialization comes simplicity. However, if two utilities are used together all the time, it sometimes makes sense to have either a utility that specializes in a very common use case, or one utility that provides basic functionality from another utility.


HBase. Hadoop. Three approaches to parallelizing data transformation. August 26, 2008.