Welcome to Apache Flume — Apache Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.

News: July 2, 2013 - Apache Flume 1.4.0 Released

Using Elasticsearch on Amazon EC2
Elasticsearch is a distributed search server offering powerful search functionality over schema-free documents in (near) real time. All of this functionality is exposed via a RESTful JSON API. It's built on top of Apache Lucene and like all great projects it's open source. Update: I've updated this post to be compatible with recent versions of Elasticsearch.

Syncsort - Resource Center
Delivering Smarter ETL Through Hadoop. Most organizations are using Hadoop to collect, process and distribute data, which is actually ETL (Extract, Transform and Load). But current ETL tools don’t deliver on Hadoop. They aren’t...

gource - software version control visualization
Gource is a software version control visualization tool. See more of Gource in action on the Videos page. Introduction: Software projects are displayed by Gource as an animated tree with the root directory of the project at its centre.

Open sourcing Databus: LinkedIn's low latency change data capture system
Co-authors: Sunil Nagaraj, Shirshanka Das, Kapil Surlaker. We are pleased to announce the open source release of Databus, a real-time change data capture system. Originally developed in 2005, Databus has been in production in its latest revision at LinkedIn since 2011. The Databus source code is available in our GitHub repo for you to get started! What is Databus?

Logstash Plugin for Amazon DynamoDB
The Logstash plugin for Amazon DynamoDB gives you a nearly real-time view of the data in your DynamoDB table. The Logstash plugin for DynamoDB uses DynamoDB Streams to parse and output data as it is added to a DynamoDB table. After you install and activate the Logstash plugin for DynamoDB, it scans the data in the specified table, then starts consuming your updates using Streams and outputs them to Elasticsearch, or a Logstash output of your choice. Logstash is a data pipeline service that processes data, parses data, and then outputs it to a selected location in a selected format. Elasticsearch is a distributed, full-text search server. For more information about Logstash and Elasticsearch, go to

Student's t-test
A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution if the null hypothesis is supported. It can be used to determine if two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t distribution.
History
The t-statistic was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin, Ireland ("Student" was his pen name).[1][2][3][4] Gosset had been hired due to Claude Guinness's policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes.[2] Gosset devised the t-test as a cheap way to monitor the quality of stout.
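
The idea in the t-test definition, replacing an unknown scaling term with an estimate from the data, can be sketched in a few lines of Python. This is a minimal illustration of the two-sample, equal-variance form only; the function name `pooled_t_statistic` and the sample values are made up for the example and do not come from the article.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample Student's t statistic, assuming equal population variances."""
    n1, n2 = len(a), len(b)
    # Pooled sample variance: the data-based estimate that stands in
    # for the unknown scaling term.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    degrees_of_freedom = n1 + n2 - 2
    return t, degrees_of_freedom

# Hypothetical measurements from two batches (illustrative values only).
batch_a = [5.1, 4.9, 5.0, 5.2, 4.8]
batch_b = [5.6, 5.4, 5.5, 5.7, 5.3]

t, df = pooled_t_statistic(batch_a, batch_b)
print(f"t = {t:.3f}, degrees of freedom = {df}")
```

The resulting statistic would then be compared against a Student's t distribution with `df` degrees of freedom to judge whether the two sets differ significantly.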

Red Hat Enterprise Linux Server
Red Hat® Enterprise Linux® servers handle millions of dollars in trades, purchases, and analysis every day. Surprised? Don't be. With support for all major hardware platforms and thousands of commercial and custom applications, Red Hat Enterprise Linux is the new standard for enterprise datacenters.
Built for the modern datacenter

Setting Up for Amazon Kinesis - Amazon Kinesis
Before you use Amazon Kinesis for the first time, complete the following tasks. When you sign up for Amazon Web Services (AWS), your AWS account is automatically signed up for all services in AWS, including Amazon Kinesis. You are charged only for the services that you use.

Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop
This post is the first in a series of blog posts about Cloudera Morphlines, a new command-based framework that simplifies data preparation for Apache Hadoop workloads. To check it out or help contribute, you can find the code here. Cloudera Morphlines is a new open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started. A “morphline” is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data, and loads the results into a Hadoop component.
Processing Model
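
The transformation-chain idea described for morphlines can be pictured with a toy Python sketch. This is not Morphlines configuration syntax or its API; every name here (`parse_line`, `keep_errors`, `run_chain`) is hypothetical and only illustrates records passing through an ordered list of commands, any of which may drop a record.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, str]
# A command takes a record and returns a transformed record, or None to drop it.
Command = Callable[[Record], Optional[Record]]

def parse_line(record: Record) -> Optional[Record]:
    """Split a raw 'LEVEL message' line into fields; drop lines that do not fit."""
    parts = record["raw"].split(" ", 1)
    if len(parts) != 2:
        return None
    return {"level": parts[0], "message": parts[1]}

def keep_errors(record: Record) -> Optional[Record]:
    """Pass through only records whose level is ERROR."""
    return record if record["level"] == "ERROR" else None

def run_chain(records: Iterable[Record], commands: List[Command]) -> List[Record]:
    """Feed each record through the commands in order and collect the survivors."""
    loaded = []
    for record in records:
        current: Optional[Record] = record
        for command in commands:
            current = command(current)
            if current is None:
                break  # a command dropped this record
        if current is not None:
            loaded.append(current)
    return loaded

# Hypothetical input lines (illustrative values only).
source = [{"raw": "ERROR disk full"}, {"raw": "INFO started"}, {"raw": "garbage"}]
loaded = run_chain(source, [parse_line, keep_errors])
print(loaded)
```

In this sketch the final list stands in for the load step; a real pipeline would write the surviving records to a target system instead of collecting them in memory.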

Big data in minutes with the ELK Stack
We’ve built a data analysis and dashboarding infrastructure for one of our clients over the past few weeks. They collect about 10 million data points a day. Yes, that’s big data.

Data Warehousing and Business Intelligence
DW Books. Disappointed with the Google search results for “data warehousing books”, I try to put all the data warehousing books that I know on this page. It is totally understandable why Google’s search results don’t include ETL or Dimensional Modeling, for example. Same thing with Amazon, see Note 1 below. Even a data warehouse book as important as Inmon’s DW 2.0 was missed because the title doesn’t contain the word “Warehouse”.

Collecting Logs into Elasticsearch and S3
Elasticsearch is an open source distributed real-time search backend. While Elasticsearch can meet a lot of analytics needs, it is best complemented with other analytics backends like Hadoop and MPP databases. As a "staging area" for such complementary backends, AWS's S3 is a great fit.

Data Wrangler
UPDATE: The Stanford/Berkeley Wrangler research project is complete, and the software is no longer actively supported. Instead, we have started a commercial venture, Trifacta. For the most recent version of the tool, see the free Trifacta Wrangler. Why wrangle? Too much time is spent manipulating data just to get analysis and visualization tools to read it.

A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. by sergeykucherov Jul 15