
Hadoop

Run Spark and Shark on Amazon Elastic MapReduce : Articles & Tutorials

A common business scenario is the need to store and query large data sets. You can do this by running a data warehouse on a cluster of computers. By distributing the data over many computers, you return results quickly because the computers share the load of processing the query. One limitation on the speed at which queries can be returned, however, is the time it takes to retrieve the data from disk. You can increase the speed of queries returned from a data warehouse by using the Shark data warehouse system.
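Shark (HiveQL on Spark) has since been superseded by Spark SQL, so as a rough sketch of the same idea, keeping warehouse data in cluster memory so repeated SQL queries avoid disk reads, here is a present-day PySpark equivalent rather than Shark itself. The S3 path and column names are hypothetical.

```python
# Minimal PySpark sketch of the in-memory SQL idea behind Shark.
# The S3 path and columns (region, amount) are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse-queries").getOrCreate()

sales = spark.read.csv("s3://my-bucket/warehouse/sales/", header=True, inferSchema=True)
sales.cache()                            # pin the table in memory across queries
sales.createOrReplaceTempView("sales")

spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").show()
```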

Analyze Log Data with Apache Hive, Windows PowerShell, and Amazon EMR : Articles & Tutorials

The example used in this tutorial is known as contextual advertising, and it is one example of what you can do with Amazon Elastic MapReduce (Amazon EMR). It is an adaptation of an earlier article that used the Amazon EMR Command Line Interface (CLI) and the AWS Management Console instead of Windows PowerShell. Storing logs on Amazon S3: an ad server produces two types of log files, impression logs and click logs. Every time the server displays an advertisement to a customer, it adds an entry to the impression log. Every time a customer clicks on an advertisement, it adds an entry to the click log.
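The article drives the upload step with Windows PowerShell; purely as a hedged sketch of the same "push rotated logs to S3" step in Python with boto3, assuming a hypothetical bucket and a date/hour key layout that Hive can later treat as partitions:

```python
# Hedged sketch: upload a rotated ad-server log file to S3 under a
# date/hour-partitioned prefix. Bucket name and key layout are hypothetical.
import datetime
import boto3

s3 = boto3.client("s3")

def push_log(local_path: str, log_type: str, bucket: str = "my-ad-logs") -> str:
    now = datetime.datetime.utcnow()
    key = f"{log_type}/dt={now:%Y-%m-%d}/hour={now:%H}/{now:%Y%m%d%H%M%S}.log"
    s3.upload_file(local_path, bucket, key)   # log_type is "impression" or "click"
    return key

# Called by the ad server's log-rotation job, for example:
# push_log("/var/log/adserver/impressions.log", "impression")
# push_log("/var/log/adserver/clicks.log", "click")
```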

Parse Big Data with Informatica's HParser on Amazon EMR : Articles & Tutorials

Informatica's HParser is a tool you can use to extract data stored in heterogeneous formats and convert it into a form that is easy to process and analyze. For example, if your company has legacy stock trading information stored in custom-formatted text files, you could use HParser to read the text files and extract the relevant data as XML.
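HParser defines its transformations in its own tooling rather than in code, so the following is only an illustrative sketch of the kind of conversion described: turning made-up fixed-width trade records into XML with the Python standard library. The record layout is an assumption.

```python
# Illustrative only: convert hypothetical fixed-width trade records to XML.
# Layout assumed here: ticker (10 chars), quantity (10 chars), price (10 chars).
import xml.etree.ElementTree as ET

def trades_to_xml(lines):
    root = ET.Element("trades")
    for line in lines:
        trade = ET.SubElement(root, "trade")
        ET.SubElement(trade, "ticker").text = line[0:10].strip()
        ET.SubElement(trade, "quantity").text = line[10:20].strip()
        ET.SubElement(trade, "price").text = line[20:30].strip()
    return ET.tostring(root, encoding="unicode")

sample = "AAPL".ljust(10) + "100".rjust(10) + "185.20".rjust(10)
print(trades_to_xml([sample]))
```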

Contextual Advertising using Apache Hive and Amazon EMR : Articles & Tutorials

Storing logs on Amazon S3: the ad serving machines produce two types of log files, impression logs and click logs. Every time we display an advertisement to a customer, we add an entry to the impression log. Every time a customer clicks on an advertisement, we add an entry to the click log. Every five minutes the ad serving machines push a log file containing the latest set of logs to Amazon S3.
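The tutorial ultimately joins these two logs with Hive over the S3 data; as a small local sketch of the same join, assuming hypothetical CSV logs that share an ad_id column, one could compute click-through rate per ad like this:

```python
# Hedged sketch: join impression and click logs on a hypothetical ad_id field
# and compute click-through rate per ad. The tutorial does the equivalent at
# scale with Hive tables defined over the S3 log prefixes.
import csv
from collections import Counter

def click_through_rate(impression_path, click_path):
    impressions, clicks = Counter(), Counter()
    with open(impression_path, newline="") as f:
        for row in csv.DictReader(f):
            impressions[row["ad_id"]] += 1
    with open(click_path, newline="") as f:
        for row in csv.DictReader(f):
            clicks[row["ad_id"]] += 1
    return {ad: clicks[ad] / shown for ad, shown in impressions.items() if shown}

# e.g. click_through_rate("impressions.csv", "clicks.csv")
```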

Node.js Streaming MapReduce with Amazon EMR : Articles & Tutorials

Introduction: Node.js is a JavaScript framework for running high-performance server-side applications based upon non-blocking I/O and an asynchronous, event-driven processing model. When customers need to process large volumes of complex data, Node.js offers a runtime that natively supports the JSON data structure.
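Hadoop Streaming is language-agnostic: any program that reads stdin and writes tab-separated key/value pairs can serve as the mapper or reducer. The article implements that contract in Node.js; here is the same contract sketched in Python, grouping JSON records by a hypothetical "type" field.

```python
#!/usr/bin/env python3
# Hedged sketch of the Hadoop Streaming contract the article implements in
# Node.js: the mapper reads raw JSON lines from stdin and emits "key\t1";
# the reducer receives mapper output sorted by key and sums the counts.
# The "type" field of the input records is an assumption.
import json
import sys

def run_mapper():
    for line in sys.stdin:
        try:
            event = json.loads(line)
        except ValueError:
            continue                      # skip malformed records
        print(f"{event.get('type', 'unknown')}\t1")

def run_reducer():
    current, total = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = key
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    run_mapper() if sys.argv[1:2] == ["map"] else run_reducer()
```

In a streaming job this script would be passed to hadoop-streaming.jar via the -mapper and -reducer options, just as the article does with its Node.js scripts.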

Papers

Using Avro in MapReduce jobs with Hadoop, Pig, Hive

Apache Avro is a very popular data serialization format in the Hadoop technology stack. In this article I show code examples of MapReduce jobs in Java, Hadoop Streaming, Pig, and Hive that read and/or write data in Avro format.
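Alongside the article's Java, Streaming, Pig, and Hive examples, a small Python sketch using the fastavro library (not covered in the article; schema and records made up here) shows the basic write/read round trip against an Avro schema:

```python
# Hedged sketch: write and read Avro records from Python with fastavro.
# The User schema and the records are illustrative only.
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "alice", "age": 31}, {"name": "bob", "age": 40}]

with open("users.avro", "wb") as out:
    writer(out, schema, records)

with open("users.avro", "rb") as fo:
    for rec in reader(fo):
        print(rec)
```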

AvroSerDe - Apache Hive

The AvroSerde is available in Hive 0.9.1 and greater. Overview - Working with Avro from Hive: the AvroSerde allows users to read or write Avro data as Hive tables. The AvroSerde:

- Infers the schema of the Hive table from the Avro schema.
- Reads all Avro files within a table against a specified schema, taking advantage of Avro's backwards compatibility.
- Supports arbitrarily nested schemas.
- Translates all Avro data types into equivalent Hive types.

For general information about SerDes, see Hive SerDe in the Developer Guide. Requirements.

Miguno/avro-hadoop-starter
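To make the AvroSerde description above concrete, here is a hedged sketch of the corresponding DDL, driven from Python by shelling out to a local Hive CLI. The table name, the avro.schema.url location, and the presence of a "hive" binary on PATH are all assumptions; newer Hive releases (0.14+) also accept the shorter STORED AS AVRO form.

```python
# Hedged sketch: create an Avro-backed Hive table with the AvroSerde classes
# described above, via "hive -e". Table name and schema URL are hypothetical.
import subprocess

DDL = """
CREATE TABLE IF NOT EXISTS episodes
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/episodes.avsc');
"""

subprocess.run(["hive", "-e", DDL], check=True)
```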

Data Interoperability with Apache Avro

The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components. Data collected by Flume might be analyzed by Pig and Hive scripts. Data imported with Sqoop might be processed by a MapReduce program. To facilitate these and other scenarios, data produced by each component must be readily consumed by other components. One might address this data interoperability in a variety of manners, including the following: each system might be extended to read all the formats generated by the other systems. In practice, all of these strategies will be used to some extent.

Rcongiu/Hive-JSON-Serde

Pittsburgh nosql _ mapreduce by bearrito

Snowplow/snowplow

Hadoop Hive Tutorial – Zero to Results in 15 Minutes

This tutorial will walk through using Hadoop's Hive to access data stored in HDFS. This is the first tutorial (more to follow) and covers installing Hadoop as part of BigSQL and running some simple queries. The platform: Linux (RHEL, CentOS, or Ubuntu) or Mac OS X. The Hadoop and Postgres install was from www.BigSQL.org; it takes about 10-15 minutes to install the developer bundle and get up and running (don't skip the prerequisites). The data: the original source data is from the NYC Department of Finance and contains Rolling Sales files for the last 12 months in New York City. If you have not previously installed BigSQL, download the bundle and untar the file.
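As a hedged sketch of the first queries such a tutorial runs, here is an external Hive table over a copy of the Rolling Sales data plus a simple aggregate, driven from Python via the Hive CLI. The column subset, delimiter, and HDFS location are assumptions, and the "hive" CLI from the BigSQL bundle is assumed to be on PATH.

```python
# Hedged sketch: define an external Hive table over a hypothetical HDFS copy
# of the NYC Rolling Sales CSV and run a first aggregate query.
import subprocess

STATEMENTS = """
CREATE EXTERNAL TABLE IF NOT EXISTS nyc_sales (
    borough STRING,
    neighborhood STRING,
    sale_price BIGINT,
    sale_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/data/nyc_rolling_sales';

SELECT borough, COUNT(*) AS sales, AVG(sale_price) AS avg_price
FROM nyc_sales
GROUP BY borough;
"""

subprocess.run(["hive", "-e", STATEMENTS], check=True)
```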

Using DynamoDB with Amazon Elastic MapReduce : Articles & Tutorials

Apache Hadoop and NoSQL databases are complementary technologies that together provide a powerful toolbox for managing, analyzing, and monetizing Big Data. That's why we were so excited to provide out-of-the-box Amazon Elastic MapReduce (Amazon EMR) integration with Amazon DynamoDB, giving customers an integrated solution that eliminates the often prohibitive costs of administration, maintenance, and upfront hardware. Customers can now move vast amounts of data into and out of DynamoDB, as well as perform sophisticated analytics on that data, using EMR's highly parallelized environment to distribute the work across the number of servers of their choice. Further, as EMR uses a SQL-based engine for Hadoop called Hive, you need only know basic SQL while we handle distributed application complexities such as estimating ideal data splits based on hash keys, pushing appropriate filters down to DynamoDB, and distributing tasks across all the instances in your EMR cluster.
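A hedged sketch of wiring this up: a Hive script stored in S3 maps a table onto DynamoDB with EMR's storage handler, and boto3 submits it as a step to an existing cluster. The cluster id, S3 script path, table names, and column mapping below are placeholders.

```python
# Hedged sketch: submit a Hive step to an existing EMR cluster with boto3.
# The referenced Hive script would map a Hive table onto DynamoDB with EMR's
# storage handler, for example:
#
#   CREATE EXTERNAL TABLE ddb_orders (order_id string, total double)
#   STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
#   TBLPROPERTIES ("dynamodb.table.name" = "Orders",
#                  "dynamodb.column.mapping" = "order_id:OrderId,total:Total");
#
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # existing EMR cluster id (placeholder)
    Steps=[{
        "Name": "Hive analytics over DynamoDB",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://my-bucket/queries/ddb_report.q"],
        },
    }],
)
```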

Pig Queries Parsing JSON on Amazons Elastic Map Reduce Using S3 Data

I know the title of this post is a mouthful, but it's the fun of pushing the envelope of existing technologies.

Enron-avro/enron.pig at master · rjurney/enron-avro

Hadoop Tutorial

Introduction: Welcome to the Yahoo! Hadoop tutorial! This series of tutorial documents will walk you through many aspects of the Apache Hadoop system. You will be shown how to set up simple and advanced cluster configurations, use the distributed file system, and develop complex Hadoop MapReduce applications. Other related systems are also reviewed. Goals for this module: understand the scope of problems applicable to Hadoop, and understand how Hadoop addresses these problems differently from other distributed systems. Outline: Problem Scope.
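For the "develop MapReduce applications" part of that tutorial, the canonical first example is a word count. Here is a hedged Python sketch using the mrjob library rather than Hadoop's native Java MapReduce API; the file and input names are arbitrary.

```python
# Hedged sketch: the canonical word-count MapReduce job, written with mrjob.
# Run locally with: python word_count.py input.txt
import re

from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```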

Hadoop Tutorial

Introduction: Hadoop by itself allows you to store and process very large volumes of data. However, building a large-scale distributed system can require functionality not provided by this base. Several other tools and systems have been created to fill the gaps and deliver a more full-featured set of distributed systems engineering tools.

AWS Redshift: How Amazon Changed The Game – AK Tech Blog

Edit: Thank you to Curt Monash, who points out that Netezza is available for as little as $20k/TB/year with hardware (and 2.25x compression) and that there is an inconsistency in my early price estimates and the fraction I quote in my conclusion. I've incorporated his observations into my corrections below. I've also changed a sentence in the conclusion to make the point that the $5k/TB/year TCO number is the effective TCO, given that a Redshift cluster that can perform these queries at the desired speed has far more storage than is needed to just hold the tables for the workloads I tested. Author's Note: I'll preface this post with a warning: some of the content will be inflammatory if you go into it with the mindset that I'm trying to sell you on an alternative to Hadoop.

I’m not. I’m here to talk about how an MPP system blew us away, as jaded as we are, and how it is a sign of things to come. Our first question was: “How much are we willing to spend?” Objective.