background preloader


Facebook Twitter

Distributed Copy Using S3DistCp - Amazon Elastic MapReduce. This example loads Amazon CloudFront logs into HDFS by adding a step to a running cluster.

Distributed Copy Using S3DistCp - Amazon Elastic MapReduce

In the process it changes the compression format from Gzip (the CloudFront default) to LZO. This is useful because data compressed using LZO can be split into multiple maps as it is decompressed, so you don't have to wait until the compression is complete, as you do with Gzip. This provides better performance when you analyze the data using Amazon EMR. This example also improves performance by using the regular expression specified in the --groupBy option to combine all of the logs for a given hour into a single file. Amazon EMR clusters are more efficient when processing a few, large, LZO-compressed files than when processing many, small, Gzip-compressed files. To load Amazon CloudFront logs into HDFS, type the following command, replace j-3GYXXXXXX9IOK with your cluster ID, and replace mybucket with your Amazon S3 bucket name. How to Enable Consistent View - Amazon Elastic MapReduce. You can enable Amazon S3 server-side encryption or consistent view for EMRFS using the AWS Management Console, AWS CLI, or you can use a bootstrap action to configure additional settings for EMRFS.

How to Enable Consistent View - Amazon Elastic MapReduce

To configure consistent view using the console Choose Create Cluster. Navigate to the File System Configuration section.To enable Consistent view, choose Enabled. For EMRFS Metadata store, type the name of your metadata store. The default value is EmrFSMetadata. To launch a cluster with consistent view enabled using the AWS CLI. Apache Spark on Amazon EMR. My colleague Jon Fritz wrote the guest post below to introduce a powerful new feature for Amazon EMR. — Jeff; I’m happy to announce that Amazon EMR now supports Apache Spark.

Apache Spark on Amazon EMR

Amazon EMR is a web service that makes it easy for you to process and analyze vast amounts of data using applications in the Hadoop ecosystem, including Hive, Pig, HBase, Presto, Impala, and others. We’re delighted to officially add Spark to this list. Although many customers have previously been installing Spark using custom scripts, you can now launch an Amazon EMR cluster with Spark directly from the Amazon EMR Console, CLI, or API.

Apache Spark: Beyond Hadoop MapReduce We have seen great customer successes using Hadoop MapReduce for large scale data processing, batch reporting, ad hoc analysis on unstructured data, and machine learning. By using a directed acyclic graph (DAG) execution engine, Spark can create a more efficient query plan for data transformations. For the second step, add a Spark application step: Do Data Quality Checks using Apache Spark DataFrames. Apache Spark’s ability to support data quality checks via DataFrames is progressing rapidly.

Do Data Quality Checks using Apache Spark DataFrames

This post explains the state of the art and future possibilities. Apache Hadoop and Apache Spark make Big Data accessible and usable so we can easily find value, but that data has to be correct, first. This post will focus on this problem and how to solve it with Apache Spark 1.3 and Apache Spark 1.4 using DataFrames. (Note: although relatively new to Spark and thus not yet supported by Cloudera at the time of this writing, DataFrames are highly worthy of exploration and experimentation. Learn more about Cloudera’s support for Apache Spark here.) Why Spark 1.3 and 1.4?

The great thing about showing both implementations is that it emphasizes the fact that within only a couple of months, Spark has made data processing even more accessible and easier to explore—which makes me super excited about the future of Apache Spark. Experimenting with the Spark Shell and S3 read write. Word-count exercise with Spark on Amazon EMR – skipperkongen. This is a mini-workshop that shows you how to work with Spark on Amazon Elastic Map-Reduce; It’s a kind of hello world of Spark on EMR.

Word-count exercise with Spark on Amazon EMR – skipperkongen

We will solve a simple problem, namely use Spark and Amazon EMR to count the words in a text file stored in S3. To follow along you will need the following: Create some test data in S3 We will count the words in the U.S. constitution, more specifically count the words in a text file that I have found online. Step one is to upload this file to Amazon S3, so that the Spark cluster (created in next section) can access it. Download the file locally first: Amazon ec2 - How to read input from S3 in a Spark Streaming EC2 cluster application. Databricks Spark Reference Applications. S3 is Amazon Web Services's solution for storing large files in the cloud.

Databricks Spark Reference Applications

On a production system, you want your Amazon EC2 compute nodes on the same zone as your S3 files for speed as well as cost reasons. While S3 files can be read from other machines, it would take a long time and be expensive (Amazon S3 data transfer prices differ if you read data within AWS vs. to somewhere else on the internet). See running Spark on EC2 if you want to launch a Spark cluster on AWS - charges apply. If you choose to run this example with a local Spark cluster on your machine rather than EC2 compute nodes to read the files in S3, use a small data input source! Sign up for an Amazon Web Services Account.Load example log files to s3.Log into the AWS console for S3Create an S3 bucket.Upload a couple of example log files to that bucket.Your files will be at the path: your security credentials for AWS: Now, run passing in the s3n path to your files.