
Distributed SQL Query Engine for Big Data

Zeppelin

(Optional) Create Bootstrap Actions to Install Additional Software - Amazon Elastic MapReduce You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can create custom bootstrap actions or use the predefined bootstrap actions provided by Amazon EMR. A common use of bootstrap actions is to change Hadoop configuration settings. Bootstrap actions execute as the Hadoop user by default. All Amazon EMR management interfaces support bootstrap actions: from the Amazon EMR console, you can optionally specify a bootstrap action while creating a cluster, and from the CLI, you can pass references to bootstrap action scripts to Amazon EMR by adding the --bootstrap-actions parameter to the create-cluster command.
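As a sketch of that CLI usage (the bucket, script name, cluster name, and instance settings below are placeholders, not from the source):

```shell
# Launch an EMR cluster and run a custom bootstrap action on every node
# before Hadoop starts. s3://my-bucket/my-bootstrap.sh is a placeholder
# for your own script stored in Amazon S3.
aws emr create-cluster \
  --name "MyCluster" \
  --release-label emr-4.0.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --bootstrap-actions Path=s3://my-bucket/my-bootstrap.sh,Name=MyBootstrapAction
```

The script runs as the Hadoop user by default; if it needs root privileges, the usual pattern is to invoke its commands with sudo inside the script.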

awslabs/emr-bootstrap-actions Installing Driven Beta on an Amazon EMR Master Node If you run your applications in an Amazon Web Services Elastic MapReduce (AWS EMR) cluster, use a bootstrap action to install the plugin. Bootstrapping works with both persistent and auto-terminating EMR clusters, and you can bootstrap by using either the AWS command-line interface (CLI) or the AWS Management Console. Bootstrapping installs the plugin on the EMR master node so that Cascading applications launched from the AWS CLI or the Management Console automatically operate with the plugin. Amazon Web Services Command-Line Interface The following code example shows you how to bootstrap Driven to an EMR cluster if you want to use the AWS CLI; note the argument "--api-key,${DRIVEN_API_KEY}". Amazon Web Services Management Console To bootstrap the Driven Plugin using the AWS Management Console:
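A sketch of what that CLI call could look like — only the "--api-key,${DRIVEN_API_KEY}" argument comes from the source; the S3 path, script name, and cluster name are placeholders:

```shell
# DRIVEN_API_KEY must be exported in the calling shell beforehand.
# s3://my-bucket/driven/install-driven.sh is a placeholder path to the
# Driven bootstrap script; Args passes arguments through to that script.
aws emr create-cluster \
  --name "DrivenCluster" \
  --bootstrap-actions \
    Path=s3://my-bucket/driven/install-driven.sh,Name=InstallDriven,Args=["--api-key,${DRIVEN_API_KEY}"]
```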

mrjob — mrjob v0.4.5 documentation mrjob lets you write MapReduce jobs in Python 2.6+/3.3+ and run them on several platforms. You can: write multi-step MapReduce jobs in pure Python, test on your local machine, run on a Hadoop cluster, run in the cloud using Amazon Elastic MapReduce (EMR) or Google Cloud Dataproc (Dataproc), and easily run Spark jobs on EMR or your own Hadoop cluster. mrjob is licensed under the Apache License, Version 2.0. To get started, install with pip: pip install mrjob and begin reading the tutorial.

Apache Drill - Schema-free SQL for Hadoop, NoSQL and Cloud Storage

Run Spark and Spark SQL on Amazon Elastic MapReduce With the proliferation of data today, a common scenario is the need to store large data sets, process that data iteratively, and discover insights using low-latency relational queries. Using the Hadoop Distributed File System (HDFS) and Hadoop MapReduce components in Apache Hadoop, these workloads can be distributed over a cluster of computers. By distributing the data and processing over many computers, your results return quickly even over large datasets, because multiple computers share the processing load. Apache Spark, an open-source cluster computing system optimized for speed, can provide much faster performance and flexibility than Hadoop MapReduce. For those who don't want to use Scala or Python to process data with Spark, Spark SQL allows queries expressed in SQL and HiveQL, and can query data from a SchemaRDD or a table in the Hive metastore. The following diagram illustrates running Spark on a Hadoop cluster managed by Amazon EMR.

emr-bootstrap-actions/spark at master · awslabs/emr-bootstrap-actions
