Performance and Optimization

Guidelines for Working with Tables - Amazon DynamoDB. This section covers some best practices for working with tables.

Design For Uniform Data Access Across Items In Your Tables

The optimal usage of a table's provisioned throughput depends on two factors: the primary key selection and the workload patterns on individual items. The primary key uniquely identifies each item in a table. The primary key can be simple (partition key) or composite (partition key and sort key). When it stores data, DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based upon the partition key value.

Total Provisioned Throughput / Partitions = Throughput Per Partition

Consequently, to achieve the full amount of request throughput you have provisioned for a table, keep your workload spread evenly across the partition key values. This does not mean that you must access all of the partition key values to achieve your throughput level; nor does it mean that the percentage of accessed partition key values needs to be high.
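The throughput-per-partition relation in the DynamoDB excerpt can be illustrated with a small Python sketch. It uses only the even-split formula quoted there and ignores anything else the service may do. The function names, the 1000-unit table, and the traffic shares are hypothetical values chosen for illustration.

```python
def throughput_per_partition(total_provisioned, partitions):
    """Total Provisioned Throughput / Partitions = Throughput Per Partition."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    return total_provisioned / partitions


def achievable_throughput(total_provisioned, traffic_shares):
    """Request rate the table can sustain before its busiest partition
    reaches the per-partition limit (simplified even-split model).

    traffic_shares[i] is the fraction of all requests hitting partition i.
    """
    per_partition = throughput_per_partition(total_provisioned, len(traffic_shares))
    # The busiest partition saturates first; it caps the whole workload.
    return min(total_provisioned, per_partition / max(traffic_shares))


# Hypothetical table: 1000 provisioned units spread over 4 partitions.
even = achievable_throughput(1000, [0.25, 0.25, 0.25, 0.25])
skewed = achievable_throughput(1000, [0.70, 0.10, 0.10, 0.10])
print(even, skewed)
```

Under these assumptions the evenly spread workload can use the full 1000 units, while the skewed one is capped at roughly 357 units, because the partition receiving 70% of the traffic reaches its 250-unit share first.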

The Simplest Explanation of and Approaches to Optimizing Spark Shuffles. Written by Bill Chambers on Saturday, 26-Sep-.

Motivation

If you're coming across this post on the internet, you've likely been using Spark and have been looking at how you can optimize Spark code. You might have wandered across the Spark Shuffle Internals Documentation written by Kay Ousterhout as well as a presentation or two about it. I wanted to write this all down to explain it in simple terms in order to aid your understanding. I'm going to gloss over some of the more implementation-y details and focus on the high level takeaways.

Deep Dive into Spark SQL’s Catalyst Optimizer. Spark SQL is one of the newest and most technically involved components of Spark. It powers both SQL queries and the new DataFrame API. At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer. We recently published a paper on Spark SQL that will appear in SIGMOD 2015 (co-authored with Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J.

To implement Spark SQL, we designed a new extensible optimizer, Catalyst, based on functional programming constructs in Scala. At its core, Catalyst contains a general library for representing trees and applying rules to manipulate them.

Trees

The main data type in Catalyst is a tree composed of node objects. As a simple example, suppose we have the following three node classes for a very simple expression language:

Rules

Applying this to the tree for x+(1+2) would yield the new tree x+3.
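The Catalyst excerpt above mentions three node classes and a rule that rewrites x+(1+2) into x+3, but the code itself did not survive. As a rough analogue only, here is a minimal Python sketch of a tree plus a constant-folding rule. This is not Catalyst's Scala API: the class names Literal, Attribute and Add, and the helpers transform and fold_constants, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    """A constant value (assumed node class)."""
    value: int


@dataclass(frozen=True)
class Attribute:
    """A named input such as x (assumed node class)."""
    name: str


@dataclass(frozen=True)
class Add:
    """Sum of two sub-expressions (assumed node class)."""
    left: "Node"
    right: "Node"


Node = Union[Literal, Attribute, Add]


def transform(tree, rule):
    """Apply rule bottom-up: rewrite the children first, then the node.

    rule returns a replacement node, or None when it does not match.
    """
    if isinstance(tree, Add):
        tree = Add(transform(tree.left, rule), transform(tree.right, rule))
    replacement = rule(tree)
    return tree if replacement is None else replacement


def fold_constants(node):
    """Rewrite Add(Literal(a), Literal(b)) into Literal(a + b)."""
    if (isinstance(node, Add)
            and isinstance(node.left, Literal)
            and isinstance(node.right, Literal)):
        return Literal(node.left.value + node.right.value)
    return None


# x + (1 + 2)
expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))
folded = transform(expr, fold_constants)
print(folded)
```

The rule only matches an Add whose operands are both literals, so the attribute x is left untouched and only the inner 1+2 collapses. The bottom-up traversal is a choice made for this sketch so that nested constant sums fold in a single pass.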

Tune Your Apache Spark Jobs (Part 1) - Cloudera Engineering Blog

When you write Apache Spark code and page through the public APIs, you come across words like transformation, action, and RDD. Understanding Spark at this level is vital for writing Spark programs. Similarly, when things start to fail, or when you venture into the web UI to try to understand why your application is taking so long, you’re confronted with a new vocabulary of words like job, stage, and task.

Optimization - spark.mllib - Spark 1.6.0 Documentation

Mathematical description

Gradient descent

The simplest method to solve optimization problems of the form $\min_{w \in \mathbb{R}^d} f(w)$ is gradient descent. Such first-order optimization methods (including gradient descent and stochastic variants thereof) are well-suited for large-scale and distributed computation. Gradient descent methods aim to find a local minimum of a function by iteratively taking steps in the direction of steepest descent, which is the negative of the derivative (called the gradient) of the function at the current point, i.e., at the current parameter value. If the objective function f is not differentiable at all arguments, but still convex, then a sub-gradient is the natural generalization of the gradient, and assumes the role of the step direction.
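The gradient descent iteration described above can be sketched in a few lines of plain Python. This is not the spark.mllib API. The fixed step size, the iteration count, and the quadratic example objective are all assumptions chosen so that the loop is easy to follow.

```python
def gradient_descent(grad, w0, step_size=0.1, iterations=100):
    """Repeatedly step in the direction of the negative gradient.

    grad(w) returns the gradient of the objective at w as a list of floats.
    """
    w = list(w0)
    for _ in range(iterations):
        g = grad(w)
        # Move each coordinate against its partial derivative.
        w = [wi - step_size * gi for wi, gi in zip(w, g)]
    return w


def grad_f(w):
    """Gradient of the example objective f(w) = (w[0] - 3)^2 + (w[1] + 1)^2."""
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]


# The example objective has its minimum at w = (3, -1).
w_min = gradient_descent(grad_f, [0.0, 0.0])
print(w_min)
```

For this convex quadratic each step shrinks the distance to the minimizer by a constant factor, so the iterate ends up very close to (3, -1). For a non-differentiable convex objective, grad would return a sub-gradient instead, as the text notes.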

Stochastic gradient descent (SGD)

Optimization problems whose objective function f is written as a sum are particularly suitable to be solved using stochastic gradient descent (SGD).

Tuning Spark - Spark 1.2.0 Documentation

Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked by any resource in the cluster: CPU, network bandwidth, or memory.

Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes you also need to do some tuning, such as storing RDDs in serialized form, to decrease memory usage. This guide will cover two main topics: data serialization, which is crucial for good network performance and can also reduce memory use, and memory tuning. We also sketch several smaller topics.

Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation.

Java serialization: By default, Spark serializes objects using Java’s ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable.