
Cascading


PMML 4.1 - General Structure

PMML uses XML to represent mining models.
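Because a PMML document is plain XML with a PMML root element, any XML parser can inspect it. The following is a minimal sketch in Python, not an excerpt from the specification: the sample document is hand-written for illustration, its field names and model name are hypothetical, the empty TreeModel element is only a placeholder, and the document has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by PMML 4.1 documents.
PMML_NS = "http://www.dmg.org/PMML-4_1"

# Hypothetical, hand-written sample; the empty TreeModel is a placeholder
# and would not pass schema validation.
SAMPLE = """\
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="hypothetical example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="churn" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="hypothetical_tree" functionName="classification"/>
</PMML>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.split("}", 1)[1] if "}" in tag else tag


def summarize(pmml_text):
    """Return the version, top-level elements, and field names of a PMML document."""
    root = ET.fromstring(pmml_text)
    if root.tag != f"{{{PMML_NS}}}PMML":
        raise ValueError("root element is not a PMML 4.1 <PMML> element")
    return {
        "version": root.get("version"),
        "children": [local_name(child.tag) for child in root],
        "fields": [f.get("name") for f in root.iter(f"{{{PMML_NS}}}DataField")],
    }


summary = summarize(SAMPLE)
print(summary)
```

The sketch only reads the top-level layout; scoring a model would need a PMML-aware library or a tool such as Cascading Pattern.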

The structure of the models is described by an XML Schema. A PMML document can contain one or more mining models. It is an XML document whose root element is of type PMML. The general structure of a PMML document is:

Cascading

Cascading Pattern is an extension to Cascading that provides various machine learning scoring algorithms and a utility for translating Predictive Model Markup Language (PMML) documents into applications on Apache Hadoop.

Now you can deploy predictive models onto Hadoop, or use the Cascading Pattern Java API to deploy your models or sophisticated ensembles.

Pattern Benefits

Quickly deploy machine scoring applications at scale on Apache Hadoop in as little as 4 lines of code
Leverage existing intellectual property in predictive models, and investments in predictive modeling tooling and core competencies
Accelerate application development and testing
Unlock accessibility to Hadoop

Square/cascading-helpers.

BloomJoin: BloomFilter + CoGroup

We recently open-sourced a number of internal tools we've built to help our engineers write high-performance Cascading code as the cascading_ext project.

Today I'm going to talk about a tool we use to improve the performance of asymmetric joins: joins where one data set contains significantly more records than the other, or where many of the records in the larger set don't share a common key with the smaller set.

Asymmetric Joins

A common MapReduce use case for us is joining a large dataset with a global set of records against a smaller one. For example, we have a store with billions of transactions keyed by user ID, and want to find all transactions by users who were seen within the last 24 hours.
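The idea suggested by the title (a Bloom filter in front of a regular join) can be sketched in a few lines of Python. This is an in-memory illustration only, not the cascading_ext implementation: the BloomFilter class, the bloom_join function, the filter sizes, and the sample records are all assumptions made for the example. The filter is built from the keys of the small side, the large side is pre-filtered against it, and an exact join then discards any Bloom-filter false positives.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Tiny Bloom filter: num_hashes probes into a num_bits-bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all probe positions from one SHA-256 digest.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def bloom_join(large, small):
    """Inner-join two iterables of (key, value) pairs.

    Returns the joined rows and the number of large-side rows that
    survived the Bloom pre-filter.
    """
    bloom = BloomFilter()
    small_by_key = defaultdict(list)
    for key, value in small:
        bloom.add(key)
        small_by_key[key].append(value)

    # Step 1: cheap pre-filter. May keep a few false positives,
    # but never drops a row whose key is on the small side.
    candidates = [(k, v) for k, v in large if k in bloom]

    # Step 2: exact join on the survivors (stands in for the CoGroup step).
    joined = [(k, lv, sv)
              for k, lv in candidates
              for sv in small_by_key.get(k, [])]
    return joined, len(candidates)


# Hypothetical sample data: transactions keyed by user ID, and recently seen users.
transactions = [(1, "tx-a"), (2, "tx-b"), (1, "tx-c"), (3, "tx-d")]
recent_users = [(1, "seen 09:00"), (3, "seen 17:30")]

joined, kept = bloom_join(transactions, recent_users)
for row in joined:
    print(row)
```

In a MapReduce setting the benefit would come from step 1 running before the shuffle, so that most non-matching records of the large data set are never sent to the reducers. The sketch above only shows the logic, not that distribution.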

Cascading/maple. Application Platform for Enterprise Big Data. Twitter/scalding. LiveRamp/cascading_ext.