
Hadoop


Use H2O and data.table to build models on large data sets in R

Last week, I wrote an introductory article on the package data.table. It was intended to give you a head start and help you become familiar with its unique, concise syntax. The next obvious step is to focus on modeling, which is what we will do in this post.
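As a rough illustration of the modeling step, here is a minimal sketch using the H2O Python API (the article itself works in R; the file name, column names, and model settings below are placeholder assumptions):

```python
# Minimal sketch: train a model on a large CSV with H2O.
# "train.csv" and the column names are hypothetical placeholders.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()  # start or connect to a local H2O cluster

# H2O parses the file in its own cluster memory, outside Python
train = h2o.import_file("train.csv")

target = "churn"  # hypothetical binary response column
features = [c for c in train.columns if c != target]
train[target] = train[target].asfactor()  # treat as classification

model = H2OGradientBoostingEstimator(ntrees=100, max_depth=5)
model.train(x=features, y=target, training_frame=train)

print(model.auc(train=True))
```

The point of the pairing is that data.table (or, here, H2O's own parser) handles fast data preparation while the heavy model fitting runs in H2O's engine rather than in R or Python memory.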

PMML and Hadoop

A reference to Hadoop implies a huge amount of data. The intent, of course, is to derive insights that will help businesses stay competitive. "Scoring" the data is a common exercise in determining, for example, customer churn, fraud, or risk. It is one of the slowest analytics activities, especially when a very large data set is involved. There are various fast scoring products on the market, but they are highly specialized and/or provided by a single vendor, usually requiring the entire scoring process to be done with that vendor's tool set. This poses a problem for those who build their scoring model using tools other than those of the scoring-engine vendor.
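To make the vendor lock-in point concrete, here is a minimal sketch of vendor-neutral scoring against an exported PMML file, using the pypmml package as one possible open evaluator (the model file and input fields are hypothetical assumptions; pypmml runs a Java evaluator via py4j, so a JVM is required):

```python
# Minimal sketch: score a record against a PMML model exported
# from any PMML-capable tool. File and field names are hypothetical.
from pypmml import Model

model = Model.load("churn_model.pmml")

record = {"tenure_months": 14, "monthly_charges": 73.5}  # placeholder fields
print(model.predict(record))
```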

Stinger and Tez: A primer - Big Data Analytics News

What is Stinger? The Stinger initiative aims to redesign Hive into what people want today: Hive is currently used for large batch jobs and works great in that sense, but people also want interactive queries, and for those Hive is too slow today. So a big driver is performance, with the goal of making Hive 100x faster.

Apache Hadoop YARN: Present and Future

Hadoop Tutorial: Intro to HDFS

Upgrading Hortonworks HDP from 2.0 to 2.1

Today I upgraded my personal HDP cluster from version 2.0 to version 2.1. The cluster runs entirely in a CentOS 6 VM on my notebook, so it consists of just one node hosting the namenode, the datanode, and all other services.

Connecting SAP DataServices to Hadoop: HDFS vs....

SAP DataServices (DS) supports two ways of accessing data on a Hadoop cluster. The first is HDFS: DS reads HDFS files directly from Hadoop. In DataServices you need to create HDFS file formats in order to use this setup. Depending on your dataflow, DataServices might read the HDFS file directly into the DS engine and handle the further processing there. If your dataflow contains more logic that could be pushed down to Hadoop, DS may instead generate a Pig script; the script will then not just read the HDFS file but also handle the other transformations, aggregations, etc. from your dataflow. The latter scenario is usually the preferred setup for large amounts of data, because the Hadoop cluster can then provide the processing power of many nodes on inexpensive commodity hardware.
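For a rough feel of the difference, here is a minimal sketch of the first path, reading an HDFS file directly into the client process over WebHDFS with the hdfs Python package (namenode address and file path are placeholder assumptions); the pushdown path would instead ship the whole transformation to the cluster as a Pig job:

```python
# Minimal sketch: read an HDFS file directly into the client,
# analogous to DS pulling the file into its own engine.
# Namenode address and file path are hypothetical.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870")  # WebHDFS endpoint

with client.read("/data/sales.csv", encoding="utf-8") as reader:
    data = reader.read()  # every byte crosses the network to this process
```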

Connecting SAP DataServices to Hadoop Hive

Connecting SAP DataServices to Hadoop Hive is not as simple as connecting to a relational database, for example. In this post I want to share my experiences of connecting DataServices (DS) to Hive. The DS engine cannot connect to Hive directly; instead, you need to configure a Hive adapter from the DS management console, which then actually manages the connection to Hive. In the rest of this post I will assume the following setup.

DataServices Text Analysis and Hadoop - the Det...

I have already used the text analysis feature within SAP DataServices in various projects (the transform in DataServices is called Text Data Processing, or TDP for short). Usually the TDP transform runs in the DataServices engine, meaning that DataServices first loads the source text into its own memory and then runs the text analysis on its own server / engines. The text sources are usually unstructured text or binary files such as Word, Excel, or PDF files.

Business Intelligence: An Integration of Apache

Last week was very busy for attendees at SAP TechEd Las Vegas, so fortunately SAP has made recordings of some sessions available. Today I watched An Integration of Apache Hadoop, SAP HANA, and SAP BusinessObjects (session EA204) with SAP's Anthony Waite. Text analysis came up a few times last week, and I am familiar with the Data Services Text Analysis features. First, a review of how text mining fits in. Figure 1 (source: SAP) shows that we have "lots of unstructured data", with 80% of data unstructured. Unstructured data gets messy: think of an MS Word file; can you run that through your system or process? The hot topic is social networks and sentiment analysis; customer preferences, for example, can be mined.

Don't use Hadoop - your data isn't that big - Chris Stucchio

"So, how much experience do you have with Big Data and Hadoop?" they asked me. I told them that I use Hadoop all the time, but rarely for jobs larger than a few TB. I'm basically a big data neophyte: I know the concepts and I've written code, but never at scale.

To Hadoop or Not to Hadoop?

Hadoop is very popular, but it is not a solution for all Big Data cases. Here are the questions to ask to determine whether Hadoop is right for your problem. Guest blog by Anand Krishnaswamy, ThoughtWorks, Oct 4, 2013.

Busting 10 myths about Hadoop

Although Hadoop and related technologies have been with us for more than five years now, most BI professionals and their business counterparts still harbor a few misconceptions about Hadoop and related technologies such as MapReduce. The following list of 10 facts clarifies what Hadoop is and does relative to BI/DW, as well as the business and technology situations in which Hadoop-based business intelligence (BI), data warehousing (DW), data integration (DI), and analytics can be useful. Fact No. 1: Hadoop consists of multiple products. We talk about Hadoop as if it were one monolithic thing, but it is actually a family of open-source products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop products are also available via vendor distributions; more on that later.) The Apache Hadoop library includes (in BI priority order): the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, HCatalog, Ambari, Mahout, Flume, and so on.

Ad-hoc query on Hadoop (TDWI_Infobrigth)

GettingStarted - Apache Hive

Table of Contents. Installation and Configuration. You can install a stable release of Hive by downloading a tarball, or you can download the source code and build Hive from that. Requirements.
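Once Hive is installed and HiveServer2 is running, queries can be issued from client code as well as from the Hive shell. Here is a minimal sketch using the PyHive package (host, port, username, and the table name are placeholder assumptions):

```python
# Minimal sketch: run a HiveQL query through HiveServer2.
# Host, port, username, and table name are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="etl")
cursor = conn.cursor()

# The aggregation runs on the cluster; only results come back
cursor.execute("SELECT category, COUNT(*) FROM sales GROUP BY category")
for category, cnt in cursor.fetchall():
    print(category, cnt)

conn.close()
```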