Big data


Data resources

Extend, visualize and share data online.

Meet the combo powering Hadoop at Etsy, Airbnb and Climate Corp.

Hadoop doesn’t have to be so hard; just ask Etsy, Airbnb and the Climate Corporation.

All three, it turns out, are using the Cascading framework atop Amazon Web Services’ Elastic MapReduce service to make creating and running big data jobs simpler than is possible using Hadoop alone. Cascading is an open source Java framework that acts as an intermediary between users and Hadoop. Users create data workflows using Cascading’s Java-compatible APIs (rather than writing Hadoop MapReduce jobs), and it handles the task of making Hadoop process the data.
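As a rough illustration of the idea described above (the user declares a workflow, and a framework turns it into map and reduce work), here is a minimal Python sketch. It is not Cascading's API: the `Flow` class, its `each` and `group_and_aggregate` methods, and the `tokenize` helper are hypothetical names invented for this example, and everything runs in a single process rather than on Hadoop.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


class Flow:
    """Hypothetical workflow builder (an assumption, NOT Cascading's real API)."""

    def __init__(self, source: Iterable[str]) -> None:
        self.source = source
        self.map_fn: Callable[[str], Iterator[Tuple[str, int]]] = lambda r: iter(())
        self.reduce_fn: Callable[[List[int]], int] = sum

    def each(self, fn: Callable[[str], Iterator[Tuple[str, int]]]) -> "Flow":
        # Record the per-record step; it becomes the map phase.
        self.map_fn = fn
        return self

    def group_and_aggregate(self, fn: Callable[[List[int]], int]) -> "Flow":
        # Record the per-key step; it becomes the reduce phase.
        self.reduce_fn = fn
        return self

    def run(self) -> Dict[str, int]:
        # Execute the declared steps as map -> shuffle (group by key) -> reduce.
        groups: Dict[str, List[int]] = defaultdict(list)
        for record in self.source:
            for key, value in self.map_fn(record):
                groups[key].append(value)
        return {key: self.reduce_fn(values) for key, values in groups.items()}


def tokenize(line: str) -> Iterator[Tuple[str, int]]:
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.lower().split():
        yield word, 1


lines = ["Hadoop at Etsy", "Hadoop at Airbnb"]
counts = Flow(lines).each(tokenize).group_and_aggregate(sum).run()
print(counts)
```

The caller only describes the steps (`each`, then `group_and_aggregate`); the `run` method decides how to execute them, which is the division of labor the paragraph attributes to a framework sitting between users and Hadoop.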

Cascading is backed by a commercial entity called Concurrent (see disclosure), which is headed up by creator Chris Wensel, and is the foundation of several variations including Cascalog (a Clojure-based query language for Hadoop) and Scalding (Twitter’s Scala API for Hadoop). They’re hardly the only options for simplifying the Hadoop process, though.

SQLstream.

Data Science Toolkit.

Advanced Reporting & Analysis for Big Data.

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

(Yes it's a long title, since people kept asking me to write about this and that too :) I do when it has a point.)

While SQL databases are insanely useful tools, their monopoly of the last few decades is coming to an end. And it's about time: I can't even count the things that were forced into relational databases but never really fit them. (That being said, relational databases will always be the best for the stuff that has relations.) But the differences between NoSQL databases are much bigger than they ever were between one SQL database and another.

This means that software architects bear a bigger responsibility for choosing the appropriate one for a project right at the beginning. In this light, here is a comparison of Cassandra, MongoDB, CouchDB, Redis, Riak, Couchbase (ex-Membase), Hypertable, ElasticSearch, Accumulo, VoltDB, Kyoto Tycoon, Scalaris, OrientDB, Aerospike, Neo4j and HBase:

The most popular ones

Redis (V3.0RC)

For example: To store real-time stock prices.

MongoDB (2.6.7)

MongoDB.

Welcome to Apache™ Hadoop™!

Home - Apache Hive

The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage.

Built on top of Apache Hadoop™, it provides:

- Tools to enable easy data extract/transform/load (ETL)
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
- Query execution via MapReduce

Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).

Components of Hive include HCatalog and WebHCat.

Hadoop Download.

maui-indexer - Maui - Multi-purpose automatic topic indexing

Summary

Maui automatically identifies main topics in text documents.

Depending on the task, topics are tags, keywords, keyphrases, vocabulary terms, descriptors, index terms or titles of Wikipedia articles. Maui performs the following tasks:

- term assignment with a controlled vocabulary (or thesaurus)
- subject indexing
- topic indexing with terms from Wikipedia
- keyphrase extraction
- terminology extraction
- automatic tagging

It can also be used for terminology extraction and semi-automatic topic indexing.

Domain and language independence

Maui has been successfully tested on computer science, agricultural, medical, physics, biology and bioinformatics documents, as well as on blog posts and news articles. Examples are provided in Maui's Wiki pages.

Background

Maui has been developed by Olena Medelyan as a part of her PhD project, under supervision of Ian H.

I - RapidMiner.

Using Revolution R Enterprise With Apache Hadoop for 'Big Analytics'

Parallel Performance Without Parallel Complexity

Big Data drives optimum value when it yields fast insights.

Adopting MPP data warehouses or Hadoop clusters alone to store Big Data isn’t enough. As data grows, so do the complexity and computational workload of analyzing Big Data.

Big Data Analytics Cripples Legacy Tools

If you have searched for a way to easily scale analytics for EDWs and Hadoop using your legacy analytical tools, you’ve likely encountered some of these crippling issues:

- Complexity: Pioneering users have found that writing analytics in MapReduce or in SQL-based tools is a tedious, error-prone process.
- Inflexibility: Binding R scripts to SQL-based algorithms restricts functionality and constrains performance.
- Vendor Dependence: Using R-to-MapReduce translation solutions or R wrappers for SQL algorithms locks your analytical applications to specific platforms, complicating any future platform evolution.

Revolution Provides Analytical Scale That Is Also Easy to Use.

Chorus: Productivity engine for Data Science Teams.