
Welcome to Apache Pig!

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At present, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, whose key properties include ease of programming: it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks.
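
To make the "ease of programming" claim concrete, here is a minimal sketch of the classic word-count task, written as Pig Latin statements embedded in Java through Pig's PigServer API. The file names, the TextLoader input format, and the local execution mode are illustrative assumptions, not details from the text above.

    // Hypothetical word count: Pig Latin statements driven from Java.
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class WordCountPig {
        public static void main(String[] args) throws Exception {
            // Local mode for experimenting; ExecType.MAPREDUCE targets a Hadoop cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);
            pig.registerQuery("lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);");
            pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
            pig.store("counts", "wordcount-out"); // writes (word, count) records
        }
    }

Each registerQuery call only builds up a logical plan; the final store is what triggers compilation into one or more Map-Reduce jobs, which is the infrastructure layer the description above refers to.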

http://pig.apache.org/


Research Publication: Sawzall. "Interpreting the Data: Parallel Analysis with Sawzall," Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan. Abstract: Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories.

Running Hadoop On Ubuntu Linux (Single-Node Cluster) @ Michael G. Noll. In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu Linux. Hadoop is a framework written in Java for running applications on large clusters of commodity hardware, and it incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.
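
A single-node setup like the one described above is typically verified by running Hadoop's stock WordCount example. The sketch below is condensed from that standard example using the org.apache.hadoop.mapreduce API; the input and output paths are supplied as command-line arguments and are assumptions of this sketch.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because the map tasks run where the HDFS blocks live, the job moves computation to the data rather than the other way around.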

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison. (Yes, it's a long title, since people kept asking me to write about this and that too :) I do when it has a point.) While SQL databases are insanely useful tools, their monopoly of the last few decades is coming to an end. And it's about time: I can't even count the things that were forced into relational databases but never really fit them. (That being said, relational databases will always be the best for the stuff that has relations.) But the differences between NoSQL databases are much bigger than there ever were between one SQL database and another.

22 free tools for data visualization and analysis You may not think you've got much in common with an investigative journalist or an academic medical researcher. But if you're trying to extract useful information from an ever-increasing inflow of data, you'll likely find visualization useful -- whether it's to show patterns or trends with graphics instead of mountains of text, or to try to explain complex issues to a nontechnical audience. There are many tools around to help turn data into graphics, but they can carry hefty price tags. The cost can make sense for professionals whose primary job is to find meaning in mountains of information, but you might not be able to justify such an expense if you or your users only need a graphics application from time to time, or if your budget for new tools is somewhat limited.

Kafka. Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

Installing Ubuntu inside Windows using VirtualBox. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. The screenshots in this tutorial use Ubuntu 12.04, but the same principles apply to Ubuntu 12.10, 11.10, 10.04, and any future version of Ubuntu; in fact, you can install pretty much any Linux distribution this way. Introduction: VirtualBox allows you to run an entire operating system inside another operating system. Please be aware that you should have a minimum of 512 MB of RAM; 1 GB of RAM or more is recommended.
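
To show what the Kafka excerpt above means in practice, here is a minimal producer sketch that appends messages to a topic's log, written against the Java client API found in newer Kafka releases (older 0.7.x/0.8.0-era releases used a different API). The broker address and the topic name "test" are placeholders for whatever your installation uses.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Each record is appended to a partition of the topic's commit log;
                // records with the same key always land in the same partition.
                producer.send(new ProducerRecord<>("test", "key", "hello kafka"));
            }
        }
    }

Consumers then read the partitioned log at their own pace, which is what lets Kafka behave like a messaging system while keeping a simple append-only storage model.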

Survey of Distributed Databases - Toad for Cloud Wiki. Overview: This document, researched and authored by Quest's chief software architect Randy Guck, provides a summary of distributed databases: commercial products, open source projects, and research technologies that support massive data storage (petabyte+) using an architecture that distributes storage and processing across multiple servers. These can be considered “Internet age” databases that are being used by Amazon, Facebook, Google, and the like to address performance and scalability requirements that cannot be met by traditional relational databases.

Big Data Is As Misunderstood As Twitter Was Back In 2008. Boonsri Dickinson, Business Insider. In 2008, when Howard Lindzon started StockTwits, no one knew what Twitter was. Obviously, that has changed. Now that Twitter is more of a mainstream communication channel, Lindzon has figured out the secret to getting past all the noise on Twitter: by using human curation, StockTwits can serve up relevant social media content to major players like MSN Money.

What Does Big Data Mean to Infrastructure Professionals?
- Big data means the amount of data you’re working with today will look trivial within five years.
- Huge amounts of data will be kept longer and have way more value than today’s archived data.
- Business people will covet a new breed of alpha geeks. You will need new skills around data science, new types of programming, more math and statistics skills, and data hackers… lots of data hackers.
- You are going to have to develop new techniques to access, secure, move, analyze, process, visualize, and enhance data, in near real time.
- You will be minimizing data movement wherever possible by moving function to the data instead of data to function. You will be leveraging or inventing specialized capabilities to do certain types of processing (e.g., early recognition of images or content types) so you can do some processing close to the head.
- The cloud will become the compute and storage platform for big data, which will be populated by mobile devices and social networks.

Installing Ubuntu. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Future versions of this will be posted to my blog. Notes: This tutorial goes over the option of installing a traditional dual-boot. If there is any chance you might want to remove Ubuntu and return to Windows exclusively, do not set up a traditional dual-boot.

