
Big Data


R twotorials. The Apache Cassandra Project.

Should you go Beyond Relational Databases? Relational databases, such as MySQL, PostgreSQL, and various commercial products, have served us well for many years.

Should you go Beyond Relational Databases?

Lately, however, there has been a lot of discussion about whether the relational model is reaching the end of its life-span, and what may come after it. Should you care? Which database technology should you be using? Of course the answer is “it depends”, but that’s not very helpful. Let me ask you a few questions to help you figure out which technology is appropriate for your particular application. First of all, calm down: for most applications the relational model is still a fine choice. One symptom to look for: do you have tables with lots of columns, only a few of which are actually used by any particular row? Other symptoms relate to the scalability of your system: are you reaching the limit of the write capacity of a single database server? In my opinion, too much emphasis is often placed on scalability, even though it is a very remote problem on most projects.
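The “wide table, mostly empty rows” symptom can be made concrete with a small sketch. The table name, columns, and rows below are invented for illustration: one schema stores several kinds of product, so most optional columns are NULL in any given row.

```python
import sqlite3

# Hypothetical "wide" table: many optional columns, only a few of which
# are used by any particular row -- the sparsity symptom described above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name TEXT,
    screen_size REAL,   -- only meaningful for TVs
    num_pages INTEGER,  -- only meaningful for books
    wattage REAL        -- only meaningful for appliances
)""")
rows = [
    (1, "42-inch TV", 42.0, None, None),
    (2, "Moby-Dick",  None,  720,  None),
    (3, "Kettle",     None,  None, 2200.0),
]
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", rows)

# Measure how sparse the optional columns actually are.
(nulls,) = conn.execute(
    "SELECT SUM(screen_size IS NULL) + SUM(num_pages IS NULL)"
    " + SUM(wattage IS NULL) FROM products"
).fetchone()
total = 3 * len(rows)
print(f"{nulls}/{total} optional cells are NULL")  # 6/9
```

When a large fraction of cells are NULL like this, a document-oriented or column-family model may fit the data more naturally than a single wide relational table.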

That said, there is one more reason to consider non-relational databases: they are fashionable.

MapReduce. Data-Intensive Text Processing with MapReduce.

Abstract: Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications.

Data-Intensive Text Processing with MapReduce

Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning.
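The abstraction the abstract describes can be sketched in a few lines. This is a minimal, single-machine word-count illustration of the map → shuffle/group-by-key → reduce pattern, not Hadoop's actual API; a real framework distributes these phases across a cluster and handles scheduling, synchronization, and fault tolerance transparently.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Mapper: emit an intermediate (word, 1) pair for every word.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: combine all values for one key into a final count.
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"], counts["fox"])  # 3 2
```

Because each mapper reads only its own input split and each reducer sees only one key's values, both phases can run on many machines at once with no shared state.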

Welcome to Apache™ Hadoop™!

Large Scale Machine Learning using NVIDIA CUDA. Introduction: You may have heard about Stanford University’s machine learning on-line course given by Prof. Andrew Ng in 2011.

Large Scale Machine Learning using NVIDIA CUDA

It was a great course with lots of real-world examples. During the course I realized that GPUs are a perfect fit for large-scale machine learning problems. In fact, there are many examples of supervised and unsupervised learning all around the internet. Being a fan of both GPGPU and machine learning technologies, I came up with my own approach to running machine learning algorithms over huge amounts of data on GPUs.
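As a rough illustration of the kind of algorithm taught in that course, here is a CPU-only, NumPy-vectorized sketch of batch gradient descent fitting a quadratic model. The data, learning rate, and iteration count are invented for the example; the author's actual implementation runs the equivalent matrix operations on the GPU.

```python
import numpy as np

# Synthetic data: y = 1 + 2x + 3x^2 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.05, x.size)

X = np.column_stack([np.ones_like(x), x, x**2])  # design matrix
theta = np.zeros(3)
alpha = 0.3  # learning rate (illustrative choice)

for _ in range(5000):
    # Vectorized batch gradient: one matrix product per step,
    # which is exactly the shape of work a GPU BLAS accelerates.
    grad = X.T @ (X @ theta - y) / len(y)
    theta -= alpha * grad

print(theta)  # close to the true coefficients [1, 2, 3]
```

The whole inner loop is two matrix-vector products, which is why libraries like CUBLAS map onto it so directly when the dataset is large.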

I’ve already presented the solution recently at the South Florida Code Camp 2012. There are a lot of concepts in machine learning, but in this post I’m only scratching the surface. I’ve also prepared the same example using CUBLAS with a vectorized implementation of the polynomial regression algorithm, but the CUBLAS example would require more in-depth explanations. [Sections: Background, Machine Learning (Figures 1–3), Gradient Descent, ...]

Wp/wp-content/uploads/2011/09/02-libs.pdf.

Tag: NVIDIA CUDA. This tutorial by Dan Cyca outlines the shared memory configurations for NVIDIA Fermi and Kepler architectures, and demonstrates how to rewrite kernels to take advantage of the changes in Kepler’s shared memory architecture.

Tag: NVIDIA CUDA

Developed in partnership with NVIDIA, this hands-on four-day course will teach how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. Benefits include:

- Hands-on exercises and progressive lectures
- Individual laptops equipped with NVIDIA GPUs for student use
- Small class sizes to maximize learning
- 90 days post-training support – NEW!

February 25–28, 2014, Baltimore, MD, USA, details and registration. This webinar recording provides an overview of the profiling techniques and the tools available to help you optimize your code.

The latest release 1.5.0 of the free open source linear algebra library ViennaCL is now available for download. Thrust - Parallel Algorithms Library. Database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf.

Parallel Programming in the Age of Big Data. We’re now entering what I call the “Industrial Revolution of Data,” where the majority of data will be stamped out by machines: software logs, cameras, microphones, RFID readers, wireless sensor networks and so on.

Parallel Programming in the Age of Big Data

These machines generate data a lot faster than people can, and their production rates will grow exponentially with Moore’s Law. Storing this data is cheap, and it can be mined for valuable information. In this context, there is some good news for parallel programming. Data analysis software parallelizes fairly naturally. In fact, software written in SQL has been running in parallel for more than 20 years. To understand where we’re headed with parallel software, let’s look at what the computer industry has already accomplished.
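The claim that data analysis parallelizes naturally can be seen in miniature: an aggregate over machine-generated records splits into independent partial results that are combined at the end, which is essentially what a parallel SQL engine does behind a simple SUM(). A small sketch, with the worker count and data invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker aggregates its own partition independently --
    # no coordination is needed until the final combine step.
    return sum(chunk)

readings = list(range(1_000_000))          # stand-in for sensor/log data
chunks = [readings[i::4] for i in range(4)]  # partition across 4 workers

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))  # combine partial results

print(total == sum(readings))  # True
```

The same partition/aggregate/combine shape underlies parallel SQL execution and MapReduce alike, which is why declarative analysis queries scale out so readily.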

The MapReduce programming model has turned a new page in the parallelism story. SQL provides a higher-level language that is more flexible and optimizable, but less familiar to many programmers. Slide courtesy of Greenplum.