BigData

Index of page view statistics for 2013-01.

Apache BigData

GridGain = In-Memory Big Data.

Hortonworks Develops, Distributes and Supports Enterprise Apache Hadoop.

Hadoop Platform as a Service in the Cloud. By Sriram Krishnan and Eva Tse, Data Science & Engineering.

Hadoop has become the de facto standard for managing and processing hundreds of terabytes to petabytes of data. At Netflix, our Hadoop-based data warehouse is petabyte-scale, and growing rapidly. However, with the big data explosion in recent times, even this is not very novel anymore.

Our architecture, however, is unique as it enables us to build a data warehouse of practically infinite scale in the cloud (both in terms of data and computational power).

Architectural Overview. In a traditional data center-based Hadoop data warehouse, the data is hosted on the Hadoop Distributed File System (HDFS).

S3 as the Cloud Data Warehouse. We use S3 as the “source of truth” for our cloud-based data warehouse.

Multiple Hadoop Clusters for Different Workloads. We currently use Amazon’s Elastic MapReduce (EMR) distribution of Hadoop.

Tools and Gateways. Our developers use a variety of tools in the Hadoop ecosystem.

Why did we build Genie? Summary.

How To Build Optimal Hadoop Cluster « Atlantbh :: Software Development, Hadoop, Big Data, Cloud, Outsourcing. Jan13.

Preface. The amount of data stored in databases and files is growing every day, which creates a need to build cheaper, maintainable and scalable environments capable of storing large amounts of data (“Big Data”).

Conventional RDBMS systems have become too expensive and insufficiently scalable for today’s needs, so it is time to use or develop new techniques that can satisfy those needs. One of the technologies leading in this direction is cloud computing. In this document I will try to explain how to build a scalable Hadoop cluster in which it is possible to store, index, search and maintain practically unlimited amounts of data. This article will cover installation and configuration steps divided into these sections: network architecture, operating system, hardware requirements, and Hadoop software installation/setup.

Network Architecture. In an effort to reduce the amount of background traffic, a virtual private network has been created for the cloud.

Hadoop and NoSQL: Interview with J. Chris Anderson | ODBMS Industry Watch.

“The missing piece of the Hadoop puzzle is accounting for real time changes. Hadoop can give powerful analysis, but it is fundamentally a batch-oriented paradigm.” - J. Chris Anderson.

How is Hadoop related to NoSQL databases? What are the main performance bottlenecks of NoSQL data stores?

Q1. Chris Anderson: The missing piece of the Hadoop puzzle is accounting for real time changes. We are seeing interesting applications where Couchbase is used to enhance the batch-based Hadoop analysis with real time information, giving the effect of a continuous process. And this solves the data transfer costs issue you mention, because you essentially move the data out of Couchbase into Hadoop when it cools off. For folks working on problems like this, we have a Sqoop connector and we’ll be talking about it with Cloudera at our CouchConf in San Francisco on September 21.

Q2. Q3. Chris Anderson: Scaling up is easier from a software perspective.

Q4. Q5. Chris Anderson: I agree.

Q6. Q7. Q8. Q9.