
Nutch/MapReduce/Hadoop


Hadoop Tutorial. Introduction

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications. This module introduces the design of this distributed file system and instructions on how to operate it.

Goals for this Module:
Understand the basic design of HDFS and how it relates to basic distributed file system concepts
Learn how to set up and use HDFS from the command line
Learn how to use HDFS in your applications

Distributed File System Basics

A distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network.
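As a rough illustration of what "stored in a redundant fashion across multiple machines" can mean, here is a toy sketch in Python. It is not HDFS code and does not reproduce HDFS's actual placement policy: the function names, the node names, and the round-robin placement rule are all assumptions made up for this example.

```python
from typing import Dict, List, Set


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes.

    Hypothetical round-robin rule, chosen only for illustration: block b goes
    to nodes b, b+1, ... (wrapping around the node list).
    """
    if not 1 <= replication <= len(nodes):
        raise ValueError("replication must be between 1 and the number of nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def readable_after_failure(placement: Dict[int, List[str]], failed: Set[str]) -> bool:
    """A file stays readable if every block has at least one replica on a live node."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 10, block_size=4)
    placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c", "node-d"], replication=3)
    for block_id, replicas in placement.items():
        print(block_id, replicas)
    print(readable_after_failure(placement, failed={"node-a", "node-b"}))
```

With three replicas per block, any two machines can fail and each block still has a surviving copy; losing three machines that hold all copies of one block makes the file unreadable.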

NFS, the Network File System, is the most ubiquitous distributed file system.

Configuring HDFS
Cluster configuration
About Nutch
Nutch Wiki
Hadoop Sorts a Petabyte « Free Search
MapReduce cookbook for machine learning « Free Search
Amazon Mechanical Turk - Welcome
Free Search

Cloud: commodity or proprietary? « Free Search

A few days ago Google announced its App Engine, which lets folks build applications that run in Google’s cloud. Amazon has for a while had a number of services to let folks run applications in Amazon’s cloud.

But in both of these cases, one must use their proprietary APIs. For example, Google provides a datastore API that applications must use to persist state, while Amazon similarly provides a SimpleDB API. Amazon’s services are generally lower-level and easier to adopt à la carte, while Google provides one-stop shopping. Either way, one’s application code becomes dependent on a particular vendor. As we shift applications to the cloud, do we want our code to remain vendor-neutral? I think most would prefer not to be locked in, and would rather that cloud providers sold commodity services.
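To make the lock-in concern concrete, here is a minimal Python sketch of application code written against a small storage interface rather than a particular vendor's API. Everything in it is hypothetical and for illustration only: `KeyValueStore`, `InMemoryStore`, `DirectoryStore`, and `record_visit` are invented names, and neither backend talks to Google's or Amazon's services.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Optional


class KeyValueStore(ABC):
    """Hypothetical vendor-neutral interface the application depends on."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryStore(KeyValueStore):
    """Backend that keeps state in a dict (useful for tests)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


class DirectoryStore(KeyValueStore):
    """Backend that keeps each value in a file under a root directory."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Hex-encode the key so any string maps to a safe file name.
        return self._root / key.encode("utf-8").hex()

    def put(self, key: str, value: bytes) -> None:
        self._path(key).write_bytes(value)

    def get(self, key: str) -> Optional[bytes]:
        path = self._path(key)
        return path.read_bytes() if path.exists() else None


def record_visit(store: KeyValueStore, user: str) -> int:
    """Application logic: it only sees the interface, never a vendor API."""
    key = f"visits:{user}"
    raw = store.get(key)
    count = int(raw.decode("utf-8")) + 1 if raw is not None else 1
    store.put(key, str(count).encode("utf-8"))
    return count


if __name__ == "__main__":
    for store in (InMemoryStore(), DirectoryStore(tempfile.mkdtemp())):
        record_visit(store, "alice")
        print(type(store).__name__, record_visit(store, "alice"))
```

Swapping backends changes one constructor call and leaves `record_visit` untouched. A vendor-specific backend would be one more subclass, which is the kind of isolation that keeps the rest of the code portable.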

Hadoop is a big initial step in this direction. Moral: if you want commodity cloud hosting, pitch in now. Tags: aws, cloud, hadoop.