Algorithms - Apache Mahout

Related:  Data Science

Running Hadoop on Windows « Hayes Davis What is Hadoop? Hadoop is an open source Apache project written in Java and designed to provide users with two things: a distributed file system (HDFS) and a method for distributed computation. It’s based on Google’s published Google File System and MapReduce concepts, which describe how to build a framework capable of executing intensive computations across tons of computers. Something that might, you know, be helpful in building a giant search index. What’s the big deal about running it on Windows? Hadoop’s key design goal is to provide storage and computation on lots of homogeneous “commodity” machines; usually a fairly beefy machine running Linux. Caveat Emptor I’m one of the few that has invested the time to set up an actual distributed Hadoop installation on Windows. This guide uses Hadoop v0.17 and assumes that you don’t have any previous Hadoop installation. Bottom line: your mileage may vary, but this guide should get you started running Hadoop on Windows. Pre-Requisites Java Cygwin

K Means Clustering with Tf-idf Weights | Blog | Jonathan Zong Unsupervised learning algorithms in machine learning impose structure on unlabeled datasets. In Prof. Andrew Ng's inaugural ml-class from the pre-Coursera days, the first unsupervised learning algorithm introduced was k-means, which I implemented in Octave for programming exercise 7. Now, after the fact but with a fresh perspective and more experience, I will revisit the k-means algorithm in Java to implement text clustering. K-means is an algorithm designed to find coherent groups of data, a.k.a. clusters. Tf-idf Weighting Before being able to run k-means on a set of text documents, the documents have to be represented as mutually comparable vectors. Cosine Similarity Now that we're equipped with a numerical model with which to compare our data, we can represent each document as a vector of terms using a global ordering of each unique term found throughout all of the documents, making sure first to clean the input. k-means
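The tf-idf weighting and cosine similarity steps described in the excerpt above can be sketched as follows. This is a minimal illustration in Python rather than the Java of the original post; the function names, the simple alphabetic tokenizer, and the sample documents are all assumptions made for the example, and the tf and idf definitions (relative term frequency, natural-log inverse document frequency) are one common variant, not necessarily the one the author used.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic cleaning step (an assumption): lowercase, keep letters only.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return (vocab, vectors) using a global ordering of all unique terms."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1
        # tf = relative frequency in the document, idf = ln(N / df).
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for all-zero vectors.
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
docs = [
    "the cat sat on the mat",
    "the cat chased a mouse",
    "stocks fell as markets closed",
]
vocab, vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

Here the first two documents share terms, so `sim_related` is positive, while the third shares none with the first, so `sim_unrelated` is zero. A k-means implementation would then group documents by repeatedly assigning each vector to its most similar centroid.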

Data Beta Lecture 6: Collaborative Filtering / Information Extraction Lecture 6: Collaborative Filtering / Information Extraction Tao Yang's Lecture ExpertRank: Ranking system for See US Patent Application 7028026 by Tao Yang, Wei Wang, and Apostolos Gerasoulis. Retrieve documents from inverted file. Cluster documents by content and by link structure. Apply a hub/authority analysis to each cluster. Required Reading: Chakrabarti, sec 4.5 Evaluating collaborative filtering recommender systems By Jonathan Herlocker, Joseph Konstan, Loren Terveen, and John Riedl, ACM Transactions on Information Systems, vol. 22, No. 1, 2004, pp. 5-53. Unsupervised Named-Entity Extraction from the Web. Additional Reading Recommendations: Item to Item Collaborative Filtering by Greg Linden, Brent Smith and Jeremy York, IEEE Internet Computing January-February 2003. Collaborative Filtering Example: Terms and Documents We say that document D is relevant to query term T if D contains T. Example: Personal preferences General issues in either of these: 1.

Google's Mind-Blowing Big-Data Tool Grows Open Source Twin | Wired Enterprise Silicon Valley startup MapR has launched an open source project called Drill, which seeks to mimic a shockingly effective data-analysis tool built by Google. Mike Olson and John Schroeder shared a stage at a recent meeting of Silicon Valley’s celebrated Churchill Club, and they didn’t exactly see eye to eye. Olson is the CEO of a Valley startup called Cloudera, and Schroeder is the boss at MapR, a conspicuous Cloudera rival. Both outfits deal in Hadoop, a sweeping open source software platform based on data center technologies that underpinned the rise of Google’s web-dominating search engine, but in building their particular businesses, the two startups approached Hadoop from two very different directions. Whereas Cloudera worked closely with the open source Hadoop project to enhance the software code that’s freely available to the world at large, MapR decided to rebuild the platform from the ground up, and when that was done, it sold the new code as proprietary software.

Clustering Snippets With Carrot2 | Index Data We’ve been investigating ways we might add result clustering to our metasearch tools. Here’s a short introduction to the topic and to an open source platform for experimenting in this area. Clustering Using a search interface that just takes some keywords often leads to miscommunication. To aid the user in narrowing results to just those applicable to the context they’re thinking about, a good deal of work has been done in the area of “clustering” searches. One common way to represent a document, both for searching and data mining, is the vector space model. This kind of bag-of-words model is very useful for separating documents into groups. Another differentiator among clustering algorithms is when the clustering happens, before or after search. Similarly, we can leverage another part of the search system: snippet generation. Carrot2 Suffix Tree Clustering (STC) is one of the first feasible snippet-based document clustering algorithms, proposed in 1998 by Zamir and Etzioni. Lingo

yooreeka - Google Code The Yooreeka project started with the code of the book "Algorithms of the Intelligent Web" (Manning 2009). Although the term "Web" prevailed in the title, in essence, the algorithms are valuable in any software application. An Errata page for the book has been posted here. The second major revision of the code (v. 2.x) will introduce some enhancements, some new features, and it will restructure the packages from the root org.yooreeka. You can find the Yooreeka 2.0 API (Javadoc) here and you can also visit us at our Google+ home. Lastly, Yooreeka 2.0 will be licensed under the Apache License rather than the somewhat more restrictive LGPL.

Geeking with Greg

How MySpace Tested Their Live Site with 1 Million Concurrent Users This is a guest post by Dan Bartow, VP of SOASTA, talking about how they pelted MySpace with 1 million concurrent users using 800 EC2 instances. I thought this was an interesting story because: that's a lot of users, it takes big cojones to test your live site like that, and not everything worked out quite as expected. I'd like to thank Dan for taking the time to write and share this article. In December of 2009 MySpace launched a new wave of streaming music video offerings in New Zealand, building on the previous success of MySpace music. If you manage the infrastructure that sits behind a high traffic application you don’t want any surprises. For MySpace, the goal was to test an additional 1 million concurrent users on their live site stressing the new video features. Here are the details of the load that was generated during testing. Test Environment Architecture SOASTA CloudTest™ manages calling out to cloud providers, in this case Amazon, and provisioning the servers for testing.

Using REST to Invoke the API - Custom Search The JSON/Atom Custom Search API lets you develop websites and applications to retrieve and display search results from Google Custom Search programmatically. With this API, you can use RESTful requests to get either web search or image search results in JSON or Atom format. Data format The JSON/Atom Custom Search API can return results in one of two formats. There are also two external documents that are helpful resources for using this API: Google WebSearch Protocol (XML): The JSON/Atom Custom Search API provides a subset of the functionality provided by the XML API, but it instead returns data in JSON or Atom format. OpenSearch 1.1 Specification: This API uses the OpenSearch specification to describe the search engine and provide data regarding the results. Prerequisites Search engine ID By calling the API, the user issues requests against an existing instance of a Custom Search Engine. API key The JSON/Atom Custom Search API requires the use of an API key. Pricing
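As a rough sketch of what such a RESTful request might look like, the Python snippet below only builds the request URL from the two prerequisites named above (a search engine ID and an API key) plus a query; it does not send anything. The endpoint and the `key`, `cx`, `q`, and `start` parameter names are not given in the excerpt; they are taken from Google's public documentation as best recalled and should be checked against the current reference. The helper name and placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint; verify against the current API reference.
ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, query, start=1):
    """Build a Custom Search request URL (hypothetical helper)."""
    params = {
        "key": api_key,    # API key (prerequisite)
        "cx": engine_id,   # search engine ID (prerequisite)
        "q": query,        # the search terms
        "start": start,    # index of the first result to return
    }
    # urlencode percent-encodes values and joins them with "&".
    return ENDPOINT + "?" + urlencode(params)


# Placeholder credentials; replace with real values before sending.
url = build_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "apache mahout clustering")
parsed = parse_qs(urlsplit(url).query)
```

The resulting `url` could be fetched with any HTTP client; building it separately keeps the example runnable without credentials or network access.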

Machine Learning Department - Carnegie Mellon University