K Means Clustering with Tf-idf Weights | Blog | Jonathan Zong Unsupervised learning algorithms in machine learning impose structure on unlabeled datasets. In Prof. Andrew Ng's inaugural ml-class from the pre-Coursera days, the first unsupervised learning algorithm introduced was k-means, which I implemented in Octave for programming exercise 7. Now, after the fact but with a fresh perspective and more experience, I will revisit the k-means algorithm in Java to implement text clustering. K-means is an algorithm designed to find coherent groups of data, a.k.a. clusters. Tf-idf Weighting Before being able to run k-means on a set of text documents, the documents have to be represented as mutually comparable vectors. Cosine Similarity Now that we're equipped with a numerical model with which to compare our data, we can represent each document as a vector of terms using a global ordering of each unique term found throughout all of the documents, making sure first to clean the input. k-means
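As a rough illustration of the two steps named above (tf-idf weighting, then cosine similarity), here is a minimal Python sketch. The original post works in Java; the function names `tfidf_vectors` and `cosine_similarity`, the whitespace tokenizer, and the particular weighting (term frequency divided by document length, times log(N / document frequency)) are assumptions made for this example, not the post's implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors): one tf-idf vector per document.

    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    # Crude cleaning step: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # Global ordering of every unique term across all documents.
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero on empty documents
        vectors.append(
            [(counts[t] / length) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


docs = ["the cat sat", "the cat ran", "stocks fell today"]
vocab, vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # share "the" and "cat"
print(cosine_similarity(vectors[0], vectors[2]))  # share no terms
```

Documents that share terms get a similarity above zero, while documents with no terms in common score exactly zero, which is the comparison a k-means step would then rely on.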
Data Beta Clustering Snippets With Carrot2 | Index Data We’ve been investigating ways we might add result clustering to our metasearch tools. Here’s a short introduction to the topic and to an open source platform for experimenting in this area. Clustering Using a search interface that just takes some keywords often leads to miscommunication. To aid the user in narrowing results to just those applicable to the context they’re thinking about, a good deal of work has been done in the area of “clustering” searches. One common way to represent a document, both for searching and data mining, is the vector space model. This kind of bag-of-words model is very useful for separating documents into groups. Another differentiator among clustering algorithms is when the clustering happens, before or after search. Similarly, we can leverage another part of the search system: snippet generation. Carrot2 Suffix Tree Clustering (STC) is one of the first feasible snippet-based document clustering algorithms, proposed in 1998 by Zamir and Etzioni. Lingo
yooreeka - Google Code The Yooreeka project started with the code of the book "Algorithms of the Intelligent Web" (Manning 2009). Although the term "Web" prevailed in the title, in essence, the algorithms are valuable in any software application. An Errata page for the book has been posted here. The second major revision of the code (v. 2.x) will introduce some enhancements and some new features, and it will restructure the packages from the root org.yooreeka. You can find the Yooreeka 2.0 API (Javadoc) here and you can also visit us at our Google+ home. Lastly, Yooreeka 2.0 will be licensed under the Apache License rather than the somewhat more restrictive LGPL. Using REST to Invoke the API - Custom Search The JSON/Atom Custom Search API lets you develop websites and applications to retrieve and display search results from Google Custom Search programmatically. With this API, you can use RESTful requests to get either web search or image search results in JSON or Atom format. Data format The JSON/Atom Custom Search API can return results in one of two formats. There are also two external documents that are helpful resources for using this API: Google WebSearch Protocol (XML): The JSON/Atom Custom Search API provides a subset of the functionality provided by the XML API, but it instead returns data in JSON or Atom format. OpenSearch 1.1 Specification: This API uses the OpenSearch specification to describe the search engine and provide data regarding the results. Prerequisites Search engine ID By calling the API, the user issues requests against an existing instance of a Custom Search Engine. API key The JSON/Atom Custom Search API requires the use of an API key. Pricing
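To make the two prerequisites concrete, the sketch below assembles a RESTful request URL from an API key, a search engine ID, and a query. The helper name `build_search_url` is hypothetical, and the endpoint and the parameter names `key`, `cx`, and `q` are written from memory of Google's public documentation rather than taken from the excerpt above, so they should be checked against the current reference before use. No request is sent.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Custom Search JSON API; verify against Google's docs.
ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, query):
    """Build a GET URL carrying the API key, search engine ID, and query."""
    params = {
        "key": api_key,   # the API key the excerpt lists as a prerequisite
        "cx": engine_id,  # the Custom Search Engine ID
        "q": query,       # the search terms
    }
    return ENDPOINT + "?" + urlencode(params)


# Placeholder credentials, for illustration only.
print(build_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "k means clustering"))
```

Keeping URL construction separate from the HTTP call makes it easy to inspect the request before spending any quota on it.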
Machine Learning Department - Carnegie Mellon University Free Search API Are you looking for an alternative to Google Web Search API (deprecated), Yahoo Boss (commercial) or Bing Web Search API (commercial)? Try our FREE Web Search API! Prohibitive search infrastructure costs and high-priced Search APIs are market entry barriers for innovative services and start-ups. The dramatic cost advantage of our unique p2p technology allows us to provide a Free Search API. With 1 million free queries per month we provide three orders of magnitude more than the incumbents do. An open platform, enabling innovation, competition & diversity in search! Build your own mobile news & search app, news clipping, trend monitoring, competitive intelligence, reputation management, brand monitoring, search engine optimization, plagiarism detection, alternative search engine, research project and more! Web Search More than 2 billion pages indexed. News Search News articles from newspapers, magazines and blogs. Trending News Trending news, grouped by topic. API Key Parameter Return Values
Fisher's method Under Fisher's method, two small p-values P1 and P2 combine to form a smaller p-value. The yellow-green boundary defines the region where the meta-analysis p-value is below 0.05. For example, if both p-values are around 0.10, or if one is around 0.04 and one is around 0.25, the meta-analysis p-value is around 0.05. In statistics, Fisher's method, also known as Fisher's combined probability test, is a technique for data fusion or "meta-analysis" (analysis of analyses). Application to independent test statistics Fisher's method combines extreme value probabilities from each test, commonly known as "p-values", into one test statistic (X2) using the formula X2 = -2 Σ ln(pi), summed over the k tests, where pi is the p-value for the ith hypothesis test. Limitations of independence assumption Dependence among statistical tests is generally positive, which means that the p-value of X2 is too small (anti-conservative) if the dependency is not taken into account. reduced for Extension to dependent test statistics
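The worked numbers in the excerpt (two p-values of 0.10, or 0.04 and 0.25, combining to roughly 0.05) can be checked with a short Python sketch. It assumes independent tests, computes X2 = -2 Σ ln(pi), and compares it to a chi-squared distribution with 2k degrees of freedom. The function name `fisher_combined_p` is made up for this example; the closed-form tail probability used here is valid because the degrees of freedom are always even.

```python
import math


def fisher_combined_p(p_values):
    """Return (X2, combined p-value) for independent p-values.

    Uses the chi-squared survival function for even degrees of freedom 2k:
    P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_x2 = -sum(math.log(p) for p in p_values)  # this is X2 / 2
    term = 1.0   # j = 0 term of the series
    total = 1.0
    for j in range(1, k):
        term *= half_x2 / j  # builds (x/2)^j / j! incrementally
        total += term
    return 2.0 * half_x2, math.exp(-half_x2) * total


for pair in ([0.10, 0.10], [0.04, 0.25]):
    x2, p = fisher_combined_p(pair)
    print(pair, round(x2, 3), round(p, 4))
```

Both pairs have the same product (0.01), so they give the same statistic and the same combined p-value of about 0.056, in line with the "around 0.05" figure quoted above.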
simple web crawler / scraper tutorial using requests module in python
Let me show you how to use the Requests Python module to write a simple web crawler / scraper. So, let's define our problem first. In this page: I am publishing some programming problems. So, now I shall write a script to get the links (URLs) of the problems. So, let's start. First make sure you can get the content of the page.

    import requests

    def get_page(url):
        r = requests.get(url)
        print r.status_code
        with open("test.html", "w") as fp:
            fp.write(r.text)

    if __name__ == "__main__":
        url = '
        get_page(url)

Now run the program:

    $ python cpbook_crawler.py
    200
    Traceback (most recent call last):
      File "cpbook_crawler.py", line 15, in

Hmm... we got an error.

    import re
    import requests

Now run the script:

    $ python cpbook_crawler.py

We got an empty list.

    content = content.replace("\n", '')

You should add this line and run the program again. Now we write the regular expression to get the list of the URLs.
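The tutorial's own regular expression is not included in this excerpt, so the following is only a guess at what the link-extraction step could look like, written for Python 3. The pattern `LINK_RE`, the function name `extract_links`, and the sample HTML are all assumptions. Instead of stripping newlines first, this sketch uses character classes that already match across line breaks, which sidesteps the empty-list problem mentioned above.

```python
import re

# Capture the href value of <a ...> tags. [^>] and \s both match newlines,
# so a tag split over several lines is still found.
LINK_RE = re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(content):
    """Return every href value found in anchor tags, in document order."""
    return LINK_RE.findall(content)


# Made-up page content standing in for the downloaded HTML.
sample = (
    '<ul><li><a href="/p/one.html">One</a></li>\n'
    "<li><a\n class=\"x\" href='/p/two.html'>Two</a></li></ul>"
)
print(extract_links(sample))
```

Regular expressions are fine for a quick one-off script like this, but they are fragile on real-world HTML; a proper parser is the safer choice for anything longer-lived.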
cool ML links » How to make a web crawler in under 50 lines of Python code 'Net Instructions How to make a web crawler in under 50 lines of Python code September 24, 2011 Interested in learning how Google, Bing, or Yahoo work? And let’s see how it is run. Okay, but how does it work? Let’s first talk about what a web crawler’s purpose is. Web page content (the text and multimedia on a page) Links (to other web pages on the same website, or to other websites entirely) Which is exactly what this little “robot” does. Is this how Google works? Sort of. *Your search terms actually visit a number of databases simultaneously such as spell checkers, translation services, analytic and tracking servers, etc. Let’s look at the code in more detail! The following code should be fully functional for Python 3.x. Magic!
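The article's code is not part of this excerpt. As a stand-in, here is a minimal Python 3 sketch of the idea it describes: visit a page, collect its content and links, then follow those links. The names `LinkParser`, `crawl`, and `FAKE_SITE` are invented for this example, and pages are served from an in-memory dictionary rather than the network so the sketch runs anywhere; a real crawler would pass in a function that downloads each URL.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved; skip it
        visited.append(url)
        parser = LinkParser(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


# Made-up three-page site used in place of real HTTP requests.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c">C</a>',
}

print(crawl("http://example.test/", FAKE_SITE.get))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, and `max_pages` bounds the run on sites with many links.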