Clustering


Graph clustering: an approach based on "prominent vertexes". I was doing some experiments on the entropy graph, and I noticed that under some conditions it is a good marker of special vertexes. I'll try to explain the concept without formulas, which is really challenging because of my English :) These vertexes play a special role, and I decided (arbitrarily) to call them "Prominent Vertexes": if you remove them, you disconnect the graph, and the remaining connected components are the clusters that induce the graph partitioning.

To be more precise, the connected components give just N-1 of the clusters; the last cluster is obtained by taking the complement of those N-1 clusters in the original graph. The intuitive explanation: consider all the possible paths between two nodes A and B in a graph, and take the intersection of those paths. The result is the set of nodes that are essential to connect the two nodes: these are the prominent nodes for the vertexes A and B.
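The path-intersection idea can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: it relies on the observation that a vertex lies on every path between A and B exactly when removing it disconnects A from B. The function names, the adjacency-dict representation, and the example graph are all assumptions made for the sketch, and the endpoints A and B themselves are left out of the result.

```python
from collections import deque

def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, ignoring the removed vertex."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph[node]:
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def prominent_vertices(graph, a, b):
    """Vertices other than a and b that lie on every path from a to b."""
    if not reachable(graph, a, b):
        return set()  # a and b are not connected, so there are no paths to intersect
    return {v for v in graph
            if v not in (a, b) and not reachable(graph, a, b, removed=v)}

# Hypothetical example: two triangles joined through the chain D - E - F.
graph = {
    "A": ["C", "D"], "C": ["A", "D"], "D": ["A", "C", "E"],
    "E": ["D", "F"],
    "F": ["E", "B", "G"], "B": ["F", "G"], "G": ["F", "B"],
}
print(prominent_vertices(graph, "A", "B"))  # the set containing D, E and F
```

Removing D, E and F from this example leaves the components {A, C} and {B, G}, which matches the description of clusters as what remains once the prominent vertexes are taken out.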

Detecting Communities in Social Graph. In analyzing social networks, one common problem is how to detect communities, such as groups of people who know or interact frequently with each other. A community is a subgraph of a graph where the connectivity is unusually dense. In this blog, I will enumerate some common algorithms for finding communities. First of all, community detection can be thought of as a graph partitioning problem. In this case, a single node will belong to no more than one community. High Betweenness Edge Removal. The intuition is that members within a community are densely connected and have many paths to reach each other. Links between different communities are comparatively few, so many shortest paths between communities must pass through them, which gives those links high betweenness.
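A minimal sketch of this intuition, assuming an unweighted, undirected graph stored as a dict of neighbor sets: compute the shortest-path betweenness of every edge (here with a Brandes-style breadth-first accumulation), then remove the highest-scoring edge until the graph falls apart into one more component. The function names and the example graph are hypothetical, and the post's own algorithm steps are not reproduced here.

```python
from collections import deque

def edge_betweenness(graph):
    """Shortest-path betweenness of each edge, keyed by frozenset({u, v})."""
    score = {}
    for source in graph:
        # BFS from source: count shortest paths (sigma) and record predecessors.
        dist = {source: 0}
        sigma = {v: 0 for v in graph}
        sigma[source] = 1
        preds = {v: [] for v in graph}
        order, queue = [], deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, sharing credit among predecessors.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + credit
                delta[v] += credit
    # Every unordered pair was counted once from each endpoint, so halve.
    return {edge: s / 2 for edge, s in score.items()}

def components(graph):
    """Connected components as a list of vertex sets."""
    seen, parts = set(), []
    for start in graph:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(graph[v])
        seen |= part
        parts.append(part)
    return parts

def split_once(graph):
    """Remove highest-betweenness edges until one more component appears."""
    g = {v: set(nbrs) for v, nbrs in graph.items()}
    target = len(components(g)) + 1
    while len(components(g)) < target:
        scores = edge_betweenness(g)
        if not scores:
            break  # no edges left to remove
        a, b = max(scores, key=scores.get)
        g[a].discard(b)
        g[b].discard(a)
    return components(g)

# Hypothetical example: two triangles joined by the single edge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(edge_betweenness(graph)[frozenset((3, 4))])  # 9.0
print(split_once(graph))
```

In the example, all nine shortest paths between the two triangles cross the edge 3-4, so it scores highest and is removed first. Recomputing betweenness after every removal is costly on large graphs; the sketch favors clarity over speed.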

Therefore, by removing these high-betweenness links, the graph will be segregated into communities. Algorithm: Hierarchical Clustering. This is a very general approach to detecting communities. Random Walk. A random walk can be used to compute the distance between every pair of nodes, say node-B and node-C. Notice that the choice of beta is important. There is an analytical solution to this problem.

InfoBlog: SpotSigs: Are Stopwords Finally Good for Something? (Posted by Martin Theobald) In almost all classical Information Retrieval settings that have a text processing component, stopwords are first discarded before anything interesting happens with the document. “Interesting” here might mean indexing the content for search, extracting features for automatic classification, or some other form of content analysis of whatever flavor. Jonathan (my co-author on the SpotSigs paper) had the amazing idea that stopwords may, however, be very good indicators of the actual interesting parts of a web page.

It is especially useful to know where the interesting parts of a web page are when they are interspersed with “added-value” content such as advertisements or navigational banners. This is most strikingly the case with online news articles, but applies more generally across the web. In our SpotSigs project, we tried to detect near-duplicate Web pages in the news domain.

Have a look at the paper or slides for more details.

Detecting near duplicates in big data. I finally got to a WWW 2007 paper out of Google I have been meaning to read, "Detecting Near-Duplicates for Web Crawling" (PDF) by Gurmeet Manku, Arvind Jain, and Anish Das Sarma. The paper takes more theoretical work from Moses Charikar back in 2002, "Similarity Estimation Techniques from Rounding Algorithms" (ACM site), which describes a form of locality-sensitive hashing, and applies it at very large scale (8B documents), dealing with all the practical issues along the way. In that sense, this Google WWW 2007 paper has a lot of similarities with Monika Henzinger's excellent SIGIR 2006 paper, "Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms" (ACM site). That paper evaluated the shingles method of creating fingerprints of substrings from a document against Charikar's technique on a 1.6B document data set and found the latter to be superior.
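To give a feel for Charikar-style fingerprinting (often called simhash), here is a minimal sketch. It is not the code from either paper: the whitespace tokenizer, the use of MD5 as the per-token hash, the term-frequency weights, and the 64-bit width are all assumptions chosen for illustration. Each token votes on every bit of the fingerprint, so documents that share most of their tokens tend to end up with fingerprints that differ in only a few bits.

```python
import hashlib
from collections import Counter

def simhash(text, bits=64):
    """Fingerprint where each bit is a weighted majority vote over token hashes."""
    counts = Counter(text.lower().split())
    totals = [0] * bits
    for token, weight in counts.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # A token pushes bit i up if its hash has that bit set, down otherwise.
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if totals[i] > 0)

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
print(hamming(simhash(doc_a), simhash(doc_b)))
```

Two documents would then be treated as near duplicates when the Hamming distance between their fingerprints falls below some small threshold. Finding all such pairs efficiently across billions of fingerprints is the hard part, and this sketch does not attempt it.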

Some excerpts from the Google WWW 2007 paper: "Documents that are exact duplicates of each other ... are easy to identify ..."

Clever method of near duplicate detection. Martin Theobald, Jonathan Siddharth, and Andreas Paepcke from Stanford University have a cute idea in their SIGIR 2008 paper, "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections" (PDF).

They focus near duplicate detection on the important parts of a web page by using the next few words after a stop word as a signature. An extended excerpt: The frequent presence of many diverse semantic units in individual Web pages makes near-duplicate detection particularly difficult. Frame elements for branding, and advertisements are often freely interspersed with other content elements.

The paper gives an example of generating a spot signature where a piece of text like "at a rally to kick off a weeklong campaign" produces two spots: "a:rally:kick" and "a:weeklong:campaign". What I particularly like about this paper is that they take a very hard problem and find a beautifully simple solution.
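The example above can be reproduced with a short sketch. The stopword list, the choice of "a" as the only antecedent, and the chain length of two are assumptions picked so that the output matches the quoted example; the paper's actual parameters and tokenization may differ.

```python
# Hypothetical stopword list, just large enough for the example sentence.
STOPWORDS = {"a", "an", "at", "the", "to", "off", "of", "for", "is", "in", "on"}

def spot_signatures(text, antecedents=("a",), chain_length=2):
    """For each antecedent stopword, join it with the next non-stopword tokens."""
    tokens = text.lower().split()
    spots = []
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        # Skip stopwords after the antecedent and keep the first chain_length words.
        chain = [t for t in tokens[i + 1:] if t not in STOPWORDS][:chain_length]
        if len(chain) == chain_length:
            spots.append(":".join([token] + chain))
    return spots

print(spot_signatures("at a rally to kick off a weeklong campaign"))
# ['a:rally:kick', 'a:weeklong:campaign']
```

Because navigation bars and advertisements tend to contain few stopwords, most of the spots come from the running text of the article, which is what makes the signatures robust to the surrounding page furniture.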