Data Structures & Algorithms

Pdf/1204.3581v1.

Getting Cirrius: Calculating Similarity (Part 2): Jaccard, Sørensen and Jaro-Winkler Similarity. In my last post on calculating string similarity, I focused on one algorithm (Cosine Similarity) and some of the concepts involved in calculating it. In this post, I want to introduce three more ways of calculating similarity. All three are a little less intensive to calculate than cosine similarity, and one in particular is probably more accurate. The three algorithms are:

Jaccard Similarity
Sørensen Similarity
Jaro-Winkler Similarity

I'm going to present the implementation of each of these algorithms and then, at the end of the post, the results they return for a couple of example sets of strings.

Jaccard Similarity is very easy to calculate and actually required no extra programming on my part to implement (all the utility functions required were created for the Cosine Similarity implementation). The equation is remarkably simple (length of the intersection divided by the length of the union): J(A, B) = |A ∩ B| / |A ∪ B|. The equation for Sørensen Similarity:

Wavelet Trees - an Introduction.

Personales.dcc.uchile.cl/~gnavarro/ps/wsp96.1.pdf.

Bitap algorithm. The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates–Gonnet algorithm) is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance: if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal.
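Going back to the Jaccard and Sørensen measures from the Getting Cirrius excerpt, here is a minimal sketch of both in Python. The excerpt does not say how strings are tokenised, so comparing sets of characters is an assumption; the function names are hypothetical; and since the Sørensen equation is cut off above, the code uses the standard Sørensen-Dice definition, 2|A ∩ B| / (|A| + |B|). Returning 1.0 for two empty inputs is a convention chosen here, not something the post states.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| over the sets of items in a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both inputs empty: treated as identical (a convention, not from the source).
        return 1.0
    return len(set_a & set_b) / len(union)


def sorensen_similarity(a, b):
    """2|A ∩ B| / (|A| + |B|), the standard Sørensen-Dice coefficient."""
    set_a, set_b = set(a), set(b)
    total = len(set_a) + len(set_b)
    if total == 0:
        return 1.0
    return 2 * len(set_a & set_b) / total


if __name__ == "__main__":
    # "night" and "nacht" share the characters n, h and t.
    print(jaccard_similarity("night", "nacht"))   # 3 shared / 7 distinct
    print(sorensen_similarity("night", "nacht"))  # 2 * 3 / (5 + 5)
```

Passing lists of words instead of strings compares word sets rather than character sets, since both functions only rely on `set()`.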

The bitap algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast. Due to the data structures required by the algorithm, it performs best on patterns shorter than a constant length (typically the word length of the machine in question), and also prefers inputs over a small alphabet.

Exact searching. The bitap algorithm for exact string searching, in full generality, looks like this in pseudocode:

Fuzzy searching.

Www.fsl.cs.sunysb.edu/~rick/richard_spillane.pdf.

Ctrie. Not to be confused with C-trie.

Operation. The Ctrie data structure is a non-blocking concurrent hash array mapped trie based on single-word compare-and-swap instructions in a shared-memory system.
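For the bitap exact search described in the excerpt above, whose pseudocode did not survive on this page, a rough Python sketch might look like the following. It uses the shift-and formulation (one of the names the excerpt lists) rather than the inverted shift-or form, and the function name `bitap_search` is hypothetical. Python integers are unbounded, so the machine-word limit on pattern length mentioned above does not apply here.

```python
def bitap_search(text, pattern):
    """Return the index of the first exact match of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0

    # Precompute one bitmask per character: bit i is set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    accept = 1 << (m - 1)  # bit that signals a match of the whole pattern
    state = 0              # bit i set: pattern[:i + 1] matches text ending here

    for j, ch in enumerate(text):
        # Extend every partial match by one character and start a new one,
        # then keep only those consistent with the current character.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            return j - m + 1
    return -1


if __name__ == "__main__":
    print(bitap_search("hello world", "world"))
    print(bitap_search("hello world", "xyz"))
```

Each text character costs one shift, one OR and one AND, which is where the speed of the approach comes from.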

The Ctrie supports concurrent lookup, insert and remove operations. Just like the hash array mapped trie, it uses the entire 32-bit space for hash values, thus having a low risk of hashcode collisions. Each node may branch to up to 32 sub-tries. To conserve memory, each node contains a 32-bit bitmap, where each bit indicates the presence of a branch, followed by an array of length equal to the Hamming weight of the bitmap. Keys are inserted by doing an atomic compare-and-swap operation on the node which needs to be modified. The figure above illustrates the Ctrie insert operation. The Ctrie is defined by the pointer to the root indirection node (or a root I-node). A C-node is a branching node. The bit in the bitmap and the position in the array for a given hashcode and level are computed as:

bit = bmp & (1 << ((hashcode >> level) & 0x1F))
pos = bitcount((bit - 1) & bmp)

Advantages of Ctries.

Cg.scs.carleton.ca/~morin/teaching/5408/notes/strings.pdf.
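The bitmap indexing in the Ctrie excerpt above can be tried out with a small sequential sketch. This is not a concurrent Ctrie: there are no I-nodes and no compare-and-swap, only the bitmap-plus-compact-array lookup inside a single C-node. The function names are hypothetical. One deliberate difference from the two formulas as quoted: the flag is kept unmasked, so that the same position can be used when inserting a branch that is not yet present.

```python
def bitcount(x):
    """Number of set bits (the Hamming weight) of a non-negative integer."""
    return bin(x).count("1")


def flag_and_pos(hashcode, level, bmp):
    """Return (flag, pos) for the 5 hashcode bits selected at this level."""
    index = (hashcode >> level) & 0x1F    # which of the 32 branches
    flag = 1 << index                     # that branch's bit in the bitmap
    pos = bitcount((flag - 1) & bmp)      # branches stored before this one
    return flag, pos


def lookup_branch(bmp, array, hashcode, level):
    """Return the branch for hashcode at this level, or None if absent."""
    flag, pos = flag_and_pos(hashcode, level, bmp)
    return array[pos] if bmp & flag else None


def with_branch(bmp, array, hashcode, level, branch):
    """Return a new (bmp, array) pair with the branch added or replaced."""
    flag, pos = flag_and_pos(hashcode, level, bmp)
    if bmp & flag:
        return bmp, array[:pos] + [branch] + array[pos + 1:]
    return bmp | flag, array[:pos] + [branch] + array[pos:]


if __name__ == "__main__":
    bmp, array = 0, []
    for hashcode, value in [(4, "d"), (1, "a"), (2, "b")]:
        bmp, array = with_branch(bmp, array, hashcode, 0, value)
    print(bin(bmp), array)
    print(lookup_branch(bmp, array, 4, 0), lookup_branch(bmp, array, 3, 0))
```

`with_branch` returns a fresh pair instead of mutating in place, which loosely mirrors how an updated node would be built before being swapped in atomically.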

Arrays, Linked Lists, Huffman Trees, Leftist Trees.

Introduction to Data Structures.

Detecting similar and identical images using perceptual hashes - Hacker Labs. A couple of my hobbies are travelling and photography. I love to take pictures and experiment with photography. Usually after my trips, I just copy the photos to either my iPad or a couple of my external hard disks. After 10 years, I have over 200K photos distributed across several disks and machines. I had to find a way to organize these photos and create a workflow for future maintenance. In this post I want to address one of the issues I had to solve: finding duplicate images. First, I needed to find out what exactly a duplicate image is. Analysing my photos, I found a couple of interesting things:

1. Identical images: There were multiple copies of the same photo in different directories with different names.
2. Similar images: I usually bracket (exposure compensate or flash compensate) important pictures.

Identical photos (1) are easy to find. Finding similar photos (2) is a little more challenging.

Image similarity algorithms
Color histogram as fingerprint
GQview’s image comparison
Using Phash

pHash.org: Home of pHash, the open source perceptual hash library.

Intermediate Data Containers. A few basic data structures:

Dynamic arrays
Linked lists
Unordered maps
Ordered maps
Ordered maps (over finite keys)
Fully persistent ordered maps
Fully persistent ordered sets
Heaps

1. Compact dynamic array (compact-arrays)

An indexable deque which is optimal in space and time [1]. This is simply an O(sqrt N) array of O(sqrt N) sub-arrays. Two lists of arrays are maintained, small and big (twice as large). Pointers to the head/tail indexes and to the boundary between small and big sub-arrays are also maintained.

Conceptually, the virtual array is the concatenation of all small sub-arrays followed by the big sub-arrays, and indexed between head/tail. All operations are straightforward.

Asymptotic complexity:

O(1) worst case queries (get/set)
O(1) amortized, O(sqrt N) worst case update (push/pop) at both ends
N + O(sqrt N) records of space

Variant: Compact integer arrays. This is implemented by growing the integer range of sub-arrays dynamically when an update overflows.

2. Monolithic lists

Succinct lists. Thus at least lg(N!)
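For the compact dynamic array in item 1 above, the core idea (a virtual array laid over a list of sub-arrays and indexed from a head offset) can be illustrated with a much simplified Python sketch. It is not the structure the excerpt describes: the block size is fixed instead of tracking sqrt N, there is no small/big split, and `pop_front` is left out. The class and method names are hypothetical.

```python
class BlockedDeque:
    """Indexable deque stored as a list of fixed-size blocks (simplified sketch)."""

    def __init__(self, block_size=4):
        self.block_size = block_size  # fixed here; the real structure keeps it near sqrt(N)
        self.blocks = []              # every block is a list of block_size slots
        self.head = 0                 # physical slot holding logical index 0
        self.size = 0

    def __len__(self):
        return self.size

    def _locate(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return divmod(self.head + index, self.block_size)

    def __getitem__(self, index):      # O(1) get
        block, offset = self._locate(index)
        return self.blocks[block][offset]

    def __setitem__(self, index, value):  # O(1) set
        block, offset = self._locate(index)
        self.blocks[block][offset] = value

    def push_back(self, value):
        block, offset = divmod(self.head + self.size, self.block_size)
        if block == len(self.blocks):
            self.blocks.append([None] * self.block_size)
        self.blocks[block][offset] = value
        self.size += 1

    def push_front(self, value):
        if self.head == 0:
            # Shifts the list of block references, not the elements themselves.
            self.blocks.insert(0, [None] * self.block_size)
            self.head = self.block_size
        self.head -= 1
        self.blocks[0][self.head] = value
        self.size += 1

    def pop_back(self):
        if self.size == 0:
            raise IndexError("pop from empty deque")
        block, offset = divmod(self.head + self.size - 1, self.block_size)
        value = self.blocks[block][offset]
        self.blocks[block][offset] = None
        self.size -= 1
        if self.size == 0:
            self.blocks, self.head = [], 0
        elif offset == 0:
            self.blocks.pop()          # last block is now empty
        return value
```

Get and set cost one `divmod` and two list lookups. Pushing at the front occasionally inserts a new block reference at position 0, which costs time proportional to the number of blocks rather than the number of elements.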

Set, Map and Approximate Map Containers