Bitap algorithm
The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates–Gonnet algorithm) is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance: if the substring and pattern are within a given distance k of each other, the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. It can then do most of the work with bitwise operations, which are extremely fast. Because of the data structures it requires, the algorithm performs best on patterns shorter than a constant length (typically the word length of the machine in question), and it also prefers inputs over a small alphabet.

Exact searching

The bitap algorithm for exact string searching can be expressed in a few lines of pseudocode; fuzzy searching within distance k extends the same scan by keeping one bit array per allowed error.
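A minimal shift-and sketch of the exact search in Python (names are illustrative, not from the article; Python integers are arbitrary-precision, so the word-length limit above constrains C-style implementations rather than this sketch):

# Minimal shift-and exact search, assuming the pattern fits in one bitmask.
def bitap_exact(text, pattern):
    m = len(pattern)
    if m == 0:
        return 0
    # One bitmask per symbol: bit i is set iff pattern[i] equals that symbol.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    R = 0  # bit i of R: pattern[0..i] matches the text ending at position j
    for j, c in enumerate(text):
        R = ((R << 1) | 1) & masks.get(c, 0)
        if R & (1 << (m - 1)):
            return j - m + 1  # start index of the first exact match
    return -1  # no match

print(bitap_exact("hello world", "world"))  # -> 6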

Ctrie

Not to be confused with C-trie.

Operation

The Ctrie data structure is a non-blocking concurrent hash array mapped trie based on single-word compare-and-swap instructions in a shared-memory system. It supports concurrent lookup, insert and remove operations. Keys are inserted by doing an atomic compare-and-swap operation on the node which needs to be modified. The Ctrie is defined by the pointer to the root indirection node (or root I-node). A C-node is a branching node: it holds a bitmap bmp and a compact array of branches, and the branch for a given key is located by

bit = bmp & (1 << ((hashcode >> level) & 0x1F))
pos = bitcount((bit - 1) & bmp)

Note that the operations treat only the I-nodes as mutable nodes; all other nodes are never changed after being created and added to the Ctrie. The insert operation retries from the root whenever its recursive helper reports a failed compare-and-swap:

def insert(k, v)
    r = READ(root)
    if iinsert(r, k, v, 0, null) = RESTART
        insert(k, v)
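As a concrete reading of the two indexing lines above, here is a minimal Python sketch of C-node branch selection; it factors the slot bit out into flag so that pos is also meaningful as an insertion point when the branch is absent (names are illustrative):

def branch_position(bmp, hashcode, level):
    index = (hashcode >> level) & 0x1F       # next 5 bits of the hash
    flag = 1 << index                        # bitmap bit for this slot
    bit = bmp & flag                         # nonzero iff a branch exists
    pos = bin((flag - 1) & bmp).count("1")   # rank: branches stored below it
    return bit, pos

# Example: with bmp = 0b10110 and hash bits 4 at this level, flag = 0b10000,
# bit is nonzero, and pos = 2 -- the branch's index in the compact array.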

Detecting similar and identical images using perceptual hashes - Hacker Labs

A couple of my hobbies are travelling and photography. I love to take pictures and experiment with photography. Usually after my trips, I just copy the photos to my iPad or to a couple of external hard disks. After 10 years, I have over 200K photos distributed across several disks and machines. I had to find a way to organize these photos and create a workflow for future maintenance.

First, I needed to find out what exactly counts as a duplicate image:

1. Identical images: multiple copies of the same photo in different directories with different names.
2. Similar images: I usually bracket (exposure compensate or flash compensate) important pictures.

Identical photos (1) are easy to find. Finding similar photos (2) is a little more challenging; they may be:

- Exposure or flash compensated, so slightly lighter or darker than the original.
- Taken with a small difference in the point of focus or a slight change in point of view.
- Photos with a changed aspect ratio.
- Resized photos, etc.

Image similarity algorithms
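One simple member of the perceptual-hash family is the average hash. The sketch below (using Pillow; not necessarily the post's exact method) shrinks an image to an 8x8 grayscale thumbnail and records which pixels are brighter than the mean, so near-duplicates land a small Hamming distance apart:

from PIL import Image

def average_hash(path, size=8):
    # Shrink to size x size grayscale; one bit per pixel above the mean.
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (p > mean)
    return bits

def hamming(a, b):
    # Number of differing bits; small distance means similar images.
    return bin(a ^ b).count("1")

# Identical files hash identically; lightened, refocused or resized copies
# typically stay within a small threshold (e.g. <= 5 bits out of 64).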

pHash.org: Home of pHash, the open source perceptual hash library

Getting Cirrius: Calculating Similarity (Part 2): Jaccard, Sørensen and Jaro-Winkler Similarity

In my last post on calculating string similarity, I focused on one algorithm (Cosine Similarity) and some of the concepts involved in its calculation. In this post, I want to introduce three more ways of calculating similarity. The three methods I present today are all a little less intensive to calculate than cosine similarity, and one in particular is probably more accurate. The three algorithms are:

- Jaccard Similarity
- Sørensen Similarity
- Jaro-Winkler Similarity

I'm going to present the implementation of each of these algorithms, and then at the end of the post, the results they return for a couple of example sets of strings. Let's begin in order, starting with Jaccard Similarity. Jaccard Similarity is very easy to calculate and actually required no extra programming on my part to implement (all the utility functions required were created for the Cosine Similarity implementation). Here is my implementation of the Jaccard Similarity Coefficient:
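As a stand-in for the post's code, a minimal Python sketch of the Jaccard coefficient over word sets (the original reused the Cosine Similarity utilities and is not reproduced here):

def jaccard(a, b):
    # |A intersect B| / |A union B| over the two strings' word sets.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not (sa | sb):
        return 1.0  # two empty strings: treat as identical
    return len(sa & sb) / len(sa | sb)

print(jaccard("the quick brown fox", "the quick red fox"))  # -> 0.6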

Intermediate Data Containers

A few basic data structures:

- Dynamic arrays
- Linked lists
- Unordered maps
- Ordered maps
- Ordered maps (over finite keys)
- Fully persistent ordered maps
- Fully persistent ordered sets
- Heaps

1. Compact dynamic array (compact-arrays)

An indexable deque which is optimal in space and time [1]. This is simply an O(sqrt N) array of O(sqrt N) sub-arrays. Two lists of sub-arrays are maintained, small and big (twice as large), together with pointers to the head/tail indexes and to the big/small separation. Conceptually, the virtual array is the concatenation of all small sub-arrays followed by the big sub-arrays, indexed between head and tail. All operations are straightforward (see the simplified sketch after these notes).

Asymptotic complexity:

- O(1) worst-case queries (get/set)
- O(1) amortized, O(sqrt N) worst-case updates (push/pop) at both ends
- N + O(sqrt N) records of space

Variant: compact integer arrays. This is implemented by growing the integer range of the sub-arrays dynamically when an update overflows.

2. Monolithic lists

Succinct lists: distinguishing the N! possible orderings thus takes at least lg(N!) bits.
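A simplified Python sketch of the compact dynamic array above: fixed-size blocks stand in for the small/big scheme, so it shows the O(1) two-level indexing and deque growth but not the full N + O(sqrt N) space bound (names are illustrative; pops and bounds checks omitted for brevity):

class BlockDeque:
    def __init__(self, block=64):
        self.block = block
        self.blocks = [[None] * block]  # the array of sub-arrays
        self.head = 0                   # physical index of element 0
        self.size = 0

    def _slot(self, i):
        j = self.head + i               # translate virtual to physical index
        return self.blocks[j // self.block], j % self.block

    def get(self, i):
        blk, off = self._slot(i)
        return blk[off]

    def set(self, i, v):
        blk, off = self._slot(i)
        blk[off] = v

    def push_back(self, v):
        if self.head + self.size == len(self.blocks) * self.block:
            self.blocks.append([None] * self.block)    # grow at the tail
        self.size += 1
        self.set(self.size - 1, v)

    def push_front(self, v):
        if self.head == 0:
            self.blocks.insert(0, [None] * self.block) # grow at the head
            self.head = self.block
        self.head -= 1
        self.size += 1
        self.set(0, v)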
