# Bloom filter

Bloom proposed the technique for applications where the amount of source data would require an impracticably large hash area in memory if "conventional" error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an error-free hash could be used to eliminate all unnecessary disk accesses; on the other hand, with limited core memory, Bloom's technique uses a smaller hash area but still eliminates most unnecessary accesses. More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set (Bonomi et al. (2006)). Algorithm description An example of a Bloom filter, representing the set {x, y, z}. Space and time advantages The false positive probability . .

Dijkstra's algorithm The algorithm exists in many variants; Dijkstra's original variant found the shortest path between two nodes,[3] but a more common variant fixes a single node as the "source" node and finds shortest paths from the source to all other nodes in the graph, producing a shortest-path tree. For a given source node in the graph, the algorithm finds the shortest path between that node and every other.[4]:196–206 It can also be used for finding the shortest paths from a single node to a single destination node by stopping the algorithm once the shortest path to the destination node has been determined. For example, if the nodes of the graph represent cities and edge path costs represent driving distances between pairs of cities connected by a direct road, Dijkstra's algorithm can be used to find the shortest route between one city and all other cities. Dijkstra's original algorithm does not use a min-priority queue and runs in time (where is the number of nodes). History Algorithm is and

Why Bloom filters work the way they do Imagine you’re a programmer who is developing a new web browser. There are many malicious sites on the web, and you want your browser to warn users when they attempt to access dangerous sites. For example, suppose the user attempts to access You’d like a way of checking whether domain is known to be a malicious site. What’s a good way of doing this? An obvious naive way is for your browser to maintain a list or set data structure containing all known malicious domains. In this post I’ll describe a data structure which provides an excellent way of solving this kind of problem. Most explanations of Bloom filters cut to the chase, quickly explaining the detailed mechanics of how Bloom filters work. In this post I take an unusual approach to explaining Bloom filters. Of course, this means that if your goal is just to understand the mechanics of Bloom filters, then this post isn’t for you. A stylistic note: Most of my posts are code-oriented. of objects. is a member of . ?

Hash table A small phone book as a hash table Hashing The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a key, the algorithm computes an index that suggests where the entry can be found: index = f(key, array_size) Often this is done in two steps: hash = hashfunc(key) index = hash % array_size In this method, the hash is independent of the array size, and it is then reduced to an index (a number between 0 and array_size − 1) using the modulus operator (%). Choosing a good hash function A good hash function and implementation algorithm are essential for good hash table performance, but may be difficult to achieve. The distribution needs to be uniform only for table sizes that occur in the application. For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Perfect hash function Perfect hashing allows for constant time lookups in the worst case. Key statistics

Donut math: how donut.c works -- a1k0n There has been a sudden resurgence of interest in my "donut" code from 2006, and I’ve had a couple requests to explain this one. It’s been five years now, so it’s not exactly fresh in my memory, so I will reconstruct it from scratch, in great detail, and hopefully get approximately the same result. This is the code and the output, animated in Javascript: At its core, it’s a framebuffer and a Z-buffer into which I render pixels. Since it’s just rendering relatively low-resolution ASCII art, I massively cheat. All it does is plot pixels along the surface of the torus at fixed-angle increments, and does it densely enough that the final result looks solid. So how do we do that? To render a 3D object onto a 2D screen, we project each point (x,y,z) in 3D-space onto a plane located z’ units away from the viewer, so that the corresponding 2D position is (x’,y’). \begin{aligned} \frac{y'}{z'} &= \frac{y}{z} \\ y' &= \frac{y z'}{z}. . and use that when depth buffering because: . Here it is:

Operational transformation Operational transformation (OT) is a technology for supporting a range of collaboration functionalities in advanced collaborative software systems. OT was originally invented for consistency maintenance and concurrency control in collaborative editing of plain text documents. Two decades of research has extended its capabilities and expanded its applications to include group undo, locking, conflict resolution, operation notification and compression, group-awareness, HTML/XML and tree-structured document editing, collaborative office productivity tools, application-sharing, and collaborative computer-aided media design tools (see OTFAQ). In 2009 OT was adopted as a core technique behind the collaboration features in Apache Wave and Google Docs. History Operational Transformation was pioneered by C. System architecture Basics The basic idea of OT can be illustrated by using a simple text editing scenario as follows. Consistency models The CC model T(ins( ),ins( and

Red–black tree A red–black tree is a data structure which is a type of self-balancing binary search tree. Balance is preserved by painting each node of the tree with one of two colors (typically called 'red' and 'black') in a way that satisfies certain properties, which collectively constrain how unbalanced the tree can become in the worst case. When the tree is modified, the new tree is subsequently rearranged and repainted to restore the coloring properties. The properties are designed in such a way that this rearranging and recoloring can be performed efficiently. The balancing of the tree is not perfect but it is good enough to allow it to guarantee searching in O(log n) time, where n is the total number of elements in the tree. Tracking the color of each node requires only 1 bit of information per node because there are only two colors. History Terminology Properties An example of a red–black tree Analogy to B-trees of order 4 Notes Applications and related data structures

S-99: Ninety-Nine Scala Problems These are an adaptation of the Ninety-Nine Prolog Problems written by Werner Hett at the Berne University of Applied Sciences in Berne, Switzerland. I (Phil Gold) have altered them to be more amenable to programming in Scala. Feedback is appreciated, particularly on anything marked TODO. The problems have different levels of difficulty. Your goal should be to find the most elegant solution of the given problems. Solutions are available by clicking on the link at the beginning of the problem description. [I don't have example solutions to all of the problems yet. Working with lists In Scala, lists are objects of type List[A], where A can be any type. The solutions to the problems in this section will be in objects named after the problems (P01, P02, etc.). In many cases, there's more than one reasonable approach. P01 (*) Find the last element of a list. Example: scala> last(List(1, 1, 2, 3, 5, 8)) res0: Int = 8 P02 (*) Find the last but one element of a list. P05 (*) Reverse a list. Examples:

Genetic Programming: Evolution of Mona Lisa | Roger Alsing Weblog [EDIT] Added FAQ here: Gallery here: This weekend I decided to play around a bit with genetic programming and put evolution to the test, the test of fine art :-) I created a small program that keeps a string of DNA for polygon rendering. The procedure of the program is quite simple: 0) Setup a random DNA string (application start) 1) Copy the current DNA sequence and mutate it slightly 2) Use the new DNA to render polygons onto a canvas 3) Compare the canvas to the source image 4) If the new painting looks more like the source image than the previous painting did, then overwrite the current DNA with the new DNA 5) repeat from 1 Now to the interesting part :-) Could you paint a replica of the Mona Lisa using only 50 semi transparent polygons? That is the challenge I decided to put my application up to. So what do you think? Like this: Like Loading...

Heap Example of a complete binary max-heap with node keys being integers from 1 to 100 1. the min-heap property: the value of each node is greater than or equal to the value of its parent, with the minimum-value element at the root. 2. the max-heap property: the value of each node is less than or equal to the value of its parent, with the maximum-value element at the root. Throughout this article the word heap will always refer to a min-heap. In a heap the highest (or lowest) priority element is always stored at the root, hence the name heap. Note that, as shown in the graphic, there is no implied ordering between siblings or cousins and no implied sequence for an in-order traversal (as there would be in, e.g., a binary search tree). A heap data structure should not be confused with the heap which is a common name for the pool of memory from which dynamically allocated memory is allocated. Heaps are usually implemented in an array, and do not require pointers between elements.

Brushing Up on Computer Science Part 4, Algorithms » Victus Spiritus “An algorithm must be seen to be believed.”Donald Knuth How we ended up here It all began a few days ago with an email from a friend (thanks Denny). I was inspired to dust off my software engineering cap, and review a few choice topics in computer science. The table of contents for this blog series: Algorithms In short, an algorithm is a recipe. A formula or set of steps for solving a particular problem. Understanding Strategies beats Memorizing Tactics Algorithms are well specified techniques for performing an unbounded variety of tasks (good luck learning all algorithms). There are broad patterns common to vastly separated problem spaces. A couple of the most famous divide and conquer techniques are the FFT and MapReduce (slides, web article). I first came across map reduce while reading a set of slides by Jeff Dean, Designs, Lessons and Advice from Building Large Distributed Systems. Sorting Algorithms Quicksort The quicksort on average requires O(n log n) comparisons and worse case O(n^2).

Bitap algorithm The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates–Gonnet algorithm) is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance — if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Due to the data structures required by the algorithm, it performs best on patterns less than a constant length (typically the word length of the machine in question), and also prefers inputs over a small alphabet. Exact searching The bitap algorithm for exact string searching, in full generality, looks like this in pseudocode: Fuzzy searching External links and references

Disjoint-set data structure MakeSet creates 8 singletons. After some operations of Union, some sets are grouped together. Find: Determine which subset a particular element is in. In order to define these operations more precisely, some way of representing the sets is needed. Disjoint-set linked lists A simple approach to creating a disjoint-set data structure is to create a linked list for each set. MakeSet creates a list of one element. This can be avoided by including in each linked list node a pointer to the head of the list; then Find takes constant time, since this pointer refers directly to the set representative. When the length of each list is tracked, the required time can be improved by always appending the smaller list to the longer. Analysis of the naive approach We now explain the bound above. Suppose you have a collection of lists and each node of each list contains an object, the name of the list to which it belongs, and the number of elements in that list. (i.e. there are elements overall).

Software Updates: Courgette The source code does not have this problem because all the entities in the source are symbolic. Functions don't get committed to a specific address until very late in the compilation process, during assembly or linking. If we could step backwards a little and make the internal pointers symbolic again, could we get smaller updates? Courgette uses a primitive disassembler to find the internal pointers. The disassembler splits the program into three parts: a list of the internal pointer's target addresses, all the other bytes, and an 'instruction' sequence that determines how the plain bytes and the pointers need to be interleaved and adjusted to get back the original input. We call this an 'assembly language' because we can run an 'assembler' to process the instructions and emit a sequence of bytes to recover the original file. We bring the pointers under control by introducing 'labels' for the addresses. How do we use this to generate a better diff? server: diff = bsdiff(original, update)

Related: