
Bloom filter

Bloom proposed the technique for applications where the amount of source data would require an impracticably large hash area in memory if "conventional" error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, of which 90% follow simple hyphenation rules, while the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an error-free hash could be used to eliminate all unnecessary disk accesses; with limited core memory, Bloom's technique uses a smaller hash area but still eliminates most unnecessary accesses. More generally, fewer than 10 bits per element are required for a 1% false-positive probability, independent of the size or number of elements in the set (Bonomi et al., 2006).
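A minimal sketch of the technique in Python, assuming arbitrary parameters (m = 1024 bits, k = 7 hash functions derived by double hashing from one SHA-256 digest); Bloom's original application used whatever hash area fit in core memory:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=7):
        self.m = m                  # number of bits in the filter
        self.k = k                  # number of hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions from two halves of one digest
        # (double hashing: h1 + i*h2 mod m).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))
```

A query answering False is definitive; a query answering True may, with small probability, be a false positive, which is exactly the trade that saves space.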

Hash table A small phone book as a hash table. The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a key, the algorithm computes an index that suggests where the entry can be found: index = f(key, array_size). Often this is done in two steps: hash = hashfunc(key); index = hash % array_size. In this method the hash is independent of the array size, and it is then reduced to an index (a number between 0 and array_size − 1) using the modulo operator (%). A good hash function and implementation algorithm are essential for good hash table performance, but may be difficult to achieve. The distribution needs to be uniform only for table sizes that occur in the application. For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Perfect hashing allows for constant-time lookups in the worst case.
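The two-step index computation can be sketched as follows, using Python's built-in hash() as a stand-in for hashfunc (an assumption for illustration, not a production choice) and separate chaining for collisions:

```python
def bucket_index(key, array_size):
    h = hash(key)              # step 1: hash, independent of table size
    return h % array_size      # step 2: reduce to 0 .. array_size - 1

buckets = [[] for _ in range(8)]   # 8 buckets, each a chain of entries

def put(key, value):
    buckets[bucket_index(key, 8)].append((key, value))

def get(key):
    for k, v in buckets[bucket_index(key, 8)]:
        if k == key:
            return v
    return None
```

Two keys that land in the same bucket simply share its chain; a uniform hash keeps the chains short.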

Red–black tree A red–black tree is a data structure which is a type of self-balancing binary search tree. Balance is preserved by painting each node of the tree with one of two colors (typically called 'red' and 'black') in a way that satisfies certain properties, which collectively constrain how unbalanced the tree can become in the worst case. When the tree is modified, the new tree is subsequently rearranged and repainted to restore the coloring properties; the properties are designed so that this rearranging and recoloring can be performed efficiently. The balancing of the tree is not perfect, but it is good enough to guarantee searching in O(log n) time, where n is the total number of elements in the tree. Tracking the color of each node requires only 1 bit of information per node because there are only two colors.
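The single bit of per-node color bookkeeping can be sketched as below (a hypothetical minimal layout; the rotations and recoloring that restore balance after modification are omitted). The helper checks one of the coloring properties, that every root-to-leaf path crosses the same number of black nodes:

```python
RED, BLACK = True, False       # one bit of color per node

class RBNode:
    __slots__ = ("key", "color", "left", "right")
    def __init__(self, key, color=RED):
        self.key = key
        self.color = color
        self.left = self.right = None

def black_height(node):
    # Returns the black-height if every root-to-leaf path crosses
    # the same number of black nodes, or -1 if the property is violated.
    if node is None:
        return 1                                # nil leaves count as black
    lh, rh = black_height(node.left), black_height(node.right)
    if lh == -1 or rh == -1 or lh != rh:
        return -1
    return lh + (0 if node.color == RED else 1)
```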

Free English Corpora Roundup: Open English Corpora (1) — jinchangge's blog (NetEase), 2010-06-28, filed under: corpora. The list is constantly updated. Strictly speaking, some of them are not corpora, but archives, databases or even dictionaries. 1. Corpus of Global Web-Based English (GloWbE); COCA; COHA; Download N-Grams from COCA and COHA; BYU-TIME; Bank of English (BoE): 1 month free trial A. B. C.

Heap Example of a complete binary max-heap with node keys being integers from 1 to 100. A heap satisfies one of two orderings: 1. the min-heap property: the value of each node is greater than or equal to the value of its parent, with the minimum-value element at the root; 2. the max-heap property: the value of each node is less than or equal to the value of its parent, with the maximum-value element at the root. Throughout this article the word heap will always refer to a min-heap. In a heap the highest- (or lowest-) priority element is always stored at the root, hence the name heap. Note that, as shown in the graphic, there is no implied ordering between siblings or cousins and no implied sequence for an in-order traversal (as there would be in, e.g., a binary search tree). A heap data structure should not be confused with the heap, a common name for the pool of memory from which dynamically allocated memory is drawn. Heaps are usually implemented in an array and do not require pointers between elements.
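The pointer-free array layout can be sketched as follows: parent and child positions come from index arithmetic alone, and the min-heap property is restored after each append by sifting the new element up (function names are illustrative):

```python
def parent(i): return (i - 1) // 2
def left(i):   return 2 * i + 1
def right(i):  return 2 * i + 2

def sift_up(heap, i):
    # Swap upward until the min-heap property holds again.
    while i > 0 and heap[i] < heap[parent(i)]:
        heap[i], heap[parent(i)] = heap[parent(i)], heap[i]
        i = parent(i)

def push(heap, value):
    heap.append(value)
    sift_up(heap, len(heap) - 1)

h = []
for v in [5, 3, 8, 1]:
    push(h, v)
# The minimum element is now at the root, h[0].
```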

Disjoint-set data structure MakeSet creates 8 singletons. After some operations of Union, some sets are grouped together. Find: determine which subset a particular element is in. In order to define these operations more precisely, some way of representing the sets is needed. A simple approach to creating a disjoint-set data structure is to create a linked list for each set: MakeSet creates a list of one element. With this representation, Find would require traversing a list to reach its head; this can be avoided by including in each linked-list node a pointer to the head of the list, so that Find takes constant time, since this pointer refers directly to the set representative. When the length of each list is tracked, the required time for Union can be improved by always appending the smaller list to the longer. To analyze this naive approach, suppose you have a collection of lists, and each node of each list contains an object, the name of the list to which it belongs, and the number of elements in that list (i.e. there are n elements overall).
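The list representation above can be sketched in Python, with dictionaries standing in for the per-node head pointers (an illustrative transcription: each element records its representative, so Find is constant time, and Union appends the smaller set to the larger):

```python
class DisjointSets:
    def __init__(self):
        self.set_of = {}     # element -> representative (head of its list)
        self.members = {}    # representative -> list of elements

    def make_set(self, x):
        self.set_of[x] = x
        self.members[x] = [x]

    def find(self, x):
        return self.set_of[x]            # O(1): follow the head pointer

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if len(self.members[rx]) < len(self.members[ry]):
            rx, ry = ry, rx              # append the smaller onto the larger
        for elem in self.members.pop(ry):
            self.set_of[elem] = rx       # rewrite each moved head pointer
            self.members[rx].append(elem)
```

Because an element only moves when its set is the smaller of the two, each element's head pointer is rewritten at most O(log n) times over any sequence of unions.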

Binary search tree A binary search tree of size 9 and depth 3, with 8 at the root. The leaves are not drawn. Binary search trees keep their keys in sorted order, so that lookup and other operations can use the principle of binary search: when looking for a key in a tree (or a place to insert a new key), they traverse the tree from root to leaf, comparing against keys stored in the nodes and deciding, on the basis of each comparison, whether to continue searching in the left or right subtree. On average this means that each comparison allows the operations to skip about half of the tree, so that each lookup, insertion, or deletion takes time proportional to the logarithm of the number of items stored in the tree. This is much better than the linear time required to find items by key in an (unsorted) array, but slower than the corresponding operations on hash tables. Frequently, the information represented by each node is a record rather than a single data element.
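The root-to-leaf comparison walk can be sketched as follows (keys only; record payloads omitted for brevity):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = None

def search(node, key):
    # Walk from the root, going left or right after each comparison.
    while node is not None:
        if key == node.key:
            return node
        node = node.left if key < node.key else node.right
    return None                      # key is absent

def insert(node, key):
    # Recreate the same walk to find the empty slot for a new key.
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return node
```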

Linked list A linked list whose nodes contain two fields: an integer value and a link to the next node. The last node is linked to a terminator used to signify the end of the list. The principal benefit of a linked list over a conventional array is that list elements can easily be inserted or removed without reallocating or reorganizing the entire structure, because the data items need not be stored contiguously in memory or on disk, whereas an array occupies one contiguous block whose size is fixed when it is allocated. Linked lists allow insertion and removal of nodes at any point in the list, and can do so with a constant number of operations if the link previous to the link being added or removed is maintained during list traversal. On the other hand, simple linked lists by themselves do not allow random access to the data or any form of efficient indexing.
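The constant-cost insertion can be sketched as below: splicing a new node in after a known node touches exactly two links, regardless of list length (node and helper names are illustrative):

```python
class ListNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next             # link to the next node, or None

def insert_after(node, value):
    # Two link updates, O(1) no matter how long the list is.
    node.next = ListNode(value, node.next)

def to_list(head):
    # Walk the links to the terminator (None) collecting values.
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

head = ListNode(1, ListNode(3))
insert_after(head, 2)                # list becomes 1 -> 2 -> 3
```

By contrast, reaching the k-th element still requires walking k links, which is the lack of random access noted above.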

Priority queue A priority queue must at least support inserting an element with an associated priority and removing the highest-priority element. More advanced implementations may support more complicated operations, such as pull_lowest_priority_element, inspecting the first few highest- or lowest-priority elements, clearing the queue, clearing subsets of the queue, performing a batch insert, merging two or more queues into one, incrementing the priority of any element, and so on. Stacks and queues may be modeled as particular kinds of priority queues. In a stack, the priority of each inserted element is monotonically increasing; thus, the last element inserted is always the first retrieved. There are a variety of simple, usually inefficient, ways to implement a priority queue; the usual efficient implementation is a heap. Note that from a computational-complexity standpoint, priority queues are congruent to sorting algorithms.
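The two required operations can be sketched with Python's standard-library binary-heap module, heapq, which provides a min-heap over a plain list (so the smallest priority number is retrieved first; the task strings are illustrative):

```python
import heapq

pq = []                                       # the heap is just a list
heapq.heappush(pq, (2, "write report"))       # (priority, task) pairs
heapq.heappush(pq, (1, "fix outage"))
heapq.heappush(pq, (3, "tidy desk"))

first = heapq.heappop(pq)     # smallest priority number comes out first
```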

Stack (abstract data type) Similar to a stack of plates, adding or removing is only possible at the top. Simple representation of a stack runtime with push and pop operations. A stack supports two principal operations: push, which adds an element to the collection, and pop, which removes the most recently added element that was not yet removed. Considered as a linear data structure, or more abstractly a sequential collection, the push and pop operations occur only at one end of the structure, referred to as the top of the stack. A stack is needed to implement depth-first search. A stack can be easily implemented either through an array or a linked list. An array can be used to implement a (bounded) stack, as follows:

structure stack:
    maxsize : integer
    top : integer
    items : array of item

procedure initialize(stk : stack, size : integer):
    stk.items ← new array of size items, initially empty
    stk.maxsize ← size
    stk.top ← 0

The push operation adds an element and increments the top index, after checking for overflow:

procedure push(stk : stack, x : item):
    if stk.top = stk.maxsize:
        report overflow error
    else:
        stk.items[stk.top] ← x
        stk.top ← stk.top + 1

In a linked-list implementation, each element is instead stored in a frame that links to the one beneath it:

structure frame:
    data : item
    next : frame or nil
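The bounded-array pseudocode can be transcribed into Python as a sketch (names mirror the pseudocode; a fixed-size list stands in for the item array, and overflow/underflow raise exceptions):

```python
class BoundedStack:
    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.items = [None] * maxsize   # fixed-size backing array
        self.top = 0                    # index of the next free slot

    def push(self, x):
        if self.top == self.maxsize:    # check for overflow first
            raise OverflowError("stack overflow")
        self.items[self.top] = x
        self.top += 1

    def pop(self):
        if self.top == 0:               # nothing left to remove
            raise IndexError("stack underflow")
        self.top -= 1
        return self.items[self.top]     # most recently added element
```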

Locality of reference In computer science, locality of reference, also known as the principle of locality, is the tendency of a program to access the same values, or nearby storage locations, repeatedly over a short period of time. There are two basic types of reference locality: temporal and spatial. Temporal locality refers to the reuse of specific data and/or resources within a relatively small time window: if a particular memory location is referenced at one point in time, it is likely that the same location will be referenced again in the near future. Spatial locality refers to the use of data elements within relatively close storage locations. Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, such as when traversing the elements of a one-dimensional array.
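A small illustration (a sketch of access order, not a benchmark): traversing a two-dimensional table row by row visits elements in the order they are laid out, exhibiting sequential locality, while column-first order visits the same cells but jumps a full row apart on every step:

```python
rows, cols = 4, 4
# Lay the table out so cell (r, c) holds its row-major position r*cols + c.
table = [[r * cols + c for c in range(cols)] for r in range(rows)]

# Row-major traversal: consecutive accesses touch adjacent positions.
row_major = [table[r][c] for r in range(rows) for c in range(cols)]

# Column-major traversal: each access jumps a whole row (cols positions) away.
col_major = [table[r][c] for c in range(cols) for r in range(rows)]
```

On real hardware the row-major order is the cache-friendly one for row-major array layouts, which is why inner loops conventionally run over the last index.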

Dynamic array Several values are inserted at the end of a dynamic array using geometric expansion. Grey cells indicate space reserved for expansion. Most insertions are fast (constant time), while some are slow due to the need for reallocation (Θ(n) time, labelled with turtles). In computer science, a dynamic array, growable array, resizable array, dynamic table, mutable array, or array list is a random-access, variable-size list data structure that allows elements to be added or removed. A dynamic array is not the same thing as a dynamically allocated array, which is an array whose size is fixed when the array is allocated, although a dynamic array may use such a fixed-size array as a back end.[1] A simple dynamic array can be constructed by allocating a fixed-size array, typically larger than the number of elements immediately required.
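Geometric expansion can be sketched as below (a growth factor of 2 is an assumption for illustration; real implementations vary): when the fixed-size backing array fills, a new one with double the capacity is allocated and the elements are copied across, which is the occasional Θ(n) "turtle" insertion:

```python
class DynArray:
    def __init__(self):
        self.capacity = 1               # size of the backing array
        self.size = 0                   # number of elements stored
        self.data = [None] * self.capacity

    def append(self, x):
        if self.size == self.capacity:  # backing array is full
            self._grow()                # the occasional slow step
        self.data[self.size] = x        # the common fast step
        self.size += 1

    def _grow(self):
        self.capacity *= 2              # geometric expansion
        new_data = [None] * self.capacity
        new_data[:self.size] = self.data[:self.size]   # Θ(n) copy
        self.data = new_data
```

Doubling keeps the total copying cost over n appends proportional to n, so each append is amortized constant time.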
