background preloader

Structured diff

Facebook Twitter

Longest common subsequence problem. The longest common subsequence (LCS) problem is to find the longest subsequence common to all sequences in a set of sequences (often just two).

Longest common subsequence problem

(Note that a subsequence is different from a substring, for the terms of the former need not be consecutive terms of the original sequence.) It is a classic computer science problem, the basis of file comparison programs such as diff, and has applications in bioinformatics. Complexity[edit] For the general case of an arbitrary number of input sequences, the problem is NP-hard.[1] When the number of sequences is constant, the problem is solvable in polynomial time by dynamic programming (see Solution below).

Assume you have sequences of lengths . Hirschberg's algorithm. Damerau–Levenshtein distance. While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between DNA.

Damerau–Levenshtein distance

Algorithm[edit] Adding transpositions sounds simple, but in reality there is a serious complication. Bitap algorithm. The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates–Gonnet algorithm) is an approximate string matching algorithm.

Bitap algorithm

The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance — if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast. Due to the data structures required by the algorithm, it performs best on patterns less than a constant length (typically the word length of the machine in question), and also prefers inputs over a small alphabet.

Exact searching[edit] The bitap algorithm for exact string searching, in full generality, looks like this in pseudocode: Fuzzy searching[edit] Needleman–Wunsch algorithm. The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences.

Needleman–Wunsch algorithm

It was published in 1970 by Saul B. Needleman and Christian D. Wunsch;[1] it uses dynamic programming, and was the first application of dynamic programming to biological sequence comparison. It is sometimes referred to as the optimal matching algorithm. Needleman-Wunsch pairwise sequence alignment Sequences Best Alignments --------- ---------------------- GATTACA G-ATTACA G-ATTACA GCATGCU GCATG-CU GCA-TGCU A modern presentation[edit] Scores for aligned characters are specified by a similarity matrix. Smith–Waterman algorithm. The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings or nucleotide or protein sequences.

Smith–Waterman algorithm

Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981.[1] Like the Needleman–Wunsch algorithm, of which it is a variation, Smith–Waterman is a dynamic programming algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme). Explanation[edit] is built as follows: Where: Matching, diffing and merging XML. Update: A newer, more complete version is here.

Matching, diffing and merging XML

I've said bad things about my job working on Carleton College's website, but fundamentally it's a really sound work environment we have. Just before winter break, one of the full-time employees came to me and asked if I could make a diff between two XHTML documents for use in Carleton's CMS, Reason. This would be useful for (a) comparing versions of a document in the CMS (b) merging documents, in case two people edit the same document at the same time, so as to avoid locks and the need for manual merges.

They came to me because I told them I'd written an XML parser. I may know about XML, but I know (or rather knew) nothing about the algorithms required for such a diff and merge. Daisydiff - Project Hosting on Google Code. Daisy Diff is a Java library that diffs (compares) HTML files.

daisydiff - Project Hosting on Google Code

It highlights added and removed words and annotates changes to the styling. (Examples) This project was a Google Summer of Code 2007 project for DaisyCMS where it's actively used for diffing HTML content. Java Notes. By Stuart D.

Java Notes

Gathman Last updated Apr 01, 2013 I have moved the most popular items to the top of the menu. Obsolete, but historically interesting Class Packager for Java If you want to deliver an application or applet with all the classes and resources it needs - and only the classes and resources it needs, then you need ZipLock.java. This is a utility class for examining a list of class names for all of their dependencies. When you create a JAR file with all the classes needed for your application, the JAR file is self-contained. Exploring the problems involved in comparing XML. Diffxml - XML Diff and Patch Utilities. Open Source XML Diff Written in Java.