Longest common subsequence problem The longest common subsequence (LCS) problem is to find the longest subsequence common to all sequences in a set of sequences (often just two). (Note that a subsequence is different from a substring, for the terms of the former need not be consecutive terms of the original sequence.) It is a classic computer science problem, the basis of file comparison programs such as diff, and has applications in bioinformatics. Complexity For the general case of an arbitrary number of input sequences, the problem is NP-hard. When the number of sequences is constant, the problem is solvable in polynomial time by dynamic programming (see Solution below). Assume you have
Algorithm information The Hirschberg algorithm can be derived from the Needleman-Wunsch algorithm by observing that: one can compute the optimal alignment score by only storing the current and previous row of the Needleman-Wunsch score matrix;if is the optimal alignment of , and is an arbitrary partition of , there exists a partition of such that . Hirschberg's algorithm
While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between DNA. Algorithm Adding transpositions sounds simple, but in reality there is a serious complication. Presented here are two algorithms: the first, simpler one, computes what is known as the optimal string alignment (sometimes called the restricted edit distance), while the second one computes the Damerau–Levenshtein distance with adjacent transpositions. The difference between the two algorithms consists in that the optimal string alignment algorithm computes the number of edit operations needed to make the strings equal under the condition that no substring is edited more than once, whereas the second one presents no such restriction. Damerau–Levenshtein distance
The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates–Gonnet algorithm) is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance — if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast. Due to the data structures required by the algorithm, it performs best on patterns less than a constant length (typically the word length of the machine in question), and also prefers inputs over a small alphabet. Bitap algorithm
Needleman–Wunsch algorithm The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was published in 1970 by Saul B. Needleman and Christian D. Wunsch; it uses dynamic programming, and was the first application of dynamic programming to biological sequence comparison.
Smith–Waterman algorithm The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings or nucleotide or protein sequences. Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981. Like the Needleman–Wunsch algorithm, of which it is a variation, Smith–Waterman is a dynamic programming algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme).
Matching, diffing and merging XML Update: A newer, more complete version is here. I've said bad things about my job working on Carleton College's website, but fundamentally it's a really sound work environment we have. Just before winter break, one of the full-time employees came to me and asked if I could make a diff between two XHTML documents for use in Carleton's CMS, Reason. This would be useful for (a) comparing versions of a document in the CMS (b) merging documents, in case two people edit the same document at the same time, so as to avoid locks and the need for manual merges. They came to me because I told them I'd written an XML parser.
Daisy Diff is a Java library that diffs (compares) HTML files. It highlights added and removed words and annotates changes to the styling. (Examples) This project was a Google Summer of Code 2007 project for DaisyCMS where it's actively used for diffing HTML content. As a spin-off, a PHP version of the algorithm was developed for MediaWiki in the GSoC 2008. The Java version is licensed under the Apache License v2. daisydiff - Project Hosting on Google Code
Java Notes by Stuart D. Gathman Last updated Apr 01, 2013 I have moved the most popular items to the top of the menu. Obsolete, but historically interesting Class Packager for Java
i’ve spent the last week trying to get my brain wrapped around the issues involved with performing XML comparisons and how they might be solved. i also decided early on in my reading and discussions that i’d better attack this problem in stages, otherwise i’d have nothing to show for it for a very long time. in other words, this is a non-trivial problem. at least for me <g> so what makes this a hard problem? after all, there’s a ton of “difference” programs out there, like diff, windiff, tkdiff, etc. why not take the two XML-documents to be compared and run one of these programs over them and see what it says? it will certainly tell us something, why can’t we stop there? exploring the problems involved in comparing XML - O'Reilly ONLamp Blog
16 May 2013 - Proposed move to Apache2 license As part of restarting the project, I intend to relicense the project as Apache2 next month. As it stands, all code committed to the project so far has been written by myself, so I don't believe there are any copyright issues to address. However, if anyone is opposed to this change, please let me know immediately. diffxml - XML Diff and Patch Utilities
Open Source XML Diff Written in Java