Machine learning

On-line Prediction Wiki. This website is a wiki for research in on-line prediction. It provides two resources: overviews of various research topics, and descriptions of open questions. There are both articles about theory and articles about experimental results (we try to separate theory and experiments, covering them in different, often cross-referenced, articles). The wiki is hosted by a server in the Department of Computer Science at Royal Holloway, University of London.

If equations do not display properly in your browser, please follow the instructions at Latex Support to correct this. To edit a page, click the "Edit" button at the top of the page. Unfortunately, we have had to require a password for editing pages because of some thoughtless hacker attacks. Graphical model. An example of a graphical model: each arrow indicates a dependency. In this example, D depends on A, B, and C, and C depends on B and D. Types of graphical models. Generally, probabilistic graphical models use a graph-based representation as the foundation for encoding a complete distribution over a multi-dimensional space, the graph being a compact or factorized representation of a set of independences that hold in the specific distribution.

Two branches of graphical representations of distributions are commonly used, namely Bayesian networks and Markov networks. Bayesian network. If the network structure of the model is a directed acyclic graph, the model represents a factorization of the joint probability of all random variables. The joint probability then satisfies $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i))$, where $\mathrm{pa}(X_i)$ is the set of parents of node $X_i$. With a joint probability density that factors in this way, each node is conditionally independent of its non-descendants given the values of its parents. Inference. Inference is the act or process of deriving logical conclusions from premises known or assumed to be true.[1] The conclusion drawn is also called an inference.
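As a concrete illustration of this factorization, here is a minimal sketch for a hypothetical three-node network A → C ← B; the network and its conditional probability tables are invented for the example.

```python
from itertools import product

# Hypothetical three-node network A -> C <- B with binary variables,
# illustrating the factorization P(A, B, C) = P(A) * P(B) * P(C | A, B).
P_A = {0: 0.6, 1: 0.4}
P_B = {0: 0.7, 1: 0.3}
# P(C=1 | A, B), indexed by (a, b)
P_C1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}

def joint(a, b, c):
    """Joint probability from the DAG factorization."""
    pc1 = P_C1[(a, b)]
    return P_A[a] * P_B[b] * (pc1 if c == 1 else 1 - pc1)

# The factored terms define a proper distribution: probabilities sum to 1.
total = sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3))
print(f"sum over all states = {total:.6f}")  # 1.000000

# Marginal P(C=1) by summing out A and B.
p_c1 = sum(joint(a, b, 1) for a, b in product([0, 1], repeat=2))
print(f"P(C=1) = {p_c1:.4f}")  # 0.4080
```

Because the joint is a product of local tables, any marginal or conditional can be obtained by summing such products rather than storing the full joint.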

The laws of valid inference are studied in the field of logic. Alternatively, inference may be defined as the non-logical but rational means of indirectly seeing new meanings and contexts for understanding, through the observation of patterns of facts. Of particular use to this application of inference are anomalies and symbols. Inference, in this sense, does not draw conclusions but opens new paths for inquiry. Human inference (i.e. how humans draw conclusions) is traditionally studied within the field of cognitive psychology; artificial intelligence researchers develop automated inference systems to emulate human inference.

Statistical inference uses mathematics to draw conclusions in the presence of uncertainty. Examples: All men are mortal. Socrates is a man. Therefore, Socrates is mortal. Now we turn to an invalid form: All A are B. All C are B. Therefore, all C are A. (For example, all apples are fruit and all bananas are fruit, but it does not follow that all bananas are apples.) Convex polytope. A 3-dimensional convex polytope. A convex polytope is a special case of a polytope, having the additional property that it is also a convex set of points in the n-dimensional space Rn.[1] Some authors use the terms "convex polytope" and "convex polyhedron" interchangeably, while others prefer to draw a distinction between the notions of a polyhedron and a polytope. In addition, some texts require a polytope to be a bounded set, while others[2] (including this article) allow polytopes to be unbounded. The terms "bounded/unbounded convex polytope" will be used below whenever boundedness is critical to the issue under discussion.

Yet other texts treat a convex n-polytope as a surface or (n-1)-manifold. Convex polytopes play an important role both in various branches of mathematics and in applied areas, most notably in linear programming. A comprehensive and influential book on the subject, Convex Polytopes, was published in 1967 by Branko Grünbaum. UAI - Uncertainty in Artificial Intelligence. Markov chain Monte Carlo. In statistics, Markov chain Monte Carlo (MCMC) methods (which include random-walk Monte Carlo methods) are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps.
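To make the construction concrete, here is a minimal random-walk Metropolis sketch; the standard-normal target and the proposal scale are illustrative assumptions, not from the source.

```python
import math
import random

def target_density(x):
    """Unnormalized target: a standard normal, chosen purely for illustration."""
    return math.exp(-0.5 * x * x)

def metropolis(n_steps, step_size=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis: propose x' = x + noise, accept with prob min(1, p(x')/p(x))."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        # Symmetric proposal, so the Hastings correction cancels.
        if rng.random() < target_density(proposal) / target_density(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(50_000)
burned = samples[5_000:]  # discard burn-in: early samples remember the start
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
print(f"sample mean ~ {mean:.3f} (target 0), sample variance ~ {var:.3f} (target 1)")
```

Discarding the early samples (burn-in) is the practical response to the residual effect of the starting position mentioned below.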

Convergence of the Metropolis-Hastings algorithm: MCMC attempts to approximate the blue target distribution with the orange sampled distribution. Usually it is not hard to construct a Markov chain with the desired properties. Typical use of MCMC sampling can only approximate the target distribution, as there is always some residual effect of the starting position. The most common application of these algorithms is numerically calculating multi-dimensional integrals. Belief propagation. Belief propagation, also known as sum-product message passing, is a message-passing algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. It calculates the marginal distribution for each unobserved node, conditional on any observed nodes. Belief propagation is commonly used in artificial intelligence and information theory and has demonstrated empirical success in numerous applications including low-density parity-check codes, turbo codes, free energy approximation, and satisfiability.[1] If $X = (X_v)$ is a set of discrete random variables with a joint mass function $p$, the marginal distribution of a single $X_i$ is simply the summation of $p$ over all other variables: $p_{X_i}(x_i) = \sum_{\mathbf{x}' : x'_i = x_i} p(\mathbf{x}')$. Description of the sum-product algorithm. Variants of the belief propagation algorithm exist for several types of graphical models (Bayesian networks and Markov random fields,[5] in particular).
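Before the message-passing description, the marginalization formula can be illustrated by brute force; the joint table below is an invented example. The cost of this direct summation is exponential in the number of variables, which is exactly what belief propagation avoids on tree-structured graphs.

```python
from itertools import product

# Hypothetical joint mass function p over three binary variables,
# stored as an explicit table; the values are illustrative only.
p = {x: 1.0 / 8 for x in product([0, 1], repeat=3)}  # start uniform...
p[(1, 1, 1)] = 3.0 / 8                               # ...then tilt one state
z = sum(p.values())
p = {x: v / z for x, v in p.items()}                 # renormalize

def marginal(i, xi):
    """p_{X_i}(x_i): sum the joint over all configurations whose i-th entry is x_i."""
    return sum(v for x, v in p.items() if x[i] == xi)

for xi in (0, 1):
    print(f"p_X0({xi}) = {marginal(0, xi):.4f}")
```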

We describe here the variant that operates on a factor graph. The algorithm passes messages between variable nodes and factor nodes: the message from a variable $v$ to a factor $u$ is $\mu_{v \to u}(x_v) = \prod_{u^* \in N(v) \setminus \{u\}} \mu_{u^* \to v}(x_v)$, and the message from $u$ to $v$ is $\mu_{u \to v}(x_v) = \sum_{\mathbf{x}'_u : x'_v = x_v} f_u(\mathbf{x}'_u) \prod_{v^* \in N(u) \setminus \{v\}} \mu_{v^* \to u}(x'_{v^*})$, where $N(\cdot)$ denotes the neighbors of a node. (A small numerical sketch of this message passing on a chain appears after the abstract below.) Artificial Intelligence: AND/OR Branch-and-Bound search for combinatorial optimization in graphical models. Abstract: This is the first of two papers presenting and evaluating the power of a new framework for combinatorial optimization in graphical models, based on AND/OR search spaces. We introduce a new generation of depth-first Branch-and-Bound algorithms that explore the AND/OR search tree using static and dynamic variable orderings. The virtue of the AND/OR representation of the search space is that its size may be far smaller than that of a traditional OR representation, which can translate into significant time savings for search algorithms. The focus of this paper is on linear-space search, which explores the AND/OR search tree. Keywords: Search; AND/OR search; Decomposition; Graphical models; Bayesian networks; Constraint networks; Constraint optimization.
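Returning to the sum-product messages defined above: on a chain, the message updates reduce to a forward pass and a backward pass. Here is a sketch with invented potentials, checked against brute-force enumeration.

```python
import numpy as np

# Sum-product on a chain MRF X1 - X2 - X3 (binary states); the potentials are
# illustrative assumptions. On a chain the factor-graph messages reduce to
# forward and backward passes.
phi = [np.array([0.7, 0.3]),   # unary factor on X1
       np.array([0.5, 0.5]),   # unary factor on X2
       np.array([0.2, 0.8])]   # unary factor on X3
psi = np.array([[1.0, 0.5],
                [0.5, 1.0]])   # pairwise factor, shared by both edges

n = len(phi)
fwd = [np.ones(2) for _ in range(n)]   # fwd[i]: message arriving at X_i from the left
bwd = [np.ones(2) for _ in range(n)]   # bwd[i]: message arriving at X_i from the right
for i in range(1, n):
    fwd[i] = psi.T @ (phi[i - 1] * fwd[i - 1])
for i in range(n - 2, -1, -1):
    bwd[i] = psi @ (phi[i + 1] * bwd[i + 1])

# Belief at each node: product of local factor and incoming messages, normalized.
for i in range(n):
    b = phi[i] * fwd[i] * bwd[i]
    print(f"marginal of X{i + 1}:", b / b.sum())

# Brute-force check: enumerate all 2^3 configurations of the joint.
joint = np.zeros((2, 2, 2))
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            joint[x1, x2, x3] = (phi[0][x1] * phi[1][x2] * phi[2][x3]
                                 * psi[x1, x2] * psi[x2, x3])
joint /= joint.sum()
print("brute-force marginal of X1:", joint.sum(axis=(1, 2)))
```

The message passes touch each edge twice, so the cost grows linearly in the chain length rather than exponentially as in the brute-force sum.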

Markov blanket. In a Bayesian network, the Markov blanket of a node $A$ is the set of nodes composed of $A$'s parents, its children, and its children's other parents. In a Markov network, the Markov blanket of a node is its set of neighboring nodes. The Markov blanket of $A$ may also be denoted $\partial A$. Every set of nodes in the network is conditionally independent of $A$ when conditioned on the set $\partial A$, that is, when conditioned on the Markov blanket of the node $A$: $\Pr(A \mid \partial A, B) = \Pr(A \mid \partial A)$ for any set of nodes $B$ outside the blanket. The Markov blanket of a node contains all the variables that shield the node from the rest of the network. In a Bayesian network, the values of the parents and children of a node evidently give information about that node; however, its children's parents also have to be included, because they can be used to explain away the node in question.
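As a sketch of the Bayesian-network case, the blanket can be read straight off a parent map; the five-node graph below is invented for the example.

```python
# A DAG represented as a dict mapping each node to the list of its parents
# (a made-up example graph).
parents = {
    "A": [],
    "B": [],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["D"],
}

def markov_blanket(node):
    """Parents, children, and children's other parents of `node`."""
    children = {v for v, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c] if p != node}
    return set(parents[node]) | children | co_parents

print(sorted(markov_blanket("A")))  # ['B', 'C', 'D']: children C, D and co-parent B
print(sorted(markov_blanket("D")))  # ['A', 'B', 'E']: parents A, B and child E
```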

See also: Moral graph. Bayesian network. A simple Bayesian network: rain influences whether the sprinkler is activated, and both rain and the sprinkler influence whether the grass is wet. A Bayesian network, Bayes network, belief network, Bayes(ian) model, or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Formally, Bayesian networks are DAGs whose nodes represent random variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters, or hypotheses.

Edges represent conditional dependencies; nodes that are not connected represent variables that are conditionally independent of each other. Markov random field. An example of a Markov random field: each edge represents a dependency. In this example, A depends on B and D; B depends on A and D; D depends on A, B, and E; and E depends on D and C. In the domains of physics and probability, a Markov random field (often abbreviated MRF), Markov network, or undirected graphical model is a set of random variables having a Markov property described by an undirected graph.

Definition. Given an undirected graph $G = (V, E)$, a set of random variables $X = (X_v)_{v \in V}$ indexed by $V$ forms a Markov random field with respect to $G$ if it satisfies the local Markov properties: Pairwise Markov property: any two non-adjacent variables are conditionally independent given all other variables, $X_u \perp X_v \mid X_{V \setminus \{u, v\}}$ whenever $\{u, v\} \notin E$. Local Markov property: a variable is conditionally independent of all other variables given its neighbors, $X_v \perp X_{V \setminus \mathrm{cl}(v)} \mid X_{\mathrm{ne}(v)}$, where $\mathrm{ne}(v)$ is the set of neighbors of $v$ and $\mathrm{cl}(v) = \{v\} \cup \mathrm{ne}(v)$ is its closed neighborhood. Global Markov property: any two subsets of variables are conditionally independent given a separating subset, $X_A \perp X_B \mid X_S$, where every path from a node in A to a node in B passes through S.
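These properties can be verified numerically on a small example. The sketch below builds a chain MRF from invented pairwise potentials and checks the pairwise Markov property for the non-adjacent pair (X1, X3).

```python
import numpy as np
from itertools import product

# Numerical check of the pairwise Markov property on a chain MRF
# X1 - X2 - X3 - X4 (binary states, illustrative pairwise potentials).
# X1 and X3 are non-adjacent, so X1 _|_ X3 | (X2, X4) should hold.
psi = np.array([[2.0, 1.0],
                [1.0, 3.0]])  # shared edge potential, made up for the demo

# Build the joint distribution table by multiplying edge potentials.
joint = np.zeros((2, 2, 2, 2))
for x in product([0, 1], repeat=4):
    x1, x2, x3, x4 = x
    joint[x] = psi[x1, x2] * psi[x2, x3] * psi[x3, x4]
joint /= joint.sum()

# For each assignment of the conditioning set (X2, X4), check that the
# conditional over (X1, X3) factorizes into the product of its marginals.
for x2, x4 in product([0, 1], repeat=2):
    cond = joint[:, x2, :, x4]
    cond = cond / cond.sum()              # p(x1, x3 | x2, x4)
    outer = np.outer(cond.sum(axis=1),    # p(x1 | x2, x4)
                     cond.sum(axis=0))    # p(x3 | x2, x4)
    assert np.allclose(cond, outer), (x2, x4)
print("pairwise Markov property verified: X1 _|_ X3 | (X2, X4)")
```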

Multi-armed bandit. A row of slot machines in Las Vegas. In probability theory, the multi-armed bandit problem (sometimes called the K-armed[1] or N-armed bandit problem[2]) is a problem in which a gambler at a row of slot machines (sometimes known as "one-armed bandits") has to decide which machines to play, how many times to play each machine, and in which order to play them.[3] When played, each machine provides a random reward drawn from a probability distribution specific to that machine.

The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.[4][5] Herbert Robbins, realizing the importance of the problem, constructed convergent population selection strategies in 1952 in "Some aspects of the sequential design of experiments".[6] The Gittins index, first published by John C. Gittins, gives an optimal policy for maximizing the expected discounted reward.[7] The multi-armed bandit model. The model can be seen as a set of real reward distributions, one associated with each of the $K$ levers.
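As a minimal illustration of the trade-off between exploring levers and exploiting the best estimate, here is an ε-greedy simulation sketch; the strategy, the Bernoulli reward probabilities, and the horizon are illustrative assumptions, and the regret it prints is defined formally just below.

```python
import random

def run_bandit(true_means, n_rounds=10_000, epsilon=0.1, seed=0):
    """Epsilon-greedy play of Bernoulli levers with the given success probabilities."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k          # pulls per lever
    estimates = [0.0] * k     # running mean reward per lever
    total_reward = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return total_reward

true_means = [0.2, 0.5, 0.7]  # invented lever payoff probabilities
reward = run_bandit(true_means)
best = max(true_means)
print(f"total reward: {reward:.0f}; regret vs always-best: {best * 10_000 - reward:.0f}")
```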

The regret $\rho$ after $T$ rounds is defined as $\rho = T\mu^* - \sum_{t=1}^{T} \hat{r}_t$, where $\mu^* = \max_k \{\mu_k\}$ is the maximal reward mean and $\hat{r}_t$ is the reward in round $t$. Contextual Bandits. One of the fundamental underpinnings of the internet is advertising-based content. This has become much more effective due to targeted advertising, where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways, since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised; in particular, a click-or-not on one ad doesn't generally tell you if a different ad would have been clicked on.

This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is "k-Armed Bandits", often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. As information is accumulated over multiple rounds, a good policy might converge on a good choice of arm (i.e. ad). Gaussian process. Gaussian processes are important in statistical modelling because of properties inherited from the normal distribution. For example, if a random process is modelled as a Gaussian process, the distributions of various derived quantities can be obtained explicitly. Such quantities include: the average value of the process over a range of times; the error in estimating the average using sample values at a small set of times.

Definition. Some authors[3] assume the random variables $X_t$ have mean zero; this greatly simplifies calculations without loss of generality and allows the mean square properties of the process to be entirely determined by the covariance function $K$.[4] Alternative definitions. Alternatively, a process is Gaussian if and only if for every finite set of indices $t_1, \ldots, t_k$ in the index set $T$, the vector $(X_{t_1}, \ldots, X_{t_k})$ is a multivariate Gaussian random variable. Equivalently, $\{X_t\}$ is Gaussian if and only if, for every finite set of indices $t_1, \ldots, t_k$, there are real-valued $\sigma_{\ell j}$ and $\mu_\ell$ with $\sigma_{\ell \ell} > 0$ such that $\mathrm{E}\left[\exp\left(i \sum_{\ell} s_\ell X_{t_\ell}\right)\right] = \exp\left(-\tfrac{1}{2} \sum_{\ell, j} \sigma_{\ell j} s_\ell s_j + i \sum_{\ell} \mu_\ell s_\ell\right)$ for all real $s_1, \ldots, s_k$. The numbers $\sigma_{\ell j}$ and $\mu_\ell$ are the covariances and means of the variables in the process.
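Since every finite restriction of a Gaussian process is multivariate Gaussian, sample paths can be drawn by sampling that finite-dimensional distribution. Here is a sketch with an assumed squared-exponential covariance function; the kernel choice and its length-scale are illustrative, not from the source.

```python
import numpy as np

def sq_exp_kernel(s, t, length_scale=0.5):
    """Squared-exponential covariance K(s, t) = exp(-(s - t)^2 / (2 * l^2))."""
    return np.exp(-((s - t) ** 2) / (2 * length_scale ** 2))

ts = np.linspace(0.0, 1.0, 50)                       # a finite set of indices
K = sq_exp_kernel(ts[:, None], ts[None, :])          # covariance matrix
K += 1e-9 * np.eye(len(ts))                          # jitter for numerical stability

# By definition, (X_{t_1}, ..., X_{t_k}) is multivariate Gaussian: sample it
# directly, here with mean zero as in the simplification above.
rng = np.random.default_rng(0)
paths = rng.multivariate_normal(mean=np.zeros(len(ts)), cov=K, size=3)
print(paths.shape)  # (3, 50): three sample paths evaluated at the 50 indices
```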

Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on the performance of learning algorithms. Background. Consider the following situation, which is a general setting of many supervised learning problems. We have two spaces of objects $X$ and $Y$ and would like to learn a function $h : X \to Y$ (often called a hypothesis) which outputs an object $y \in Y$ given $x \in X$, where $x$ is an input and $y$ is the corresponding response that we wish to get from $h(x)$. To put it more formally, we assume that there is a joint probability distribution $P(x, y)$ over $X \times Y$, and that the training set consists of $n$ instances $(x_1, y_1), \ldots, (x_n, y_n)$ drawn i.i.d. from $P(x, y)$. Note that $y$ is not a deterministic function of $x$, but rather a random variable with conditional distribution $P(y \mid x)$ for a fixed $x$. We also assume that we are given a non-negative real-valued loss function $L(\hat{y}, y)$ which measures how different the prediction $\hat{y}$ of a hypothesis is from the true outcome $y$. The risk associated with hypothesis $h$ is then defined as the expectation of the loss function: $R(h) = \mathbf{E}[L(h(x), y)]$.
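A minimal sketch of the principle: with an invented noisy-threshold data source and a hypothesis class of threshold classifiers, ERM simply picks the hypothesis with the least average loss on the sample.

```python
import random

# Invented data source: the true label is a threshold at 0.6, flipped with
# probability 0.1, so y is a random variable given x rather than a function of x.
rng = random.Random(0)
train = []
for _ in range(200):
    x = rng.uniform(0.0, 1.0)
    clean = 1 if x > 0.6 else 0
    y = clean if rng.random() < 0.9 else 1 - clean
    train.append((x, y))

def zero_one_loss(y_hat, y):
    return 0.0 if y_hat == y else 1.0

def empirical_risk(theta, data):
    """Average loss of the threshold classifier h(x) = 1[x > theta] on the sample."""
    return sum(zero_one_loss(1 if x > theta else 0, y) for x, y in data) / len(data)

# Hypothesis class: thresholds on a grid; ERM picks the grid point of least risk.
thetas = [i / 100 for i in range(101)]
best = min(thetas, key=lambda th: empirical_risk(th, train))
print(f"ERM threshold ~ {best:.2f}, empirical risk = {empirical_risk(best, train):.3f}")
```

The empirical risk is the sample average standing in for the expectation $R(h)$, which is why its minimizer approaches the best hypothesis as the sample grows.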

VC dimension. Informally, the capacity of a classification model is related to how complicated it can be. For example, consider the thresholding of a high-degree polynomial: if the polynomial evaluates above zero, that point is classified as positive, otherwise as negative. A high-degree polynomial can be wiggly, so it can fit a given set of training points well. But one can expect that the classifier will make errors on other points, because it is too wiggly.

Such a polynomial has a high capacity. A much simpler alternative is to threshold a linear function. Shattering. A classification model $f$ with some parameter vector $\theta$ is said to shatter a set of data points $(x_1, \ldots, x_n)$ if, for every assignment of labels to those points, there exists a $\theta$ such that the model $f$ makes no errors when evaluating that set of data points. The VC dimension of a model $f$ is the maximum number of points that can be arranged so that $f$ shatters them; formally, it is the maximum $h$ such that some data point set of cardinality $h$ can be shattered by $f$. (A computational check of shattering with linear classifiers appears after the topic list below.) Hoeffding's inequality. Chebyshev's inequality. Doob martingale. Rademacher complexity. Bartlett, Bousquet, Mendelson: Local Rademacher complexities. M-estimator. Minimax. Principal component analysis. Independent component analysis. Johnson–Lindenstrauss lemma. Binary classification. LPBoost. AdaBoost. Support vector machine. Decision tree. Logistic regression. Linear regression. Kernel methods.
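Returning to the shattering definition above: whether linear classifiers $\mathrm{sign}(w \cdot x + b)$ shatter a point set can be checked by testing every labeling for separability with a small feasibility LP. The point sets are illustrative, and SciPy's linprog is assumed available; in the plane, three points in general position can be shattered but no four points can.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    """Feasible iff some (w, b) satisfies y_i * (w . x_i + b) >= 1 for all i."""
    n_dim = points.shape[1]
    # Variables: (w_1, ..., w_d, b). Constraints: -y_i * (x_i . w + b) <= -1.
    A_ub = -labels[:, None] * np.hstack([points, np.ones((len(points), 1))])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(n_dim + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n_dim + 1))
    return res.success

def shattered(points):
    """True if every +/-1 labeling of the points is linearly separable."""
    return all(separable(points, np.array(labels))
               for labels in product([-1, 1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print("3 points shattered:", shattered(three))  # True
print("4 points shattered:", shattered(four))   # False (e.g. the XOR labeling)
```

The failing labeling for the four corner points is the XOR pattern, the standard witness that the VC dimension of linear classifiers in the plane is 3.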