
Data mining

Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.[1] It is an interdisciplinary subfield of computer science and statistics with the overall goal of extracting information (with intelligent methods) from a data set and transforming it into a comprehensible structure for further use.[1][2][3][4] Data mining is the analysis step of the "knowledge discovery in databases" (KDD) process.[5] Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[1]

Etymology

In the 1960s, statisticians and economists used terms like data fishing or data dredging to refer to what they considered the bad practice of analyzing data without an a priori hypothesis.
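To make the kind of pattern discovery described above concrete, here is a minimal sketch of frequent-itemset counting, the core idea behind market-basket algorithms such as Apriori. The transactions, item names, and support threshold are invented for the example; real data mining systems add candidate pruning, scalability, and interestingness measures on top of this.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Count item pairs that co-occur in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Sort so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Toy market-basket data (hypothetical).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"milk", "butter"},
]
print(frequent_pairs(baskets, min_support=2))
# {('bread', 'milk'): 2, ('butter', 'milk'): 2}
```

The discovered pairs ("customers who buy milk often buy bread") are examples of the comprehensible structures that data mining aims to extract for further use.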

Related: Aristotle, Organon

Irrelevant conclusion

Irrelevant conclusion should not be confused with formal fallacy, an argument whose conclusion does not follow from its premises. Ignoratio elenchi is one of the fallacies identified by Aristotle in his Organon.

Carrot2

Carrot²[1] is an open source search results clustering engine.[2] It can automatically cluster small collections of documents, e.g. search results or document abstracts, into thematic categories. Apart from two specialized search results clustering algorithms, Carrot² offers ready-to-use components for fetching search results from various sources. Carrot² is written in Java and distributed under the BSD license. The initial version of Carrot² was implemented in 2001 by Dawid Weiss as part of his MSc thesis, to validate the applicability of the STC clustering algorithm to clustering search results in Polish.[3] In 2003, a number of other search results clustering algorithms were added, including Lingo,[4] a novel text clustering algorithm designed specifically for clustering of search results.
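Carrot²'s actual algorithms (STC, Lingo) are considerably more sophisticated, but the general idea of grouping search results into labeled thematic categories can be sketched in a toy form. The snippets, stopword list, and most-frequent-term labeling rule below are invented for illustration and are not how Carrot² itself works:

```python
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "of", "and", "in", "for"}

def cluster_by_top_term(snippets):
    """Assign each snippet to a cluster labeled by its most frequent
    non-stopword term -- a toy stand-in for algorithms like STC or Lingo."""
    clusters = defaultdict(list)
    for snippet in snippets:
        terms = [w for w in snippet.lower().split() if w not in STOPWORDS]
        label = Counter(terms).most_common(1)[0][0]
        clusters[label].append(snippet)
    return dict(clusters)

results = ["java tutorial for beginners",
           "java collections guide",
           "python data analysis"]
print(cluster_by_top_term(results))
# {'java': ['java tutorial for beginners', 'java collections guide'],
#  'python': ['python data analysis']}
```

Real search-results clustering must also produce human-readable cluster labels from phrases, handle overlapping topics, and work on very short, noisy text, which is what makes algorithms like Lingo non-trivial.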

Data warehouse

In computing, a data warehouse (DW, DWH), or enterprise data warehouse (EDW), is a database used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, the data warehouse. Data warehouses store current and historical data and are used to create trending reports for senior management, such as annual and quarterly comparisons. The data stored in the warehouse is uploaded from operational systems (such as marketing and sales).

Red herring

A red herring is something that misleads or distracts from a relevant or important question.[1] It may be either a logical fallacy or a literary device that leads readers or audiences toward a false conclusion. A red herring may be used intentionally, as in mystery fiction or as part of rhetorical strategies (e.g., in politics), or may be used in argumentation inadvertently. The term was popularized in 1807 by English polemicist William Cobbett, who told a story of having used a kipper (a strong-smelling smoked fish) to divert hounds from chasing a hare.

General Architecture for Text Engineering

The GATE community and research effort has been involved in several European research projects including TAO, SEKT, NeOn, Media-Campaign, Musing, Service-Finder, LIRICS and KnowledgeWeb, as well as many other projects. As of May 28, 2011, 881 people were on the gate-users mailing list, and 111,932 downloads from SourceForge had been recorded since the project moved to SourceForge in 2005.[3] The paper "GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications"[4] has received over 800 citations in the seven years since publication (according to Google Scholar). Books covering the use of GATE, in addition to the GATE User Guide,[5] include "Building Search Applications: Lucene, LingPipe, and Gate" by Manu Konchady[6] and "Introduction to Linguistic Annotation and Text Analytics" by Graham Wilcock.[7] JAPE transducers are used within GATE to manipulate annotations on text.

Knowledge extraction

Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL (data warehousing), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data.
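As a minimal sketch of that idea (not any standard tool's API), one might map a relational row to subject-predicate-object triples, minting a stable identifier from the table name and primary key so that existing identifiers can be reused across extractions. The base URI, table name, and example record here are all hypothetical:

```python
def row_to_triples(table, row, base="http://example.org/"):
    """Map one relational row to (subject, predicate, object) triples,
    minting the subject URI from the table name and the row's primary key."""
    subject = f"{base}{table}/{row['id']}"
    return [(subject, f"{base}{table}#{col}", str(val))
            for col, val in row.items() if col != "id"]

person = {"id": 7, "name": "Ada Lovelace", "born": 1815}
for triple in row_to_triples("person", person):
    print(triple)
```

The point of the exercise is the one made in the text: unlike plain ETL, the output is a graph of identified resources that an inference engine could consume, not just another relational schema.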

Post hoc ergo propter hoc

Post hoc ergo propter hoc (Latin: "after this, therefore because of this") is a logical fallacy (of the questionable cause variety) that states, "Since event Y followed event X, event Y must have been caused by event X." It is often shortened to simply post hoc. It is subtly different from the fallacy cum hoc ergo propter hoc (correlation does not imply causation), in which two things or events occur simultaneously or the chronological ordering is insignificant or unknown. Post hoc is a particularly tempting error because temporal sequence appears to be integral to causality. The fallacy lies in coming to a conclusion based solely on the order of events, rather than taking into account other factors that might rule out the connection.

download/index

On this page you can find the latest stable release of GATE Developer and Embedded, as well as the latest nightly-built snapshots. For other GATE products, follow the links to the source code from our SourceForge pages. NOTE: if you are upgrading from one version of GATE to another, you must delete your user configuration file before running the new version.

Knowledge retrieval

Knowledge retrieval seeks to return information in a structured form, consistent with human cognitive processes, as opposed to simple lists of data items. It draws on a range of fields including epistemology (theory of knowledge), cognitive psychology, cognitive neuroscience, logic and inference, machine learning and knowledge discovery, linguistics, and information technology.

Straw man

A straw man is a common form of argument and an informal fallacy based on giving the impression of refuting an opponent's argument while actually refuting an argument that was not presented by that opponent.[1] One who engages in this fallacy is said to be "attacking a straw man." The typical straw man argument creates the illusion of having completely refuted or defeated an opponent's proposition through the covert replacement of it with a different proposition (i.e., "standing up a straw man") and the subsequent refutation of that false argument ("knocking down a straw man") instead of the opponent's proposition.[2][3] This technique has been used throughout history in polemical debate, particularly in arguments about highly charged emotional issues, where a fiery "battle" and the defeat of an "enemy" may be more valued than critical thinking or an understanding of both sides of the issue.

UIMA

UIMA is a component software architecture, developed by IBM, for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies. The source code for a reference implementation of this framework has been made available on SourceForge, and later on the website of the Apache Software Foundation. Another use of UIMA is in systems that analyze clinical notes in medical contexts, such as the Clinical Text Analysis and Knowledge Extraction System (CTAKES). The UIMA architecture can be thought of in four dimensions.

Information retrieval

Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing. Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals, and other documents. Web search engines are the most visible IR applications.
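The full-text indexing mentioned above is commonly implemented with an inverted index, which maps each term to the set of documents containing it. A minimal sketch, with toy documents and conjunctive queries only (no ranking, stemming, or stopword handling), might look like this:

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Conjunctive (AND) query: return documents containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "data mining finds patterns",
        2: "information retrieval finds documents",
        3: "mining documents for information"}
index = build_index(docs)
print(sorted(search(index, "mining")))           # [1, 3]
print(sorted(search(index, "finds documents")))  # [2]
```

Production IR systems layer relevance ranking (e.g. TF-IDF or BM25 scoring), text normalization, and compressed posting lists on top of this basic structure.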

Begging the question

Begging the question is an informal fallacy that occurs when an argument's premises assume the truth of the conclusion instead of supporting it. It is a type of circular reasoning: an argument that requires that the desired conclusion be true. This often occurs in an indirect way such that the fallacy's presence is hidden, or at least not easily apparent. The phrase begging the question originated in the 16th century as a mistranslation of the Latin petitio principii, which actually translates to "assuming the initial point".[1] In modern vernacular usage, "begging the question" is often[2] used to mean "raising the question" or "dodging the question".[1] In contexts that demand strict adherence to a technical definition of the term, many consider these usages incorrect.[3]

Current Projects: Semantic Information Management

Semantic Information Retrieval (SIR-3): This project systematically investigates the semantic and lexical relationships between words and concepts, and their usefulness in information retrieval (IR) tasks.

data mining: The process of exploring and analyzing large amounts of data to find patterns. Found in: Hurwitz, J., Nugent, A., Halper, F. & Kaufman, M. (2013) Big Data For Dummies. Hoboken, New Jersey, United States of America: For Dummies. ISBN: 9781118504222. by raviii Jan 1

Wiki, but a great starting point for Data Mining -Josh by fritzjl Mar 28

Data mining, a branch of computer science,[1] is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. by agnesdelmotte Mar 24

Related: data mining