
50+ Data Science and Machine Learning Cheat Sheets

Get up to speed and keep Data Science and Data Mining concepts and commands handy with these cheat sheets covering R, Python, Django, MySQL, SQL, Hadoop, Apache Spark and machine learning algorithms. Cheat sheets on Python, R, NumPy, SciPy and pandas: There are thousands of packages and hundreds of functions out there in the data science world! An aspiring data enthusiast need not know them all. Mastering data science involves an understanding of statistics, mathematics and programming, especially in R, Python and SQL, and then combining all of these to derive insights using business understanding and the human instinct that drives decisions. Here are the cheat sheets by category: Cheat sheets for Python: Python is a popular choice for beginners, yet still powerful enough to back some of the world’s most popular products and applications. Cheat sheets for R: The R ecosystem has been expanding so much that a lot of referencing is needed. Share more & Learn!

Python spell-checker for Twitter stream · GitHub This is a simple Python program that streams tweets from two locations, London and Exeter in our example, and compares which one has the greater number of spelling mistakes.
1 – Set-up used:
* Ubuntu 11.04 Natty AMD64
* Python 2.7.3
* Python re library
* Python nltk 2.0 library and the required NumPy and PyYaml (for NLP tasks)
* Python tweetstream 1.1.1 library (for Twitter manipulation)
* Python enchant 1.6 library (for spelling verification)
Installation instructions:
* Python and Python installation packages: from the command prompt run: sudo apt-get install python python-pip python-setuptools
* NLTK, NumPy, PyYaml library: from the command prompt run: sudo pip install -U numpy and then sudo pip install -U pyyaml nltk
* Test installation: run python then type import nltk
* Tweetstream 1.1.1: Download at: decompress the file, enter the directory and run: sudo python install
* Python Enchant: Download at:
Assumptions: Analysis assumptions:
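The comparison described above can be sketched roughly as follows. This is only an illustration, not the gist's code: the original checks words with the enchant library, whereas this sketch swaps in a plain Python set (`known_words`) so that it runs without extra installs. The function name `misspelling_rate` and the sample tweets are hypothetical.

```python
import re

def misspelling_rate(tweets, known_words):
    """Return the fraction of word tokens that are not in known_words."""
    total = 0
    misspelled = 0
    for tweet in tweets:
        # Drop @mentions, #hashtags and URLs, which are not dictionary words.
        text = re.sub(r"@\w+|#\w+|https?://\S+", " ", tweet.lower())
        # Keep alphabetic tokens, allowing one inner apostrophe (e.g. "don't").
        for token in re.findall(r"[a-z]+(?:'[a-z]+)?", text):
            total += 1
            if token not in known_words:
                misspelled += 1
    return misspelled / total if total else 0.0

# Hypothetical stand-in for a real dictionary such as the one enchant provides.
known_words = {"the", "weather", "is", "lovely", "today",
               "i", "love", "this", "city"}

london = ["The weather is lovley today", "I love this city"]
exeter = ["The wether is luvly today @bob",
          "I love this city http://example.com"]

rates = {"London": misspelling_rate(london, known_words),
         "Exeter": misspelling_rate(exeter, known_words)}
worst = max(rates, key=rates.get)
```

With a real dictionary, only the membership test `token not in known_words` would need to change.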

Data Model Prototype | Computational Urban Design Research Studio | Page 5 Last semester we used two kinds of clustering algorithms for our analysis. The first is distance-based clustering; the second is grid-based clustering. Although they are logically very similar, since both form clusters based on distances, they go about it differently and their results can differ. Below is the logic of these 2 algorithms. A. distance based clustering: 1. 2. B. 1. 2. 3. Below is the SQL for grid-based clustering:
WITH clstrtags AS (
  SELECT *, tag.geom as tgeom
  FROM gridcluster(30,'urbantag','geom') as grid
  JOIN urbantag as tag
  ON st_contains(st_setsrid(grid.geom,3435),st_setsrid(tag.geom,3435))
  ORDER BY rid,cid ),
counts AS (SELECT count(tagid) as count, clusterid, activity FROM clstrtags GROUP BY clusterid, activity),
countss AS (SELECT count(tagid) as count, clusterid FROM clstrtags GROUP BY clusterid)
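As a rough illustration of the grid-based idea, the sketch below assigns each point to the square grid cell that contains it and then counts points per cell, which is similar in spirit to the `countss` step in the SQL. It is not the studio's `gridcluster` function: the function name `grid_cluster`, the sample points, and the reading of 30 as a cell size are all assumptions made for this example.

```python
import math
from collections import defaultdict

def grid_cluster(points, cell_size):
    """Group (x, y) points by the square grid cell that contains them."""
    cells = defaultdict(list)
    for x, y in points:
        # The cell index is the floor of the coordinate divided by the cell size,
        # so negative coordinates fall into negative cell indices.
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append((x, y))
    return dict(cells)

# Hypothetical tag locations in projected (planar) coordinates.
points = [(1, 1), (5, 20), (31, 2), (59, 29), (-1, 0)]
clusters = grid_cluster(points, cell_size=30)

# Number of tags per cluster.
counts = {cell: len(members) for cell, members in clusters.items()}
```

Because cell membership depends only on each point's own coordinates, the result does not depend on the order in which points are processed, which is one way the grid approach can differ from distance-based clustering.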

Academic Phrasebank – Referring to Sources One of the distinguishing features of academic writing is that it is informed by what is already known, what work has been done before, and/or what ideas and models have already been developed. Thus, in academic texts, writers frequently make reference to other studies and to the work of other authors. It is important that writers guide their readers through this literature. A note on referencing style: The way a writer refers to other sources varies somewhat across different disciplines. A note on verb tenses: For general reference to the literature, the present perfect tense (have/has + verb participle) tends to be used. Research into X has a long history. Most research on X has been carried out in … Most researchers investigating X have utilised … Using this approach, researchers have been able to … Several systematic reviews of X have been undertaken. X, Y and Z appear to be closely linked (Smith, 2008).

nlp - python module to remove internet jargon/slang/acronym

Graph theory Refer to the glossary of graph theory for basic definitions in graph theory. Definitions: Definitions in graph theory vary. The following are some of the more basic ways of defining graphs and related mathematical structures. Graph: In the most common sense of the term,[1] a graph is an ordered pair G = (V, E) comprising a set V of vertices or nodes together with a set E of edges or lines, which are 2-element subsets of V. Other senses of graph stem from different conceptions of the edge set. In one more general notion, E is a set together with a relation of incidence that associates with each edge two vertices. In another, E is a multiset of unordered pairs of (not necessarily distinct) vertices. All of these variants and others are described more fully below. The vertices belonging to an edge are called the ends, endpoints, or end vertices of the edge. V and E are usually taken to be finite, and many of the well-known results are not true (or are rather different) for infinite graphs because many of the arguments fail in the infinite case.
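The most common definition above (edges as 2-element subsets of the vertex set) maps quite directly onto Python sets. The following is a minimal sketch under that definition only; the helper names `make_graph` and `neighbors` are made up for this example, and the other senses of "graph" mentioned above (incidence relations, multisets) are not covered.

```python
def make_graph(vertices, edges):
    """Build a simple undirected graph (V, E) where each edge is a 2-element subset of V."""
    V = frozenset(vertices)
    E = frozenset(frozenset(edge) for edge in edges)
    for edge in E:
        # A loop such as (1, 1) collapses to a 1-element set and is rejected here.
        if len(edge) != 2 or not edge <= V:
            raise ValueError(f"not a 2-element subset of V: {sorted(edge)}")
    return V, E

def neighbors(graph, v):
    """Return the vertices that share an edge with v."""
    _, E = graph
    return {u for edge in E if v in edge for u in edge if u != v}

# (1, 2) and (2, 1) are the same unordered pair, so E ends up with two edges.
G = make_graph({1, 2, 3}, [(1, 2), (2, 3), (2, 1)])
V, E = G
```

Using `frozenset` for edges makes the pair unordered, so the ends of an edge are simply its two elements.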

Online Statistics Education: A Free Resource for Introductory Statistics Developed by Rice University (Lead Developer), University of Houston Clear Lake, and Tufts University OnlineStatBook Project Home This work is in the public domain. Therefore, it can be copied and reproduced without limitation. However, we would appreciate a citation where possible. If you are an instructor using these materials, I can send you an instructor's manual, PowerPoint Slides, and additional questions that may be helpful to you. Table of Contents Mobile This version uses formatting that works better for mobile devices. Rice Virtual Lab in Statistics This is the original classic with all the simulations and case studies. Version in PDF e-Pub (e-book) Partial support for this work was provided by the National Science Foundation's Division of Undergraduate Education through grants DUE-9751307, DUE-0089435, and DUE-0919818.

text - Creating Vocabulary in python

First Order Inductive Learner In machine learning, First Order Inductive Learner (FOIL) is a rule-based learning algorithm. Algorithm: The FOIL algorithm is as follows:
Input: List of examples
Output: Rule in first-order predicate logic
FOIL(Examples)
    Let Pos be the positive examples
    Let Pred be the predicate to be learned
    Until Pos is empty do:
        Let Neg be the negative examples
        Set Body to empty
        Call LearnClauseBody
        Add Pred ← Body to the rule
        Remove from Pos all examples which satisfy Body
Procedure LearnClauseBody
    Until Neg is empty do:
        Choose a literal L
        Conjoin L to Body
        Remove from Neg examples that do not satisfy L
Example: Suppose FOIL's task is to learn the concept grandfather(X,Y) given the relations father(X,Y) and parent(X,Y). On the next iteration of FOIL after parent(X,Z) has been added, the algorithm will consider all combinations of predicate names and variables such that at least one variable in the new literal is present in the existing clause.
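The control flow of the pseudocode can be tried out on a much simpler setting. The sketch below is a propositional simplification, not FOIL itself: examples are dictionaries of boolean attributes, a "literal" is just an attribute name that must be true, and the literal is chosen by a crude heuristic (keep the fewest negatives, then the most positives) rather than FOIL's information gain. All names and the toy data are assumptions made for illustration.

```python
def satisfies(example, body):
    """An example satisfies a body if every attribute in the body is true for it."""
    return all(example[attr] for attr in body)

def learn_clause_body(pos, neg, attributes):
    """Add literals until no negative example satisfies the body."""
    body = []
    while neg:
        best, best_score = None, None
        for attr in attributes:
            if attr in body:
                continue
            kept_pos = sum(1 for e in pos if e[attr])
            kept_neg = sum(1 for e in neg if e[attr])
            # Only consider literals that keep a positive and exclude a negative.
            if kept_pos == 0 or kept_neg == len(neg):
                continue
            score = (kept_neg, -kept_pos)
            if best is None or score < best_score:
                best, best_score = attr, score
        if best is None:
            raise ValueError("no literal separates the remaining examples")
        body.append(best)
        pos = [e for e in pos if e[best]]
        neg = [e for e in neg if e[best]]
    return body

def foil(pos, neg, attributes):
    """Learn a list of clause bodies that together cover all positive examples."""
    rules = []
    pos = list(pos)
    while pos:
        body = learn_clause_body(pos, list(neg), attributes)
        rules.append(body)
        # Remove the positives that the new clause already covers.
        pos = [e for e in pos if not satisfies(e, body)]
    return rules

positives = [{"a": True, "b": True, "c": False},
             {"a": False, "b": False, "c": True}]
negatives = [{"a": True, "b": False, "c": False},
             {"a": False, "b": True, "c": False}]
rules = foil(positives, negatives, ["a", "b", "c"])
```

On this toy data the learner needs two clauses, because no single conjunction covers both positive examples while excluding both negatives.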

IPython Books - IPython Cookbook IPython Interactive Computing and Visualization Cookbook This advanced-level book covers a wide range of methods for data science with Python: interactive computing in the IPython notebook; high-performance computing with Python; statistics, machine learning and data mining; signal processing and mathematical modeling. Highlights: 500+ pages; 100+ recipes; 15 chapters; each recipe illustrates one method on one real-world example; code for Python 3 (but works fine on Python 2.7); all of the code freely available on GitHub; contribute with issues and pull requests on GitHub. This is an advanced-level book: basic knowledge of IPython, NumPy, pandas, matplotlib is required. Featured recipes A selection of free recipes from the book: Part I: Advanced High-Performance Interactive Computing Part I (chapters 1-6) covers advanced methods in interactive numerical computing, high-performance computing, and data visualization. Chapter 1: A Tour of Interactive Computing with IPython 2.1. Chapter 3: Mastering the Notebook

An Introduction to Text Mining using Twitter Streaming API and Python // Adil Moujahid // Data Analytics and more Text mining is the application of natural language processing techniques and analytical methods to text data in order to derive relevant information. Text mining has been getting a lot of attention in recent years, due to an exponential increase in digital text data from web pages, Google's projects such as Google Books and Google Ngram, and social media services such as Twitter. Twitter data constitutes a rich source that can be used for capturing information about any topic imaginable. This data can be used in different use cases such as finding trends related to a specific keyword, measuring brand sentiment, and gathering feedback about new products and services. In this tutorial, I will use Twitter data to compare the popularity of 3 programming languages: Python, JavaScript and Ruby, and to retrieve links to programming tutorials. In the first paragraph, I will explain how to connect to the Twitter Streaming API and how to get the data. API stands for Application Programming Interface.
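Once tweet texts have been collected, the popularity comparison and the link retrieval could look roughly like this. It is a sketch under assumptions, not the tutorial's code: it works on a hard-coded list of hypothetical tweet texts instead of a live stream, and the helper names `word_in_text`, `count_mentions` and `extract_links` are chosen for this example.

```python
import re

def word_in_text(word, text):
    """True if word occurs in text as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def count_mentions(tweets, keywords):
    """Count how many tweets mention each keyword at least once."""
    return {kw: sum(1 for t in tweets if word_in_text(kw, t)) for kw in keywords}

def extract_links(tweets, keyword):
    """Collect URLs from tweets that mention both the keyword and 'tutorial'."""
    links = []
    for t in tweets:
        if word_in_text(keyword, t) and word_in_text("tutorial", t):
            links.extend(re.findall(r"https?://\S+", t))
    return links

# Hypothetical tweet texts standing in for data from the Streaming API.
tweets = [
    "Learning Python with this tutorial http://example.com/py",
    "JavaScript or Ruby? I prefer python",
    "Rubyist meetup tonight",
    "ruby on rails tutorial",
]

mentions = count_mentions(tweets, ["python", "javascript", "ruby"])
python_links = extract_links(tweets, "python")
```

The word-boundary match keeps "Rubyist" from being counted as a mention of Ruby; whether that is desirable depends on the analysis.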

B+ tree A simple B+ tree example linking the keys 1–7 to data values d1-d7. The linked list (red) allows rapid in-order traversal. This particular tree's branching factor is b = 4. A B+ tree is an N-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.[2] A B+ tree can be viewed as a B-tree in which each node contains only keys (not key–value pairs), and to which an additional level is added at the bottom with linked leaves. Overview: The order, or branching factor, b of a B+ tree measures the capacity of nodes (i.e., the number of children nodes) for internal nodes in the tree. The number of children of an internal node is at least ⌈b/2⌉ and at most b. Algorithms: Search: The root of a B+ tree represents the whole range of values in the tree, where every internal node is a subinterval. We are looking for a value k in the B+ tree. An internal node has at most b children, where every one of them represents a different sub-interval.
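The search procedure and the linked-leaf traversal can be sketched on a small hand-built tree. This is an illustration only: the class names `Leaf` and `Internal`, the exact split of keys 1–7 across three leaves, and the convention that a key equal to a separator goes to the right child are assumptions of this example, and insertion and deletion are not covered.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys, values):
        self.keys = keys        # sorted keys stored in this leaf
        self.values = values    # data value for each key
        self.next = None        # link to the next leaf for in-order scans

class Internal:
    def __init__(self, keys, children):
        self.keys = keys            # separator keys
        self.children = children    # len(children) == len(keys) + 1

def search(node, k):
    """Descend from the root to the leaf whose sub-interval may contain k."""
    while isinstance(node, Internal):
        # Keys >= a separator go to the child on its right.
        node = node.children[bisect_right(node.keys, k)]
    for key, value in zip(node.keys, node.values):
        if key == k:
            return value
    return None

def scan(leaf):
    """Yield all (key, value) pairs in order by following the leaf links."""
    while leaf is not None:
        yield from zip(leaf.keys, leaf.values)
        leaf = leaf.next

# Hand-built tree for keys 1-7 mapped to d1-d7.
l1 = Leaf([1, 2], ["d1", "d2"])
l2 = Leaf([3, 4], ["d3", "d4"])
l3 = Leaf([5, 6, 7], ["d5", "d6", "d7"])
l1.next, l2.next = l2, l3
root = Internal([3, 5], [l1, l2, l3])

found = search(root, 4)
all_keys = [key for key, _ in scan(l1)]
```

Only the leaves hold data values here; the internal node stores separator keys alone, which matches the description of a B+ tree as keys-only above the leaf level.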

Color Wheel Pro: Classic Color Schemes Monochromatic color scheme The monochromatic color scheme uses variations in lightness and saturation of a single color. This scheme looks clean and elegant. Analogous color scheme The analogous color scheme uses colors that are adjacent to each other on the color wheel. Complementary color scheme The complementary color scheme is made of two colors that are opposite each other on the color wheel. When using the complementary scheme, it is important to choose a dominant color and use its complementary color for accents. Split complementary color scheme The split complementary scheme is a variation of the standard complementary scheme. Triadic color scheme The triadic color scheme uses three colors equally spaced around the color wheel. Tetradic (double complementary) color scheme The tetradic (double complementary) scheme is the richest of all the schemes because it uses four colors arranged into two complementary color pairs.
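If hues are treated as angles on a 360-degree wheel, some of these schemes reduce to simple rotations. The sketch below is one possible reading, not Color Wheel Pro's method: "opposite" is taken as a 180-degree rotation, "equally spaced" as 120-degree steps, and the 30-degree step for analogous colors is an arbitrary assumption. The function names are hypothetical.

```python
def rotate(hue, degrees):
    """Rotate a hue angle around the color wheel, wrapping at 360."""
    return (hue + degrees) % 360

def complementary(hue):
    """The base hue and the hue directly opposite it."""
    return [rotate(hue, 0), rotate(hue, 180)]

def triadic(hue):
    """Three hues equally spaced around the wheel."""
    return [rotate(hue, 0), rotate(hue, 120), rotate(hue, 240)]

def analogous(hue, step=30):
    """The base hue and its two neighbors; the step size is an assumption."""
    return [rotate(hue, -step), rotate(hue, 0), rotate(hue, step)]

orange_scheme = complementary(30)
```

The resulting hue angles can be converted to RGB with the standard library's `colorsys.hls_to_rgb` after dividing the hue by 360.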

Steps For Effective Text Data Cleaning The days when one would get data in tabulated spreadsheets are truly behind us. A moment of silence for the data residing in the spreadsheet pockets. Today, more than 80% of the data is unstructured: it is either present in data silos or scattered around the digital archives. Data is being produced as we speak, from every conversation we have on social media to every piece of content generated by news sources. In order to produce any meaningful actionable insight from data, it is important to know how to work with it in its unstructured form. As a Data Scientist at one of the fastest growing Decision Sciences firms, my bread and butter comes from deriving meaningful insights from unstructured text information. One of the first steps in working with text data is to pre-process it. In this blog post, therefore, I discuss these possible noise elements and how you can clean them step by step. Steps for data cleaning: Here is what you do: Snippet: import HTMLParser Output:
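The `HTMLParser` import in the snippet is the Python 2 module, which suggests that one of the cleaning steps is decoding HTML entities. As a hedged sketch of that single step in Python 3, the standard library's `html.unescape` can be used instead; the function name `clean_text`, the extra whitespace normalization, and the sample string are assumptions of this example rather than the post's own code.

```python
import html
import re

def clean_text(raw):
    """Decode HTML entities and collapse runs of whitespace."""
    # "&lt;" becomes "<", "&amp;" becomes "&", and so on.
    text = html.unescape(raw)
    # Collapse repeated spaces, tabs and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()

sample = "I luv my &lt;3 iphone &amp;  you're awsm"
cleaned = clean_text(sample)
```

Further steps, such as expanding slang or removing URLs, would be applied to the decoded text rather than to the raw string.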

Top 10 data mining algorithms in plain R Knowing the top 10 most influential data mining algorithms is awesome. Knowing how to USE the top 10 data mining algorithms in R is even more awesome. That’s when you can slap a big ol’ “S” on your chest… …because you’ll be unstoppable! Today, I’m going to take you step-by-step through how to use each of the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper. By the end of this post… You’ll have 10 insanely actionable data mining superpowers that you’ll be able to use right away. UPDATE 18-Jun-2015: Thanks to Albert for creating the image above! UPDATE 22-Jun-2015: Thanks to Ulf for the fantastic feedback which I’ve included below. Getting Started First, what is R? R is both a language and environment for statistical computing and graphics. R has 2 key selling points: It’s a great environment for manipulating data, but if you’re on the fence between R and Python, lots of folks have compared them. For this post, do 2 things right now: Don’t wait!