

Pattern is a web mining module for the Python programming language. It has tools for data mining (the Google, Twitter and Wikipedia APIs, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization. The module is free, well-documented and bundled with 50+ examples and 350+ unit tests.

Installation: Pattern is written for Python 2.5+ (no support for Python 3 yet). To install Pattern so that the module is available in all Python scripts, from the command line do:

> cd pattern-2.6
> python setup.py install

If you have pip, you can automatically download and install from the PyPI repository. If none of the above works, you can make Python aware of the module in three ways.

Quick overview: pattern.web, pattern.en (a natural language processing toolkit for English), pattern.vector.
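The three ways of making Python aware of the module are not enumerated in this excerpt. One common approach, sketched below under the assumption that the unzipped pattern-2.6 folder lives at a hypothetical path, is to append its location to sys.path at runtime:

```python
import sys

# Hypothetical location of the unzipped pattern-2.6 folder; adjust to your system.
MODULE_DIR = "/home/user/pattern-2.6"

# Appending the folder to sys.path lets `import pattern` succeed in this script
# without a system-wide install.
if MODULE_DIR not in sys.path:
    sys.path.append(MODULE_DIR)
```

The same effect can be had persistently by setting the PYTHONPATH environment variable.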

100 days of web mining. In this experiment, we collected Google News stories at regular 1-hour intervals between November 22, 2010, and March 8, 2011, resulting in a set of 6,405 news stories. We grouped these per day and then determined the top daily keywords using tf-idf, a measure of a word's uniqueness or importance. For example: if the word news is mentioned every day, it is not particularly unique on any single given day. To set up the experiment we used the Pattern web mining module for Python. The basic script is simple enough; your code will probably have some preprocessing steps to save and load the mined news updates. In the image below, important words (i.e., events) that occurred across multiple days are highlighted (we took a word's document frequency as an indication). Simultaneously, we mined Twitter messages containing the words I love or I hate: 35,784 love-tweets and 35,212 hate-tweets in total. Daily drudge: here are the top keywords of hate-tweets grouped by day:
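The mining script itself is not reproduced in this excerpt. The tf-idf weighting it relies on can be sketched in plain Python (a hypothetical tf_idf helper over toy per-day word lists, not the pattern.vector implementation):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each word in each document by term frequency * inverse document
    frequency. `docs` is a list of token lists (one per day); returns one
    {word: score} dict per document."""
    n = len(docs)
    # Document frequency: the number of days a word appears in.
    df = Counter(w for doc in docs for w in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({w: (c / total) * math.log(n / df[w]) for w, c in tf.items()})
    return scores

days = [
    "news egypt protest news".split(),
    "news royal wedding".split(),
    "news egypt mubarak".split(),
]
scores = tf_idf(days)
# "news" appears on every day, so its idf is log(3/3) = 0 and its score is 0,
# while a day-specific word such as "protest" gets a positive score.
```

This matches the intuition in the text: words that occur every day carry no weight, while words unique to a few days bubble up as that day's keywords.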

Time Series analysis tsa — statsmodels 0.7.0 documentation. statsmodels.tsa contains model classes and functions that are useful for time series analysis. This currently includes univariate autoregressive models (AR), vector autoregressive models (VAR) and univariate autoregressive moving-average models (ARMA). It also includes descriptive statistics for time series, for example the autocorrelation function, partial autocorrelation function and periodogram, as well as the corresponding theoretical properties of ARMA and related processes, and methods to work with autoregressive and moving-average lag polynomials. Estimation is done either by exact or conditional maximum likelihood or by conditional least squares, using either a Kalman filter or direct filters. Currently, functions and classes have to be imported from the corresponding module, but the main classes will be made available in the statsmodels.tsa namespace. Some related functions are also available in matplotlib, nitime, and scikits.talkbox.
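To make the "descriptive statistics" part concrete, here is a minimal, self-contained sketch of the sample autocorrelation statistic that such a toolkit computes (plain Python on a toy series, not statsmodels code):

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of the series x at the given lag:
    covariance of x with a lagged copy of itself, normalized by the variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

# A toy series with a rising-and-falling pattern.
series = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0]
r0 = autocorrelation(series, 0)  # always exactly 1: a series correlates with itself
r1 = autocorrelation(series, 1)  # positive: neighbouring values move together
```

Plotting this statistic across many lags gives the correlogram used to choose AR and MA orders.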

pattern.web. The pattern.web module has tools for online data mining: asynchronous requests, a uniform API for web services (Google, Bing, Twitter, Facebook, Wikipedia, Wiktionary, Flickr, RSS), an HTML DOM parser, HTML tag stripping functions, a web crawler, webmail, caching and Unicode support. It can be used by itself or with other Pattern modules: web | db | en | search | vector | graph. URLs: The URL object is a subclass of Python's urllib2.Request that can be used to connect to a web address. GET: query data is encoded in the URL string (usually for retrieving data). POST: query data is encoded in the message body (for posting data). URL() expects a string that starts with a valid protocol (e.g., http://). The example below downloads an image. URL downloads: the download() function takes a URL string, calls URL.download() and returns the retrieved data. URL mime-type: URL.mimetype can be used to check the type of document at the given URL. Further sections cover URL exceptions, the user-agent and referrer, and finding URLs.
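The GET/POST distinction described above can be illustrated with the standard library alone (a sketch using urllib.parse and a hypothetical example.com endpoint, not pattern.web's URL class):

```python
from urllib.parse import urlencode

query = {"q": "web mining", "start": 0}

# GET: the encoded query is appended to the URL string itself.
get_url = "http://example.com/search?" + urlencode(query)

# POST: the same encoded query travels in the request body instead,
# leaving the URL string bare.
post_url = "http://example.com/search"
post_body = urlencode(query).encode("utf-8")
```

Either form could then be passed to urllib.request.urlopen; pattern.web wraps the same mechanics behind its URL object.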

Grammatical Features: Aspect (Anna Kibort). 1. What is 'aspect'? The term 'aspect' designates the perspective taken on the internal temporal organisation of the situation, and so 'aspects' distinguish different ways of viewing the internal temporal constituency of the same situation (Comrie 1976:3ff, after Holt 1943:6; Bybee 2003:157). The aspectual meaning of a clause can be broken up into two independent aspectual components (Smith 1991/1997): aspectual viewpoint, the temporal perspective from which the situation is presented, and situation type. The aspectual meaning of a clause results from the interaction of aspectual viewpoint and situation type. 2. Aspectual characteristics are coded in a wide range of ways: lexical, derivational or inflectional; synthetic ('morphological') and analytic ('syntactic'). Verbs tend to have inherent aspectual meaning because the situations described by them tend to have inherent temporal properties.

Beautiful Soup Documentation — Beautiful Soup v4.0.0 documentation. Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. These instructions illustrate all major features of Beautiful Soup 4, with examples. The examples in this documentation should work the same way in Python 2.7 and Python 3.2. You might be looking for the documentation for Beautiful Soup 3. This documentation has been translated into other languages by Beautiful Soup users; Chinese, Japanese and Korean translations are available. The documentation uses an HTML document, the "three sisters" example, throughout. Running that document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure with simple ways to navigate it. One common task is extracting all the URLs found within a page's <a> tags.
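The URL-extraction task mentioned above can also be sketched with only the standard library, for readers without Beautiful Soup installed (a toy two-link document standing in for the "three sisters" example):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = ('<p><a href="http://example.com/elsie">Elsie</a> and '
        '<a href="http://example.com/lacie">Lacie</a></p>')
parser = LinkExtractor()
parser.feed(html)
# parser.links now holds both href values, in document order.
```

Beautiful Soup expresses the same task more compactly (soup.find_all("a") plus the tags' href attributes), and copes far better with malformed real-world markup.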