Pattern

Pattern is a web mining module for the Python programming language. It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization. The module is free, well-documented and bundled with 50+ examples and 350+ unit tests. Installation Pattern is written for Python 2.5+ (no support for Python 3 yet). To install Pattern so that the module is available in all Python scripts, from the command line do: > cd pattern-2.6 > python setup.py install If you have pip, you can automatically download and install from the PyPI repository: If none of the above works, you can make Python aware of the module in three ways: Quick overview pattern.web pattern.en The pattern.en module is a natural language processing (NLP) toolkit for English. pattern.search pattern.vector Case studies

100 days of web mining In this experiment, we collected Google News stories at regular 1-hour intervals between November 22, 2010, and March 8, 2011, resulting in a set of 6,405 news stories. We grouped these per day and then determined the top daily keywords using tf-idf, a measurement of a word's uniqueness or importance. For example: if the word news is mentioned every day, it is not particularly unique on any single given day. To set up the experiment we used the Pattern web mining module for Python. The basic script is simple enough: Your code will probably have some preprocessing steps to save and load the mined news updates. In the image below, important words (i.e., events) that occurred across multiple days are highlighted (we took a word's document frequency as an indication). Simultaneously, we mined Twitter messages containing the words I love or I hate – 35,784 love-tweets and 35,212 hate-tweets in total. Daily drudge Here are the top keywords of hate-tweets grouped by day:
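The tf-idf ranking described in this experiment can be sketched in plain Python. This is a minimal illustration, not the script used in the experiment and not Pattern's own API: the function name `tfidf_top_keywords`, the toy data and the exact weighting (relative term frequency times the log of inverse document frequency) are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_top_keywords(docs, top=3):
    """Return the highest-scoring words for each document.

    docs maps a label (for example a date) to a list of lowercase words.
    """
    n = len(docs)
    # Document frequency: the number of documents each word appears in.
    df = Counter()
    for words in docs.values():
        df.update(set(words))
    result = {}
    for label, words in docs.items():
        tf = Counter(words)
        total = len(words)
        # A word that appears in every document gets log(1) = 0.
        scores = {w: (c / total) * math.log(n / df[w]) for w, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[label] = [w for w, _ in ranked[:top]]
    return result


# Hypothetical toy data standing in for one day's worth of stories each.
days = {
    "day1": "news storm storm flood news".split(),
    "day2": "news election vote election".split(),
    "day3": "news match match match cup".split(),
}
top_keywords = tfidf_top_keywords(days, top=1)
print(top_keywords)
```

Because the word news occurs on every day in the toy data, its score is zero everywhere, which mirrors the remark above that a word mentioned every day is not unique to any single day.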

Time Series analysis tsa — statsmodels 0.7.0 documentation statsmodels.tsa contains model classes and functions that are useful for time series analysis. This currently includes univariate autoregressive models (AR), vector autoregressive models (VAR) and univariate autoregressive moving average models (ARMA). It also includes descriptive statistics for time series, for example autocorrelation, partial autocorrelation function and periodogram, as well as the corresponding theoretical properties of ARMA or related processes. It also includes methods to work with autoregressive and moving average lag-polynomials. Estimation is either done by exact or conditional Maximum Likelihood or conditional least-squares, either using Kalman Filter or direct filters. Currently, functions and classes have to be imported from the corresponding module, but the main classes will be made available in the statsmodels.tsa namespace. Some related functions are also available in matplotlib, nitime, and scikits.talkbox. Descriptive Statistics and Tests Estimation
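As an illustration of one of the descriptive statistics mentioned above, the sample autocorrelation of a series can be sketched in plain Python. This is not statsmodels code; the function name `autocorrelation` and the normalisation (dividing every lag by the lag-0 sum of squares) are assumptions chosen for the example. The library ships its own implementation in `statsmodels.tsa.stattools`.

```python
def autocorrelation(series, max_lag):
    """Return sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    mean = sum(series) / n
    centered = [x - mean for x in series]
    # Lag-0 sum of squares, used to normalise every lag.
    c0 = sum(d * d for d in centered)
    if c0 == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


# A short hypothetical series with a linear trend.
acf_values = autocorrelation([1, 2, 3, 4, 5], max_lag=2)
print(acf_values)
```

The value at lag 0 is always 1, and the remaining values show how strongly the series correlates with a shifted copy of itself.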

Grammatical Features - Aspect Anna Kibort 1. What is 'aspect' The term 'aspect' designates the perspective taken on the internal temporal organisation of the situation, and so 'aspects' distinguish different ways of viewing the internal temporal constituency of the same situation (Comrie 1976:3ff, after Holt 1943:6; Bybee 2003:157). Aspectual meaning of a clause can be broken up into two independent aspectual components (Smith 1991/1997): Aspectual viewpoint - this is the temporal perspective from which the situation is presented. Aspectual meaning of a clause results from the interaction of aspectual viewpoint and situation type. 2. Aspectual characteristics are coded in a wide range of ways: lexical, derivational, or inflectional; synthetic ('morphological') and analytic ('syntactic'). Verbs tend to have inherent aspectual meaning because the situations described by them tend to have inherent temporal properties.

Beautiful Soup Documentation — Beautiful Soup v4.0.0 documentation Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. These instructions illustrate all major features of Beautiful Soup 4, with examples. This document covers Beautiful Soup version 4.12.1. You might be looking for the documentation for Beautiful Soup 3. This documentation has been translated into other languages by Beautiful Soup users: Getting help If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. When reporting an error in this documentation, please mention which translation you’re reading. Here’s an HTML document I’ll be using as an example throughout this document. Running the “three sisters” document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure: $ apt-get install python3-bs4

True Knowledge — The Internet Answer Engine The Story of Evi Evi was founded in August 2005, originally under the name of True Knowledge, with the mission of powering a new kind of search experience where users can access the world’s knowledge simply by asking for the information they need in a way that is completely natural. The True Knowledge internet answer engine was launched in 2007 to excellent response from users who were not only able to access the wealth of information Evi could provide, but were able to contribute directly to the ever growing database of facts. In 2011 development began on Evi, a brand new AI that advanced on the technology within the True Knowledge platform and which users would be able to interact with via her own mobile app. In October 2012, Evi was acquired by Amazon and is proud to now be part of the Amazon group of companies. What we do? Evi’s mission is to help people get what they want and need through our understanding of each user and the world they live in.

Terminology Extraction Introduction Terminology is the sum of the terms which identify a specific topic. Terminology extraction is the process of identifying these terms in a text. The idea is to compare the frequency of words in a given document with their frequency in the language. Technology It uses Poisson statistics, maximum likelihood estimation and inverse document frequency to compare the frequency of words in a given document with their frequency in a generic corpus of 100 million words per language. Why have we developed this? Translated has developed this technology to help its translators spot the difficulties in a document and to simplify the process of creating glossaries. We also use it to improve search results in traditional search engines (es. I want it! If you are interested in this technology, please read more on Translated Labs and our services for natural language processing. I could do better!
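The core idea, comparing a word's frequency in a document with its frequency in the language, can be sketched as follows. This is a simplified stand-in and not Translated's method: it uses a smoothed log-ratio weighted by the word count rather than Poisson statistics or maximum likelihood estimation, and the function name `term_scores` and the reference counts are hypothetical.

```python
import math
from collections import Counter


def term_scores(doc_words, corpus_counts, corpus_size):
    """Score words by how over-represented they are in the document."""
    doc_counts = Counter(doc_words)
    doc_size = len(doc_words)
    scores = {}
    for word, count in doc_counts.items():
        doc_rate = count / doc_size
        # Add-one smoothing so words missing from the corpus do not divide by zero.
        corpus_rate = (corpus_counts.get(word, 0) + 1) / (corpus_size + 1)
        scores[word] = count * math.log(doc_rate / corpus_rate)
    return scores


# Hypothetical reference counts out of a one-million-word corpus.
reference = {"the": 60000, "to": 25000, "next": 500,
             "packet": 5, "forwards": 3, "router": 2}
document = "the router forwards the packet to the next router".split()

scores = term_scores(document, reference, corpus_size=1_000_000)
best_term = max(scores, key=scores.get)
print(best_term)
```

Common words such as the occur often in the document but are just as common in the reference corpus, so they score lower than rare, topic-specific words such as router.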

Computer Networking : Principles, Protocols and Practice | INL: IP Networking Lab Computer Networking : Principles, Protocols and Practice (aka CNP3) is an ongoing effort to develop an open-source networking textbook that could be used for in-depth undergraduate or graduate networking courses. The first edition of the textbook used the top-down approach initially proposed by Jim Kurose and Keith Ross for their Computer Networks textbook published by Addison Wesley. CNP3 is distributed under a Creative Commons license. The second edition takes a different approach. The new features of the second edition are: The second edition of the ebook is now divided into two main parts. The first part of the ebook uses a bottom-up approach and focuses on the principles of computer networks without entering into protocol and practical details. Numerous exercises are also provided, as well as interactive quizzes that enable students to verify their understanding of the different chapters, and lab experiments with netkit and other software tools. First edition of the textbook