background preloader

Beautiful Soup Documentation — Beautiful Soup v4.0.0 documentation

Beautiful Soup Documentation — Beautiful Soup v4.0.0 documentation
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations. The examples in this documentation should work the same way in Python 2.7 and Python 3.2. You might be looking for the documentation for Beautiful Soup 3. This documentation has been translated into other languages by Beautiful Soup users: 这篇文档当然还有中文版.このページは日本語で利用できます(外部リンク)이 문서는 한국어 번역도 가능합니다. Here’s an HTML document I’ll be using as an example throughout this document. Here are some simple ways to navigate that data structure: One common task is extracting all the URLs found within a page’s <a> tags: Tag Name Related:  Pythonlogank1

Setting up Python in Windows 7 | Anthony DeBarros An all-wise journalist once told me that “everything is easier in Linux,” and after working with it for a few years I’d have to agree — especially when it comes to software setup for data journalism. But … Many newsroom types spend the day in Windows without the option of Ubuntu or another Linux OS. I’ve been planning some training around Python soon, so I compiled this quick setup guide as a reference. I hope you find it helpful. Set up Python on Windows 7 Get started: 1. Note: Python currently exists in two versions, the older 2.x series and newer 3.x series (for a discussion of the differences, see this). 2. 3. Right-click Computer and select Properties.In the dialog box, select Advanced System Settings.In the next dialog, select Environment Variables.In the User Variables section, edit the PATH statement to include this: 4. That will load the Python interpreter: Press Control-Z plus Return to exit the interpreter and get back to a C: prompt. Set up useful Python packages 3.

9.7. itertools — Functions creating iterators for efficient looping This module implements a number of iterator building blocks inspired by constructs from APL, Haskell, and SML. Each has been recast in a form suitable for Python. The module standardizes a core set of fast, memory efficient tools that are useful by themselves or in combination. For instance, SML provides a tabulation tool: tabulate(f) which produces a sequence f(0), f(1), .... These tools and their built-in counterparts also work well with the high-speed functions in the operator module. Infinite Iterators: Iterators terminating on the shortest input sequence: Combinatoric generators: 9.7.1. The following module functions all construct and return iterators. itertools.chain(*iterables) Make an iterator that returns elements from the first iterable until it is exhausted, then proceeds to the next iterable, until all of the iterables are exhausted. def chain(*iterables): # chain('ABC', 'DEF') --> A B C D E F for it in iterables: for element in it: yield element Roughly equivalent to: 9.7.2.

Pandas Pivot Table Explained - Practical Business Python Introduction Most people likely have experience with pivot tables in Excel. Pandas provides a similar function called (appropriately enough) pivot_table . While it is exceedingly useful, I frequently find myself struggling to remember how to use the syntax to format the output for my needs. If you are not familiar with the concept, wikipedia explains it in high level terms. As an added bonus, I’ve created a simple cheat sheet that summarizes the pivot_table. The Data One of the challenges with using the panda’s pivot_table is making sure you understand your data and what questions you are trying to answer with the pivot table. In this scenario, I’m going to be tracking a sales pipeline (also called funnel). Typical questions include: How much revenue is in the pipeline? Many companies will have CRM tools or other software that sales uses to track the process. Using a panda’s pivot table can be a good alternative because it is: Read in the data Let’s set up our environment first. Columns vs.

Time Series analysis tsa — statsmodels 0.7.0 documentation statsmodels.tsa contains model classes and functions that are useful for time series analysis. This currently includes univariate autoregressive models (AR), vector autoregressive models (VAR) and univariate autoregressive moving average models (ARMA). It also includes descriptive statistics for time series, for example autocorrelation, partial autocorrelation function and periodogram, as well as the corresponding theoretical properties of ARMA or related processes. It also includes methods to work with autoregressive and moving average lag-polynomials. Additionally, related statistical tests and some useful helper functions are available. Estimation is either done by exact or conditional Maximum Likelihood or conditional least-squares, either using Kalman Filter or direct filters. Currently, functions and classes have to be imported from the corresponding module, but the main classes will be made available in the statsmodels.tsa namespace. Descriptive Statistics and Tests Estimation

BeginnersGuide/NonProgrammers Python for Non-Programmers If you've never programmed before, the tutorials on this page are recommended for you; they don't assume that you have previous experience. If you have programming experience, also check out the BeginnersGuide/Programmers page. Books Each of these books can be purchased online and is also available as a completely free website. Automate the Boring Stuff with Python - Practical Programming for Total Beginners by Al Sweigart is "written for office workers, students, administrators, and anyone who uses a computer to learn how to code small, practical programs to automate tasks on their computer." Interactive Courses These sites give you instant feedback on programming problems that you can solve in your browser. CheckiO is a gamified website containing programming tasks that can be solved in Python 3. Resources for Younger Learners Build a "Pypet" Learn programming fundamentals in Python while building a tamagotchi style "Pypet" by Tatiana Tylosky. Videos Tools

JasperSnoek/spearmint Python Lists - Google for Education Python has a great built-in list type named "list". List literals are written within square brackets [ ]. Lists work similarly to strings -- use the len() function and square brackets [ ] to access data, with the first element at index 0. Assignment with an = on lists does not make a copy. The "empty list" is just an empty pair of brackets [ ]. Python's *for* and *in* constructs are extremely useful, and the first use of them we'll see is with lists. If you know what sort of thing is in the list, use a variable name in the loop that captures that information such as "num", or "name", or "url". The *in* construct on its own is an easy way to test if an element appears in a list (or other collection) -- value in collection -- tests if the value is in the collection, returning True/False. The for/in constructs are very commonly used in Python code and work on data types other than list, so you should just memorize their syntax. You can also use for/in to work on a string.

Pattern Pattern is a web mining module for the Python programming language. It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization. The module is free, well-document and bundled with 50+ examples and 350+ unit tests. Download Installation Pattern is written for Python 2.5+ (no support for Python 3 yet). To install Pattern so that the module is available in all Python scripts, from the command line do: > cd pattern-2.6 > python setup.py install If you have pip, you can automatically download and install from the PyPi repository: If none of the above works, you can make Python aware of the module in three ways: Quick overview pattern.web pattern.en The pattern.en module is a natural language processing (NLP) toolkit for English. pattern.search pattern.vector Case studies

Tutorial — Scrapy 0.15.1 documentation In this tutorial, we’ll assume that Scrapy is already installed on your system. If that’s not the case, see Installation guide. We are going to use Open directory project (dmoz) as our example domain to scrape. This tutorial will walk you through these tasks: Creating a new Scrapy projectDefining the Items you will extractWriting a spider to crawl a site and extract ItemsWriting an Item Pipeline to store the extracted Items Scrapy is written in Python. Creating a project¶ Before you start scraping, you will have set up a new Scrapy project. scrapy startproject tutorial This will create a tutorial directory with the following contents: tutorial/ scrapy.cfg tutorial/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ... These are basically: Defining our Item¶ Items are containers that will be loaded with the scraped data; they work like simple python dicts but provide additional protecting against populating undeclared fields, to prevent typos. Our first Spider¶ Crawling¶ Note [ ...

FMin · hyperopt/hyperopt Wiki This page is a tutorial on basic usage of hyperopt.fmin(). It covers how to write an objective function that fmin can optimize, and how to describe a search space that fmin can search. Hyperopt's job is to find the best value of a scalar-valued, possibly-stochastic function over a set of possible arguments to that function. The way to use hyperopt is to describe: the objective function to minimizethe space over which to searchthe database in which to store all the point evaluations of the searchthe search algorithm to use This (most basic) tutorial will walk through how to write functions and search spaces, using the default Trials database, and the dummy random search algorithm. Parallel search is possible when replacing the Trials database with a MongoTrials one; there is another wiki page on the subject of using mongodb for parallel search. Choosing the search algorithm is as simple as passing algo=hyperopt.tpe.suggest instead of algo=hyperopt.random.suggest. 1. 1.1 The Simplest Case 2.

Python Boot Camp The BootCamp has wrapped up ... if you took the bootcamp please give us feedback. We took over all the Brower Center! This is the main site for the Python Boot Camp Fall 2013, at the Brower Center (2150 Allston Way) near the UC Berkeley Campus from August 26 (Monday) to August 28 (Wednesday) 2013. If you are interested in a detailed introduction to various Pythonic topics to aid in research, consider the "Python Computing for Data Scientists" seminar course (Thursday 1-4pm; Fall 2013, CC #06080). You must have registered to participate in the Camp and doing so early means we can stay in touch with you with any new developments.

Related: