Data Education

IDE - Overview: NetBeans IDE lets you quickly and easily develop Java desktop, mobile, and web applications, as well as HTML5 applications with HTML, JavaScript, and CSS. The IDE also provides a great set of tools for PHP and C/C++ developers. It is free and open source and has a large community of users and developers around the world.
screen_scraping | Tyler Lesmann
pyquery: a jquery-like library for Python — pyquery 1.2.4 documentation: pyquery allows you to make jQuery queries on XML documents. The API is as similar to jQuery as possible. pyquery uses lxml for fast XML and HTML manipulation. This is not (or at least not yet) a library to produce or interact with JavaScript code.
ironmacro - GUI Automation for .NET
pyscraper - simple python based HTTP screen scraper
xkcd-viewer - A small test project using screen scraping
juicedpyshell - This Python Firefox Shell Mashup lets you automate the Firefox browser using Python scripts. The Juiced Python Firefox Shell requires that the pyxpcomext extension be installed. It is useful for browser automation, including automated testing of web sites, and makes it easy to do screen scraping and HTML manipulation using Python. This project is a fork of the pyShell project.
Simple extraction: extract the project title from the Google Code page:

from webscraping import download, xpath

D = download.Download()
# download and cache the Google Code webpage
html = D.get('http://code.google.com/p/webscraping')
# use xpath to extract the project title
project_title = xpath.get(html, '//div[@id="pname"]/a/span')

Blog scraper: scrape all articles from a blog. Examples — webscraping documentation
Scrapemark - Documentation: ScrapeMark is analogous to a regular expression engine. A 'pattern' with special syntax is applied to the HTML being scraped. If the pattern correctly matches, captured values are returned. ScrapeMark's pattern syntax is simpler than regular expression syntax and is optimized for use with HTML.
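The idea can be sketched with the stdlib alone: translate a pattern with {{name}} holes into a regex with named groups. This toy translator is illustrative only, not ScrapeMark's actual grammar:

```python
import re

def pattern_to_regex(pattern):
    # split on {{ name }} holes; even pieces are literal HTML, odd pieces are names
    parts = re.split(r'\{\{\s*(\w+)\s*\}\}', pattern)
    regex = ''.join(
        re.escape(p) if i % 2 == 0 else '(?P<%s>.*?)' % p
        for i, p in enumerate(parts))
    return re.compile(regex, re.DOTALL)

rx = pattern_to_regex('<td>{{ name }}</td><td>{{ price }}</td>')
m = rx.search('<tr><td>Widget</td><td>4.99</td></tr>')
print(m.group('name'), m.group('price'))  # Widget 4.99
```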
12.2 Parsing HTML documents: This section only applies to user agents, data mining tools, and conformance checkers. The rules for parsing XML documents into DOM trees are covered by the next section, entitled "The XHTML syntax". User agents must use the parsing rules described in this section to generate the DOM trees from text/html resources.
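These parsing rules are what HTML parser libraries implement; Python's stdlib html.parser exposes the resulting token stream as events, e.g. to pull out a page title:

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Collect the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

p = TitleGrabber()
p.feed('<html><head><title>Hello</title></head></html>')
print(p.title)  # Hello
```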
mechanize: Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize. The examples below are written for a website that does not exist (example.com), so cannot be run. There are also some working examples that you can run.
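The "stateful" part is largely a cookie jar shared across requests; a stdlib-only sketch of that core idea (this is not mechanize's actual API):

```python
import http.cookiejar
import urllib.request

# one jar shared by every request made through this opener
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# each opener.open(url) call now resends cookies set by earlier responses,
# which is what keeps a login session alive between page fetches
```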


Scrapemark - Easy Python Scraping Library: NOTE: This project is no longer maintained! It utilizes an HTML-like markup language to extract the data you need. You get your results as plain old Python lists and dictionaries. Scrapemark internally utilizes regular expressions and is super-fast.
Fingal County Council of Ireland recently published a number of sets of Open Data, in nice clean CSV, XML and KML formats. Unfortunately, the one set of Open Data that was difficult to obtain was the list of sets of open data. That's because the list was separated into four separate pages. The important thing to observe is that the Next >> link is no ordinary link. You can see something is wrong when you hover your cursor over it. Here's what it looks like in the HTML source code: How to get along with an ASP webpage | ScraperWiki
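Links like that trigger JavaScript's __doPostBack, so following them means replaying the page's hidden form fields in a POST. A rough sketch of building that payload (the regex and the made-up snippet are illustrative; the __VIEWSTATE/__EVENTTARGET field names are the usual ASP.NET convention):

```python
import re
import urllib.parse

def postback_payload(html, event_target):
    # grab the ASP.NET hidden inputs (__VIEWSTATE, __EVENTVALIDATION, ...)
    # assumes the conventional name="..." value="..." attribute order
    fields = dict(re.findall(
        r'<input[^>]+name="(__[A-Z]+)"[^>]+value="([^"]*)"', html))
    # tell the server which control was "clicked"
    fields['__EVENTTARGET'] = event_target
    fields['__EVENTARGUMENT'] = ''
    return urllib.parse.urlencode(fields)

page = '<input type="hidden" name="__VIEWSTATE" value="dDwt" />'
print(postback_payload(page, 'lnkNext'))
```

POSTing that payload back to the same URL is what makes the server render the next page of results.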
Using A Gui To Build Packages: Not everyone is a command line junkie. Some folks actually prefer the comfort of a Windows GUI application for performing tasks such as package creation. The NuGet Package Explorer click-once application makes creating packages very easy. It's also a great way to examine packages and learn how packages are structured. If you're integrating package building into a build system, then using NuGet.exe to create and publish packages is a better choice.
Related forum messages for the ASP.NET category: Scraping Text Of A Webpage? I need to scrape about the first 1000 visible words that appear on a web page. I am hoping something already exists like this so that I don't have to code it myself. C# - Scraping Content From Webpage?
Web scraping with Python | Stack Overflow
If you just want to scrape a handful of sites with consistent formatting, the easiest thing would probably be to use requests combined with regular expressions and Python's built-in string processing.

import re
import requests

resp = requests.get('http://austin.craigslist.org/cto/')
regex = ('<a href="(http://austin.craigslist.org/cto/[0-9]+\.html)">'
         '([a-zA-Z0-9 ]+)</a>')
for i, match in enumerate(re.finditer(regex, resp.content)):
    if i > 5:
        break
    url = match.group(1)
    print 'url:', url
    resp = requests.get(url)
    title = re.search('<h2>(.+)</h2>', resp.content).group(1)
    print 'title:', title
    body = resp.content.split('<div id="userbody">', 1)[1]
    body = body.split('<script type="text/javascript">')[0]
    body = body.split('<!-- START CLTAGS -->')[0]
    print 'body:', body
    print

Edit: To clarify, I've used Beautiful Soup and think it's overrated. I thought it was weird and wonky and hard to use in real-world circumstances.

What are good Perl or Python starting points for a site scraping library
Webscraping with Python

scrape.py scrape.py is a Python module for scraping content from webpages. Using it, you can easily fetch pages, follow links, and submit forms. Cookies, redirections, and SSL are handled automatically. (For SSL, you either need a version of Python with the socket.ssl function, or the curl command-line utility.) scrape.py does not parse the page into a complete parse tree, so it can handle pages with sloppy syntax.
Julian_Todd / Python mechanize cheat sheet
Full API documentation is in the docstrings and the documentation of urllib2. The documentation in these web pages is in need of reorganisation at the moment, after the merge of ClientCookie and ClientForm into mechanize. mechanize — Documentation
Screen Scraping

Khan Academy
Noticeboard for all MSc in Computing Students - DIT