Data Education

IDE - Overview. NetBeans IDE lets you quickly and easily develop Java desktop, mobile, and web applications, as well as HTML5 applications with HTML, JavaScript, and CSS.

The IDE also provides a great set of tools for PHP and C/C++ developers. It is free and open source and has a large community of users and developers around the world.

Best Support for Latest Java Technologies. NetBeans IDE provides first-class comprehensive support for the newest Java technologies and latest Java specification enhancements before other IDEs. It is the first free IDE providing support for JDK 8, JDK 7, Java EE 7 including its related HTML5 enhancements, and JavaFX 2. With its constantly improving Java Editor, many rich features and an extensive range of tools, templates and samples, NetBeans IDE sets the standard for developing with cutting-edge technologies out of the box.

Fast & Smart Code Editing.

Screen_scraping.

Pyquery: a jquery-like library for python - pyquery 1.2.4 documentation. Pyquery allows you to make jquery queries on xml documents.

The API is as similar to jquery as possible. pyquery uses lxml for fast xml and html manipulation.

Ironmacro - GUI Automation for .NET.

Pyscraper - simple python based HTTP screen scraper.

Xkcd-viewer - A small test project using screen scraping.

Juicedpyshell - This Python Firefox Shell Mashup lets you automate Firefox browser using python scripts. The Juiced Python Firefox Shell lets you automate a browser using python scripts.

It requires that the pyxpcomext extension be installed. It is useful for browser automation, including automated testing of web sites. It makes it easy to do screen scraping and html manipulation using Python. This project is a fork of the pyShell project. You can find the original pyShell in the pyxpcomext examples.

Examples - webscraping documentation.

Simple extraction. Extract the project title from the Google Code page:

    from webscraping import download, xpath
    D = download.Download()
    # download and cache the Google Code webpage
    html = D.get('
    # use xpath to extract the project title
    project_title = xpath.get(html, '//div[@id="pname"]/a/span')

Blog scraper. Scrape all articles from a blog:

    import itertools
    import urlparse
    from webscraping import common, download, xpath

    DOMAIN = ...
    writer = common.UnicodeWriter('articles.csv')
    writer.writerow(['Title', 'Num reads', 'URL'])
    seen_urls = set() # track which article URLs have already been seen, to prevent duplicates
    D = download.Download()

    # iterate each of the categories
    for category_link in ('/developer/knowledge-base?

Business directory threaded scraper.

Scrapemark - Documentation. ScrapeMark is analogous to a regular expression engine.

A ‘pattern’ with special syntax is applied to the HTML being scraped. If the pattern correctly matches, captured values are returned. ScrapeMark’s pattern syntax is simpler than regular expression syntax and is optimized for use with HTML. Also, better utilities are provided for structuring and modifying captured text before being returned. Internally, ScrapeMark compiles a pattern down to a set of regular expressions, making it very fast, faster than any DOM-based approach.

Pattern Syntax. A pattern contains PLAIN OLD HTML, in addition to some special markup.

12.2 Parsing HTML documents. This section only applies to user agents, data mining tools, and conformance checkers.

The rules for parsing XML documents into DOM trees are covered by the next section, entitled "The XHTML syntax". User agents must use the parsing rules described in this section to generate the DOM trees from text/html resources. Together, these rules define what is referred to as the HTML parser. While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.

Mechanize.

Scrapemark - Easy Python Scraping Library. NOTE: This project is no longer maintained!

It utilizes an HTML-like markup language to extract the data you need. You get your results as plain old Python lists and dictionaries.

How to get along with an ASP webpage. Fingal County Council of Ireland recently published a number of sets of Open Data, in nice clean CSV, XML and KML formats.

Unfortunately, the one set of Open Data that was difficult to obtain was the list of sets of open data. That’s because the list was split across four separate pages. The important thing to observe is that the Next >> link is no ordinary link. You can see something is wrong when you hover your cursor over it. Here’s what it looks like in the HTML source code:

<a id="lnkNext" href="javascript:__doPostBack('lnkNext','')">Next >></a>

What it does (instead of taking the browser to the next page) is execute the javascript function __doPostBack(). Stepping through the javascript code to untangle this could take so long that it would be a hopeless waste of time, but for the fact that this is code generated by Microsoft and there are literally millions of webpages that work in exactly the same way.
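A minimal sketch of how a scraper might imitate such a link, under stated assumptions: in ASP.NET WebForms pages, __doPostBack(target, argument) conventionally copies its two arguments into the hidden form fields __EVENTTARGET and __EVENTARGUMENT and then submits the page's form. The helper names, the sample page, and its __VIEWSTATE value below are hypothetical, and a real page may require further fields.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

# Matches href values such as: javascript:__doPostBack('lnkNext','')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def build_postback_form(page_html, href):
    """Return the form fields a browser would submit for a postback link."""
    match = POSTBACK_RE.search(href)
    if match is None:
        raise ValueError("not a __doPostBack link: %r" % href)
    collector = HiddenInputCollector()
    collector.feed(page_html)
    fields = dict(collector.fields)
    # Assumption: the page follows the usual WebForms postback convention.
    fields["__EVENTTARGET"] = match.group(1)
    fields["__EVENTARGUMENT"] = match.group(2)
    return fields


# Hypothetical sample page, shaped like the link quoted above.
SAMPLE_PAGE = """
<form method="post" action="list.aspx">
<input type="hidden" name="__VIEWSTATE" value="abc123" />
<a id="lnkNext" href="javascript:__doPostBack('lnkNext','')">Next >></a>
</form>
"""

form = build_postback_form(SAMPLE_PAGE, "javascript:__doPostBack('lnkNext','')")
body = urlencode(form)  # URL-encoded body for the POST request
```

The encoded body would then be sent as a POST request to the form's action URL, and the hidden fields collected again from each response before requesting the following page.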

<script type="text/javascript"> //<!

Mechanize.

Screen_scraping.

Using A Gui To Build Packages. Not everyone is a command line junkie.

Some folks actually prefer the comfort of a Windows GUI application for performing tasks such as package creation. The NuGet Package Explorer click-once application makes creating packages very easy. It's also a great way to examine packages and learn how packages are structured. If you’re integrating building packages into a build system, then using NuGet.exe to create and publish packages is a better choice.

Installation. Installing Package Explorer is easy: click here and you’re done! Package Explorer is a click-once application, which means every time you launch it, it will check for updates and allow you to keep the application up to date.

Creating a Package. To create a package, launch Package Explorer and select the File > New menu option (or hit CTRL + N).

Then select the Edit > Edit Package Metadata menu option (or CTRL + K) to edit the package metadata. The metadata editor provides a GUI editor for editing the underlying nuspec file.

C# - Scraping Content From Webpage? Related Forum Messages For ASP.NET category: Scraping Text Of A Webpage?

Web scraping with Python.

What are good Perl or Python starting points for a site scraping library. If you just want to scrape a handful of sites with consistent formatting, the easiest thing would probably be to use requests combined with regular expressions and python's built-in string processing.

    import re
    import requests

    resp = requests.get('
    regex = ('<a href="( '([a-zA-z0-9 ]+)</a>')
    for i, match in enumerate(re.finditer(regex, resp.content)):
        if i > 5: break
        url = match.group(1)
        print 'url:', url
        resp = requests.get(url)
        title = re.search('<h2>(.+)</h2>', resp.content).group(1)
        print 'title:', title
        body = resp.content.split('<div id="userbody">', 1)[1]
        body = body.split('<script type="text/javascript">')[0]
        body = body.split('<!

Using A Gui To Build Packages.

C# - Scraping Content From Webpage?

Webscraping with Python.
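The regular-expression approach described in that answer can be sketched in a self-contained way as follows. This is an illustration only, written for Python 3 with the standard library: the pattern, the helper names, and the sample markup are all made up, and the download step is left out (with requests, the text of a fetched page is available as `requests.get(url).text`).

```python
import re

# Hypothetical pattern: captures the href and the visible text of simple links.
LINK_RE = re.compile(r'<a href="([^"]+)"[^>]*>([A-Za-z0-9 ]+)</a>')


def extract_links(html, limit=5):
    """Return up to `limit` (url, text) pairs found in `html`."""
    links = []
    for match in LINK_RE.finditer(html):
        if len(links) >= limit:
            break
        links.append((match.group(1), match.group(2)))
    return links


def text_between(html, start_marker, end_marker):
    """Return the text after start_marker and before end_marker, or None."""
    parts = html.split(start_marker, 1)
    if len(parts) < 2:
        return None
    return parts[1].split(end_marker, 1)[0]


# Made-up sample markup standing in for a downloaded page.
SAMPLE = (
    '<a href="/post/1">First post</a>'
    '<a href="/post/2" class="x">Second post</a>'
    '<div id="userbody">Hello world'
    '<script type="text/javascript">x()</script></div>'
)

links = extract_links(SAMPLE)
body = text_between(SAMPLE, '<div id="userbody">',
                    '<script type="text/javascript">')
```

Patterns like this only hold up while the site's markup stays consistent, which is the condition the answer itself sets for using regular expressions.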

Scrape.py. Scrape.py is a Python module for scraping content from webpages. Using it, you can easily fetch pages, follow links, and submit forms. Cookies, redirections, and SSL are handled automatically. (For SSL, you either need a version of Python with the socket.ssl function, or the curl command-line utility.) scrape.py does not parse the page into a complete parse tree, so it can handle pages with sloppy syntax. You are free to locate content in the page according to nearby text, tags, or even comments. You can download the module or read the documentation page.

Julian_Todd / Python mechanize cheat sheet.

Mechanize - Documentation. Full API documentation is in the docstrings and the documentation of urllib2.

The documentation in these web pages is in need of reorganisation at the moment, after the merge of ClientCookie and ClientForm into mechanize.

Tests and examples

Examples. The front page has some introductory examples. The examples directory in the source packages contains a couple of silly, but working, scripts to demonstrate basic use of the module. See also the forms examples (these examples use the forms API independently of mechanize.Browser).

Tests. To run the tests: python test.py.

Screen Scraping.

Probabilistic Graphical Models. About the Course: What are Probabilistic Graphical Models? Uncertainty is unavoidable in real-world applications: we can almost never predict with certainty what will happen in the future, and even in the present and the past, many important aspects of the world are not observed with certainty.

Probability theory gives us the basic foundation to model our beliefs about the different possible states of the world, and to update these beliefs as new evidence is obtained. These beliefs can be combined with individual preferences to help guide our actions, and even in selecting which observations to make. While probability theory has existed since the 17th century, our ability to use it effectively on large problems involving many inter-related variables is fairly recent, and is due largely to the development of a framework known as Probabilistic Graphical Models (PGMs).
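As a small illustration of updating beliefs when new evidence arrives, the sketch below applies Bayes' rule to a two-state example. The states, the numbers, and the function name are invented for illustration and are not taken from the course material.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over states given a prior and P(evidence | state)."""
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence has zero probability under the prior")
    return {state: p / total for state, p in unnormalized.items()}


# Hypothetical belief about the weather before looking outside.
prior = {"rain": 0.2, "dry": 0.8}
# Hypothetical probability of observing wet grass in each state.
likelihood_wet_grass = {"rain": 0.9, "dry": 0.1}

posterior = bayes_update(prior, likelihood_wet_grass)
```

Here the belief in rain rises from 0.2 to 0.18 / 0.26, roughly 0.69, once wet grass is observed. Graphical models extend this kind of update to many inter-related variables.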

Course Syllabus Topics covered include: Cryptography. Game Theory. Machine Learning. Khan Academy. Computer Science 101. Noticeboard for all MSc in Computing Students - DIT.