
Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow at an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar-duration Mashing Up your own PLE session (scene setting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets for data table screenscraping. So here’s a quick summary of (part of) what I found I could do. The Google spreadsheet function =importHTML(“”,”table”,N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping. Grab the URL, fire up a new Google spreadsheet, and start to enter the formula “=importHTML” into one of the cells. Lurvely… :-)
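As a concrete illustration of the formula the post describes (the Wikipedia URL and table index below are example choices, not taken from the post; note that Google's current documentation describes the index as starting at 1, so N=1 targets the first table), a cell entry might look like this:

    =IMPORTHTML("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)", "table", 1)

Other cells can then refer to the imported range like any ordinary spreadsheet data.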

Coding for Journalists 101: A four-part series | Dan Nguyen pronounced fast is danwin Update, January 2012: Everything…yes, everything, is superseded by my free online book, The Bastards Book of Ruby, which is a much more complete walkthrough of basic programming principles with far more practical and up-to-date examples and projects than what you’ll find here. I’m only keeping this old walkthrough up as a historical reference. I’m sure the code is so ugly that I’m not going to even try re-reading it. So check it out: The Bastards Book of Ruby -Dan Update, Dec. 30, 2010: I published a series of data collection and cleaning guides for ProPublica, to describe what I did for our Dollars for Docs project. So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals so that he/she could write a web scraper to collect data from public websites. DISCLAIMER: The code, data files, and results are meant for reference and example only.

Creating a Scraper for Multiple URLs Using Regular Expressions | OutWit Technologies Blog Important Note: The tutorials you will find on this blog may become outdated with new versions of the program. We have now added a series of built-in tutorials in the application, accessible from the Help menu; you should run these to discover the Hub. NOTE: This tutorial was created using version 0.8.2; the Scraper Editor interface has changed considerably since then. In this example we’ll redo the scraper from the previous lesson using regular expressions. Recap: for complex web pages or specific needs, when the automatic data extraction functions (table, list, guess) don’t provide you with exactly what you are looking for, you can extract data manually by creating your own scraper. First, launch OutWit Hub and open the target page in the Page view: there you will see a list of leading firms by activity. If you click on the List view you can see a list of all the URLs and their related companies. Now let’s see if this works.
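OutWit Hub keeps the regular expressions inside its scraper editor, but the same idea can be sketched in a few lines of Python; everything concrete below (the URL and the markup the pattern expects) is an assumption for illustration, not taken from the tutorial:

    import re
    import urllib.request

    # Placeholder URL standing in for the directory page used in the tutorial.
    url = "https://example.com/leading-firms"
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

    # Assumed markup: each company name sits in a cell like
    # <td class="company">Acme Corp</td>; a real page will differ.
    pattern = re.compile(r'<td class="company">\s*([^<]+?)\s*</td>')

    for name in pattern.findall(html):
        print(name)

In practice an HTML parser such as Beautiful Soup (covered below) tends to be more robust than regular expressions once the markup gets irregular.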

Ruby Programming Language

An Introduction to Compassionate Screen Scraping Screen scraping is the art of programmatically extracting data from websites. If you think it's useful: it is. If you think it's difficult: it isn't. And if you think it's easy to really piss off administrators with ill-considered scripts, you're damn right. We're going to be doing this tutorial in Python, and will use the httplib2 and BeautifulSoup libraries to make things as easy as possible. Websites crash. For my blog, the error reports I get are all generated by overzealous webcrawlers from search engines (perhaps the most ubiquitous species of screen scraper). This brings us to my single rule for socially responsible screen scraping: screen scraper traffic should be indistinguishable from human traffic. Cache fervently. Now, armed with those three guidelines, let's get started screen scraping. Setting up the libraries: first we need to install the httplib2 and BeautifulSoup libraries: sudo easy_install BeautifulSoup, then sudo easy_install httplib2. Choosing a scraping target: now let's get scraping.
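A minimal sketch of the setup the excerpt describes, written for the Python 2 / BeautifulSoup 3 era that the easy_install commands above imply; the URL is a placeholder:

    import httplib2
    from BeautifulSoup import BeautifulSoup

    # Giving Http() a directory name switches on httplib2's on-disk cache,
    # so repeated runs reuse responses instead of re-fetching the page.
    h = httplib2.Http(".cache")

    # Placeholder URL; point this at the page you actually want to scrape.
    response, content = h.request("http://example.com/", "GET")

    soup = BeautifulSoup(content)
    for link in soup.findAll("a", href=True):
        print link["href"]

The on-disk cache is what keeps the script's traffic closer to human levels, which is the point of the "cache fervently" rule.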

Beautiful Soup: We called him Tortoise because he taught us. You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects. Beautiful Soup is a Python library designed for quick-turnaround projects like screen scraping. It provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. Beautiful Soup parses anything you give it and does the tree traversal stuff for you. Valuable data that was once locked up in poorly-designed websites is now within your reach. Interested? Getting and giving support: if you have questions, send them to the discussion group; if you use Beautiful Soup as part of your work, please consider a Tidelift subscription.
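A short sketch of those navigating, searching and modifying idioms, using the current bs4 package name (an assumption, since the excerpt doesn't pin a version); the sample HTML is invented for the example:

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>Example listing</title></head>
    <body><table>
      <tr><td class="name">Acme Corp</td><td class="phone">555-0100</td></tr>
      <tr><td class="name">Globex</td><td class="phone">555-0199</td></tr>
    </table></body></html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Navigating: the first <title> tag in the tree.
    print(soup.title.string)

    # Searching: every cell whose class is "name".
    for cell in soup.find_all("td", class_="name"):
        print(cell.get_text())

    # Modifying: replace a tag's text and re-serialize the tree.
    soup.find("td", class_="phone").string = "redacted"
    print(soup.prettify())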

Python Programming Language – Official Website

Branded journalists battle newsroom regulations With social media a big part of newsroom life, individual journalists often find their personal brands attractive selling points for future employers. But lately many of these same social media superstars are questioning whether newsrooms are truly ready for the branded journalist. In late January, Matthew Keys, Deputy Social Media Editor at Reuters, wrote a blog post in which he criticized his former employer (ABC affiliate KGO-TV in San Francisco) for taking issue with his use of social media. Not long after Keys’ post went live, CNN’s Roland Martin was suspended for comments he tweeted during the Super Bowl. What all of these events suggest is that newsrooms are still coming to terms with how to craft a useful social media policy that meets the needs of the organization, its individual employees, the medium and the audience. NPR’s media correspondent David Folkenflik has a thriving social media presence. “I haven’t felt any pressure to be anything other than myself,” Boyer told me.

OutWit Hub

IRobotSoft -- Visual Web Scraping and Web Automation Tool for FREE

Automated Form Submissions and Data Scraping - MySQL Hello Everyone! I'm working on a project that should help me to automate some processes that are extremely time-dependent, with a MySQL database. I'm presently working with two developers on this project on a contract basis to complete the job. I'm finding my developers hesitant to come up with a solution for how to implement what I'm requesting be done, so I'm wondering if it can be done at all. I use an online web application to host my data. All of the merchants' gift card services I use have a platform where you can check card balances. The above link is a page where you can check card balances. I would like the MySQL database, or another application if it's more suited, to have the following happen: I would like to repeat a similar task on a few other balance-checking pages as well. Any ideas on whether this is possible?
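What the poster describes is certainly feasible; as a rough sketch only, the following submits a card number to a balance-check form and records the result, with SQLite standing in for the MySQL database so the example stays self-contained, and with every URL, form field, table name and card number invented:

    import sqlite3
    import urllib.parse
    import urllib.request

    # Hypothetical balance-check endpoint and form field; a real merchant page
    # will differ and may need a login step or session cookies first.
    CHECK_URL = "https://example.com/giftcard/balance"

    def check_balance(card_number):
        data = urllib.parse.urlencode({"card_number": card_number}).encode()
        with urllib.request.urlopen(CHECK_URL, data=data) as resp:
            page = resp.read().decode("utf-8", errors="replace")
        # Pulling the balance out of the response is site-specific; here the
        # whole page body stands in for the parsed figure.
        return page.strip()

    # SQLite stands in for the poster's MySQL database; the schema is invented.
    conn = sqlite3.connect("balances.db")
    conn.execute("CREATE TABLE IF NOT EXISTS balances (card TEXT, balance TEXT)")
    for card in ["6006490000000001", "6006490000000002"]:  # made-up card numbers
        conn.execute("INSERT INTO balances VALUES (?, ?)", (card, check_balance(card)))
    conn.commit()
    conn.close()

Run on a schedule (cron, Task Scheduler, or a database event), this is the shape of the automation being asked about.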

Development of an automated climatic data scraping, filtering and display system. Computers and Electronics in Agriculture, doi:10.1016/j.compag.2009.12.006. Abstract: One of the many challenges facing scientists who conduct simulation and analysis of biological systems is the ability to dynamically access spatially referenced climatic, soil and cropland data. Over the past several years, we have developed an Integrated Agricultural Information and Management System (iAIMS), which consists of foundation class climatic, soil and cropland databases. These databases serve as a foundation to develop applications that address different aspects of cropping systems performance and management. In this paper we present the processes and approaches involved in the development of a climatic data system designed to automatically fetch data from different web sources, consolidate the data into a centralized database, and deliver the data through a web-based interface. Climatic data are usually available via web pages or FTP sites. Copyright © 2009 Elsevier B.V.

Automated Data Extraction/Web Scraping Services | Web scraping, or data extraction, is also referred to as “crawling”. Web scraping is the process of pulling information or content from disparate websites and organising this data to your requirements, whether in a form that allows it to be displayed on a website or used for offline purposes. Automated Data Collection: some clients need to collect data on a scheduled basis or on demand. If you need to monitor a specific website (or multiple websites) as its data changes over time, web scraping is what you need. Different industries, both large and small, have instances where they need to record and track data as it changes. Examples of automated data collection include: monitoring mortgage rates from several lending companies on a daily basis; automatically collecting stock price data on your favorite companies; capturing web pages on a daily basis to watch for changes. Examples of web site scraping include: How much does the service cost? How frequently can I get data?
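A minimal sketch of the "capture web pages on a daily basis to watch for changes" case; the URL and the once-a-day loop are placeholders (a cron job or a hosted scheduler would normally handle the timing):

    import hashlib
    import time
    import urllib.request

    # Placeholder page to watch; in practice a mortgage-rates or stock-quote page.
    URL = "https://example.com/rates"

    last_digest = None
    while True:
        html = urllib.request.urlopen(URL).read()
        digest = hashlib.sha256(html).hexdigest()
        if last_digest is not None and digest != last_digest:
            print("Page changed at", time.ctime())
        last_digest = digest
        time.sleep(24 * 60 * 60)  # roughly daily

Comparing a hash of the page is the simplest change detector; real services typically extract and store the specific fields being tracked.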

Data Feed Scraping | Product Feed Creation, Automated Website Data Extraction and Scraping. Feed Optimise™ specialises in master product feed creation, which can then be used as a data backbone to feed into price comparison engines, affiliate networks, shopping channels and more. We deliver high-quality, data-rich product feeds extracted from your website's data. We pride ourselves on the quality and ability to deliver comprehensive product data feed creation services. From data scraping to data feed delivery and distribution, we are able to tackle a range of diverse problems and are the leading authority on web data extraction and process automation. Feed Optimise does the work for you. We use proprietary software, highly customisable and developed by us, to scrape and crawl our customers' websites in order to create rich product data feeds for them. Lack of staff resources to carry out product feed creation? Receive your feeds as fresh as possible. Convert your product prices to any currency at run-time.

I need to automate/scrape data from IE I've got a task that is just screaming for automation. Every week, I have to get a number for each of 36 entities for some metrics I do, and that basically consists of counting the 'Y's in a certain column of a table on a company web page. Each entity requires picking a value in a dropdown, refreshing the page, and counting 'Y's. Here's what I do know (I don't know what matters): the site uses Kerberos for authentication; the site uses SSL; the page only works reliably in Internet Explorer. I have no previous experience with web automation, so I'm flying fairly blind. My first attempt gets me a blank IE window and an error message reading "Could not start AutoItX3 Control through OLE". Anyone have any ideas? Thanks, Carlos
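The excerpt doesn't show how Carlos's problem was resolved, but one common route for an IE-only, Kerberos/SSL page is to drive Internet Explorer itself over COM. The sketch below uses Python with pywin32 rather than AutoIt, and the URL, element id and column position are all invented:

    import time
    import win32com.client  # pywin32, an assumption; the poster was trying AutoItX

    ie = win32com.client.Dispatch("InternetExplorer.Application")
    ie.Visible = True
    ie.Navigate("https://intranet.example.com/metrics")  # placeholder URL

    # Wait for the page to load; IE itself handles the Kerberos and SSL parts.
    while ie.Busy or ie.ReadyState != 4:
        time.sleep(0.5)

    doc = ie.Document

    # Hypothetical dropdown id; triggering the page's own refresh after changing
    # the selection is site-specific and omitted here.
    doc.getElementById("entity").value = "ENTITY_01"

    # Count the 'Y's in the third column of the first table (positions invented).
    table = doc.getElementsByTagName("table").item(0)
    rows = table.getElementsByTagName("tr")
    count = 0
    for i in range(rows.length):
        cells = rows.item(i).getElementsByTagName("td")
        if cells.length > 2 and cells.item(2).innerText.strip() == "Y":
            count += 1
    print(count)

Because the browser does the authenticating, this approach sidesteps the Kerberos and SSL details that make raw HTTP scripting of such pages harder.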
