
Free or Open source


Remarkable Free Tools for Gathering and Normalizing Messy Data

I recently ran across three remarkable pieces of software: Needle (as in haystack), DEiXTo, and Google Refine. All three products promise to make short work of the labor-intensive job of grooming data aggregated from various unstructured (aka “messy”) sources, like the World Wide Web.

I have not used these products yet. Indeed, I may never have an opportunity to fully explore them. So, don’t hold me to anything.

Needle
Needle is a web-based product that can aggregate data from multiple sources, either online or offline. For example, imagine performing a product search on Amazon, then individually navigating to each result and gathering each product’s details. Now imagine Needle “watching” you perform that manual process once or twice, then repeating it ad infinitum and storing the gathered data in a database. Needle has a limited version, for personal use only, that is free.

DEiXTo
DEiXTo is a desktop-based product.

Google Refine
Google Refine is a desktop-based product.

PHP Screen Scraping Tutorial - BRADINO

Screen Scraping is a great skill that every PHP developer should have experience with. Basically it involves scraping the source code of a web page, getting it into a string, and then parsing out the parts that you want to use. A simple application of screen scraping could be to build a database of all the NFL teams complete with player details.

What the heck, let’s do it… The first step is to get the page HTML into a PHP variable. This is super easy if the page is publicly accessible via a URL, with no login or form post required to access it. For more complex scraping you can use cURL to get the HTML source of the page, but the rest of the process would be about the same. I have found that pattern matching is easiest once the newlines have been removed from the source. Now that you have the source code of the page as a string variable, you need to parse out the results.

Screen Scraping: How to Screen Scrape a Website with PHP and cURL at DEVTRENCH

Screen scraping has been around on the internet since people could code on it, and there are dozens of resources out there to figure out how to do it (google "php screen scrape" to see what I mean). I want to touch on some things that I've figured out while scraping some screens.
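Both tutorials excerpted here follow the same pattern: get the page into a string, flatten it, and match. The following is a minimal sketch of that pattern, not the code from either post. It assumes a hypothetical markup fragment with `<li class="team">` elements and a hypothetical helper named `extractTeams`; it works on an inline sample string so that it runs without network access.

```php
<?php
// Hypothetical sample markup standing in for a fetched page. For a public
// URL, file_get_contents($url) would return a string like this one.
$html = "<ul>\n  <li class=\"team\">\n    Bears\n  </li>\n  <li class=\"team\">Packers</li>\n</ul>";

/**
 * Flatten the markup onto one line, then capture the text of each
 * <li class="team"> element. The class name is an assumption of this sketch.
 */
function extractTeams(string $html): array
{
    // Remove newlines so the pattern never has to span line breaks.
    $flat = str_replace(["\r", "\n"], '', $html);

    // Non-greedy capture of everything between the opening and closing tags.
    preg_match_all('~<li class="team">(.*?)</li>~', $flat, $matches);

    // Trim the whitespace left over from the original indentation.
    return array_map('trim', $matches[1]);
}

print_r(extractTeams($html));
```

Stripping the newlines first is what lets the plain `.*?` pattern match the first list item, whose text sits on its own line in the sample.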

I assume you have PHP running, and know your way around Windows. Do it on your local computer. If you are scraping a lot of data you are going to have to do it in an environment that doesn't have script time limits. The server that I use has a max execution time of 30 seconds, which just doesn't work if you are scraping a lot of data off of slow pages. Those are all of my tips.

To call curl just write a function like this:

    function GetCurlPage($pageSpec) {
        return shell_exec("curl $pageSpec");
    }

This is the code that calls the curl function:

    ob_start();
    $url = '...';
    $page = GetCurlPage($url);
    preg_match("~~", $page, $m);
    print $m[1];
    ob_end_flush();

Run the script from the command line and redirect its output to a file:

    php my_script_name.php > output.txt
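The `shell_exec` call in the excerpt hands the URL to the shell unquoted, so a URL containing `&` or spaces would be misread. Below is a hedged sketch of the same shell-out approach with the argument quoted. The helper names `buildCurlCommand`, `getCurlPage`, and `extractFirst` are hypothetical, and since the original URL and pattern are not given, the example runs against a placeholder page string and a placeholder `<title>` pattern.

```php
<?php
// Build the curl command line; escapeshellarg() quotes the URL so that
// characters such as & or spaces are not interpreted by the shell.
function buildCurlCommand(string $url): string
{
    return 'curl -s ' . escapeshellarg($url);
}

// Fetch a page by shelling out to curl, as the excerpt does.
// Returns an empty string if curl produced no output.
function getCurlPage(string $url): string
{
    return (string) shell_exec(buildCurlCommand($url));
}

// Return the first capture group of $pattern in $page, or null if there is
// no match. The pattern is expected to contain at least one capture group.
function extractFirst(string $pattern, string $page): ?string
{
    return preg_match($pattern, $page, $m) === 1 ? $m[1] : null;
}

// Lift the execution time limit for long scraping runs
// (the command-line PHP binary already defaults to no limit).
set_time_limit(0);

// Placeholder page and pattern; the original URL and regex are not given.
$page = '<html><head><title>Example page</title></head></html>';
echo extractFirst('~<title>(.*?)</title>~', $page), "\n";
```

In a real run, `$page` would come from `getCurlPage($url)` instead of the inline string, and the script would be started from the command line as described in the excerpt.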

Refine - Google Refine, a power tool for working with messy data (formerly Freebase Gridworks)

WebHarvest - web data extraction tool | Free Development software downloads

IRobotSoft -- Visual Web Scraping and Web Automation Tool for FREE

Yahoo! Pipes - Create a web scrapper

Yahoo! recently released a new Fetch Page module which dramatically increases the number of useful things that Pipes can do. With this new "pipe input" module we're no longer restricted to working with well-organised data sets in supported formats such as CSV, RSS, Atom, XML, JSON, iCal or KML. Now we can grab any HTML page we like and use the power of the Regex module to slice and dice the raw text into shape. In a nutshell, the Fetch Page module turns Yahoo! Pipes into a fully fledged web scraping IDE! As it happens, I already have a web scraping project which has been broken for some time now.

I don't have the energy to check out the hacky old PHP scrapers and debug the problem.

The Task at Hand
My web hosting provider (LunarPages - affiliate link alert!) … So, what will this entail? Looking at the first page of the Server Information board, I can get most of the information I need from here.

Starting the Pipe
It's time to head on over to Yahoo! … Fantastic! It looks like …

Finishing off
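The Fetch Page plus Regex combination described in the Pipes excerpt amounts to applying an ordered list of replace rules to raw page text. As a rough analogy in PHP, and not a reproduction of Pipes itself, the sketch below runs three such rules over a hypothetical fragment; the rule set, the helper name `applyRules`, and the sample text are all assumptions made for illustration.

```php
<?php
// Ordered rewrite rules, each a pattern and its replacement, applied in turn
// to the raw page text.
$rules = [
    '~\s+~'        => ' ',   // collapse runs of whitespace, including newlines
    '~<br\s*/?>~i' => "\n",  // turn line-break tags into real newlines
    '~<[^>]+>~'    => '',    // drop every remaining tag
];

function applyRules(string $text, array $rules): string
{
    // preg_replace() accepts parallel arrays and applies them in order.
    return trim(preg_replace(array_keys($rules), array_values($rules), $text));
}

// Hypothetical fragment standing in for a fetched status page.
$raw = "<td>Server\n   one</td><br/><td>Status:   <b>up</b></td>";
echo applyRules($raw, $rules), "\n";
```

The order of the rules matters: whitespace is collapsed before the `<br/>` tags become newlines, so only the intended line breaks survive in the result.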