Remarkable Free Tools for Gathering and Normalizing Messy Data

I recently ran across three remarkable pieces of software: Needle (as in haystack), DEiXTo, and Google Refine. All three products promise to make short work of the labor-intensive job of grooming data aggregated from various unstructured (aka “messy”) sources, like the World Wide Web.

I have not used these products yet. Indeed, I may never have an opportunity to fully explore them. So, don’t hold me to anything. Nevertheless, I want to pass these jewels along before my memory fails me.

Needle

Needle is a web-based product that can aggregate data from multiple sources, either online or offline. For example, imagine performing a product search on Amazon, then individually navigating to each result and gathering each product’s details. Now imagine Needle “watching” you perform that manual process once or twice, then repeating it ad infinitum and storing the gathered data in a database. Needle has a limited free version that is for personal use only.

DEiXTo

Google Refine

Automation - Top screen scraper software

Screen Scraping: How to Screen Scrape a Website with PHP and cURL at DEVTRENCH

Screen scraping has been around on the internet since people could code on it, and there are dozens of resources out there to figure out how to do it (google "php screen scrape" to see what I mean). I want to touch on some things that I've figured out while scraping some screens. I assume you have PHP running and know your way around Windows.

Do it on your local computer. If you are scraping a lot of data, you will need to do it in an environment that doesn't have script time limits. The server that I use has a max execution time of 30 seconds, which just doesn't work if you are scraping a lot of data off of slow pages.
The best thing to do is to run your script from the command line, where there is no limit to how long a script can take to execute. Those are all of my tips.

To call curl, just write a function like this:

    // Shell out to the curl binary; escapeshellarg() keeps characters such as & and ? away from the shell.
    function GetCurlPage($pageSpec) {
        return shell_exec("curl " . escapeshellarg($pageSpec));
    }

To run the script that calls the curl function and save its output to a file:

    php my_script_name.php > output.txt

PHP Screen Scraping Tutorial BRADINO

Screen scraping is a great skill that every PHP developer should have experience with. Basically, it involves scraping the source code of a web page, getting it into a string, and then parsing out the parts that you want to use.

What the heck, let’s do it… The first step is to get the page HTML into a PHP variable. I have found pattern matching easiest when the string contains no newlines, so the next step is to strip them out. Now that you have the source code of the page as a string variable, you need to parse out the results. Once we have just the table containing the roster data, we need to parse out the rows and cells.
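
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tutorial's own listing: the sample markup, the `roster` class name, and the helper names `extractTable` and `parseRows` are all hypothetical, and an inline string stands in for the downloaded page so the example runs without network access.

```php
<?php
// Hypothetical sample markup standing in for a fetched page. A real script
// would fill $raw from a download, for example with file_get_contents($url).
$raw = "<html><body>\n<h1>Roster</h1>\n<table class=\"roster\">\n"
     . "<tr><th>No</th><th>Name</th><th>Pos</th></tr>\n"
     . "<tr><td>12</td><td>Sample Player</td><td>QB</td></tr>\n"
     . "<tr><td>87</td><td>Another Player</td><td>TE</td></tr>\n"
     . "</table>\n</body></html>";

// Step 1: strip newlines, carriage returns and tabs so that patterns never
// have to match across line breaks.
$content = str_replace(["\r", "\n", "\t"], '', $raw);

// Step 2: cut out only the table of interest, from its opening tag through
// the first closing </table> that follows it. Returns null if not found.
function extractTable(string $html, string $openTag): ?string
{
    $start = strpos($html, $openTag);
    if ($start === false) {
        return null;
    }
    $end = strpos($html, '</table>', $start);
    if ($end === false) {
        return null;
    }
    return substr($html, $start, $end - $start + strlen('</table>'));
}

// Step 3: split the table into rows, then each row into trimmed cell texts.
function parseRows(string $table): array
{
    $rows = [];
    preg_match_all('|<tr\b[^>]*>(.*?)</tr>|s', $table, $rowMatches);
    foreach ($rowMatches[1] as $rowHtml) {
        preg_match_all('|<t[dh]\b[^>]*>(.*?)</t[dh]>|s', $rowHtml, $cellMatches);
        $rows[] = array_map(function ($cell) {
            return trim(strip_tags($cell));
        }, $cellMatches[1]);
    }
    return $rows;
}

$table = extractTable($content, '<table class="roster"');
$rows  = ($table === null) ? [] : parseRows($table);

foreach ($rows as $cells) {
    echo implode(' | ', $cells), PHP_EOL;
}
```

Regular expressions are workable for a page whose layout you have inspected by hand, as here, but they break easily when the markup changes; for anything less predictable, PHP's DOMDocument class is usually the sturdier choice.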
So now we have parsed all the data for a given team from the official NFL site. This simple scraping example is just to illustrate the basic concept.