API - scraping

Information for Publishers. Control text parsing for your site with HTML. To control Instapaper's parser on your own site, you can use the Open Graph protocol.

Information for Publishers

Link Your Sites' Articles to Instapaper

Help your readers save your articles for later by linking to your custom Instapaper URL using this format:

Each of the url, title, and description values must be URL-encoded, and title and description are optional (but title is recommended).

Example link: Save this for later with Instapaper

When clicked, readers will be sent through an Instapaper confirmation page and redirected back to your site upon completion.

One-click buttons

You can create an <IFRAME> button for Instapaper with this format:

Like the link option above, each of the url, title, and description values must be URL-encoded, and title and description are optional (but title is recommended).
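The URL-encoding requirement above can be sketched in a few lines of JavaScript. This is only an illustration: the actual Instapaper link format is not shown here, so `BASE_URL` is a hypothetical placeholder, and passing url, title, and description as query-string parameters is an assumption. The helper name `buildSaveLink` is likewise made up for this example.

```javascript
// Hypothetical placeholder: substitute the real link format from Instapaper's publisher page.
const BASE_URL = "https://example.com/save";

// Builds a link with URL-encoded values. `url` is required;
// `title` and `description` are optional and skipped when absent.
function buildSaveLink(baseUrl, { url, title, description }) {
  if (!url) throw new Error("url is required");
  const params = [["url", url]];
  if (title) params.push(["title", title]);
  if (description) params.push(["description", description]);
  // encodeURIComponent escapes reserved characters such as ":", "/", "?", "&" and spaces.
  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `${baseUrl}?${query}`;
}

console.log(
  buildSaveLink(BASE_URL, {
    url: "https://example.com/a?b=1",
    title: "Hello & welcome",
  })
);
// prints https://example.com/save?url=https%3A%2F%2Fexample.com%2Fa%3Fb%3D1&title=Hello%20%26%20welcome
```

Encoding each value separately keeps characters like "&" or "?" inside an article URL or title from being read as part of the link's own query string.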

Example button: If readers are logged into Instapaper, this is a one-click button.

Opt out of text parsing for your site.

Home · jiminoc/goose Wiki.

Our APIs.

Readability

Scraping a page’s content using the node-readability module and Node.js. The following example shows how you can scrape a page’s contents and remove unnecessary markup by using the Node.js node-readability module.

Scraping a page’s content using the node-readability module and Node.js

First, install the node-readability and sanitizer modules by running the following commands in your Terminal:

$ npm install node-readability
$ npm install sanitizer

Next, create a new JavaScript file, app.js, in the same working directory where you installed the Node modules above, and enter the following code:

Finally, run the Node.js app by typing $ node ./app.js in your Terminal window.

About — Readability

About the Service

Readability is a free reading platform that aims to deliver a great reading experience wherever you are, and to provide a system to connect readers to the writers they enjoy.

A Brief History

Readability started off as a simple, JavaScript-based reading tool that turned any web page into a customizable reading view.
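Returning to the node-readability steps above: the code for app.js is not reproduced here, so the following is only a minimal sketch of what such a file might look like, not the original author's code. It assumes the callback interface documented by node-readability (`read(url, callback)` with an `article` object exposing `title`, `content`, and `close()`) and the `sanitize` function exported by the sanitizer module. The function name `scrapeArticle` and the injectable `deps` argument are made up for this example, so the logic can be tried without network access.

```javascript
// Sketch of app.js: fetch a page, extract the readable article, and sanitize its HTML.
// `deps` is a hypothetical hook for swapping in stand-ins; by default the
// installed node-readability and sanitizer modules are loaded.
function scrapeArticle(url, callback, deps = {}) {
  const read = deps.read || require("node-readability");
  const sanitize = deps.sanitize || require("sanitizer").sanitize;

  read(url, (err, article) => {
    if (err) return callback(err);
    const result = {
      title: article.title,
      // content can be empty when no article body is detected
      content: sanitize(article.content || ""),
    };
    article.close(); // release the parsed page once we are done with it
    callback(null, result);
  });
}

// Example usage (needs both modules installed and network access):
// scrapeArticle("http://example.com/some-article", (err, page) => {
//   if (err) throw err;
//   console.log(page.title);
//   console.log(page.content);
// });
```

Closing the article after reading its fields keeps memory use down when many pages are scraped in one run.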

Crawling - The Most Underrated Hack. It’s been a little while since I traded code with anyone.

Crawling - The Most Underrated Hack

But a few weeks ago, one of our entrepreneurs-in-residence, Javier, who joined Redpoint from VMware, told me about a Ruby gem called Mechanize that makes it really easy to crawl websites, particularly those with username/password logins. In about 30 minutes I had a working LinkedIn crawler built, pulling the names of new followers, new LinkedIn connections and LinkedIn status updates. All of that information is useful to me, but I just can’t seem to pull it from LinkedIn any other way. Crawling is the fastest, easiest and best solution. Over the years, I’ve used or built a number of crawlers: at Google to track competitive market share across ad networks, at Redpoint to build a BD pipeline and to mine social networks, and also mobile app store crawlers.

Crawlers are one of the most powerful tools at the disposal of startups and, I think, one of the most underrated.

What is the best way to extract specific information from HTML using Python.

An open source web scraping framework for Python.