Search / information retrieval

Folksonomy / social tag

Web scraping. Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.

It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. Web scraping a web page involves fetching it and extracting from it.[1][2] Fetching is the downloading of a page (which a browser does when you view the page). Therefore, web crawling is a main component of web scraping, to fetch pages for later processing.
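The two steps described above, fetching and extracting, can be sketched with Python's standard library. This is a minimal illustration, not a production scraper: the `LinkExtractor` class, the `fetch` and `extract` helpers, and the `SAMPLE_HTML` page are all hypothetical names introduced here, and the choice to extract the page title and link targets is an assumption about what "specific data" might mean.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag and the text of the <title> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Fetching: download the page, as a browser does when you view it."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract(html):
    """Extraction: pull the title and link targets out of the fetched markup."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


# Hypothetical sample page, used so the sketch runs without network access.
SAMPLE_HTML = """<html><head><title>Example catalog</title></head>
<body><a href="/items/1">First</a> <a href="/items/2">Second</a></body></html>"""

result = extract(SAMPLE_HTML)
print(result)
```

Against a live site, `extract(fetch(url))` would chain the two steps; a crawler would then feed the extracted links back into `fetch` to reach further pages.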

Once fetched, extraction can take place. Newer forms of web scraping involve listening to data feeds from web servers.

Faceted search.

Flamenco.berkeley.edu/talks/chi_course06.pdf.

Www.facetmap.com/pub/strict_faceted_classification.pdf.

ROI Of Faceted Navigation? 17 July 2011. Faceted navigation is widespread on the web (a.k.a. faceted search and faceted browse). It’s become an expected standard. I’ve written several posts on the subject and also have a popular workshop on faceted navigation. (Next one: 22 Oct 2011 in NYC). I’ve only been able to find a few studies or case studies reporting a measurable ROI of faceted navigation.

One helpful source is Endeca’s case studies:

Kiddicare.com: 100% increase in conversion rates; 100% increase in sales; additional 100% increase in conversion rates with PowerReviews
AutoScout 24: 5% increase in lead generation to dealers; 70% decrease in no results found
Otto Group: 130% increase in conversion rates; doubled conversion rates for visitors originating from pay-per-click marketing programs; search failure rate decreased from over 33% to 0.5%

If you have such data or evidence in any form, please let me and others know about it by commenting here. Some logical arguments include combinations of the following:

How to Make a Faceted Classification and Put It On the Web | Miskatonic University Press. Update February 2011: This has been translated into Dutch: Hoe maak je een facetclassificatie en hoe plaats je haar op het web?

Many thanks to Janette Shew and the Information Architecture Institute's Translations Initiative for doing this. Also, How to Reuse a Faceted Classification and Put It On the Semantic Web, by Bene Rodriguez-Castro, Hugh Glaser and Les Carr, takes my example of dishwashing detergents and extends it into ontologies and RDF. Update February 2007: IA Voice has used this paper as the basis for a series of four podcast episodes! It starts with IA E-Learning: Faceted Classification (1 of 4).

Denton, William. This follows Putting Facets on the Web: An Annotated Bibliography, and is the second paper I wrote for Prof. Faceted classifications are increasingly common on the World Wide Web, especially on commercial web sites (Adkisson 2003). What are facets? Facets and the web go very well together.

Faceted Metadata Search - Search Tools Report. Metadata is information about information: more precisely, it's structured information about resources.

This can be a single set of hierarchical subject labels, such as a Yahoo or Open Directory Project category. More often, the metadata has several facets: attributes in various orthogonal sets of categories. This is often stored in database record fields and tables, especially for product catalogs. Examples of faceted metadata include:

Music catalog: songs have attributes such as artist, title, length, genre, date...
Recipes: cuisine, main ingredients, cooking style, holiday...
Travel site: articles have authors, dates, places, prices...

Traditional Approaches to Structured Data Access

Parametric Search

Traditional field-based or parametric search engines for structured data have used a command line, for example AU:rosenfeld TI:web PB:oreilly, or provided a form to fill out. These require a lot of knowledge on the searcher's side: they have to know the values or choose from a popup menu.
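To make the contrast concrete, here is a rough Python sketch of both ideas: parsing a field-prefixed parametric query like the one quoted above, and computing facet counts so a searcher can see the available values instead of having to know them. The `SONGS` records, the function names, and the exact-match filtering rule are all assumptions made for illustration; the report does not prescribe any particular implementation.

```python
from collections import Counter

# Hypothetical music-catalog records; field names follow the examples above.
SONGS = [
    {"artist": "Artist A", "title": "Song One", "genre": "jazz", "year": 1959},
    {"artist": "Artist A", "title": "Song Two", "genre": "blues", "year": 1961},
    {"artist": "Artist B", "title": "Song Three", "genre": "jazz", "year": 1961},
]


def parse_parametric_query(query):
    """Split a field-prefixed query such as 'AU:rosenfeld TI:web' into a dict."""
    criteria = {}
    for term in query.split():
        field, _, value = term.partition(":")
        if field and value:  # skip terms without a FIELD:value shape
            criteria[field.upper()] = value.lower()
    return criteria


def filter_records(records, selections):
    """Keep records whose value matches every selected facet exactly."""
    return [
        record
        for record in records
        if all(record.get(facet) == value for facet, value in selections.items())
    ]


def facet_counts(records, facet):
    """Count how many records carry each value of one facet."""
    return Counter(record[facet] for record in records)


criteria = parse_parametric_query("AU:rosenfeld TI:web PB:oreilly")
genre_counts = facet_counts(SONGS, "genre")
jazz_songs = filter_records(SONGS, {"genre": "jazz"})
jazz_year_counts = facet_counts(jazz_songs, "year")

print(criteria)
print(dict(genre_counts))
print(dict(jazz_year_counts))
```

The parametric query only works if the searcher already knows the field codes and values. The facet counts, by contrast, can be shown next to each value and recomputed after every selection, so the remaining choices are always visible.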

Full-Text Searching and Database Content: SearchTools Report.