The Web Robots Pages

In a nutshell: Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. It works like this: a robot wants to visit a Web site URL. Before it does so, it first checks for /robots.txt and finds:

User-agent: *
Disallow: /

The "User-agent: *" line means the section applies to all robots; "Disallow: /" tells robots not to visit any page on the site. There are two important considerations when using /robots.txt: robots can ignore your /robots.txt, and the file itself is publicly visible, so don't try to use /robots.txt to hide information.

The details: The /robots.txt standard is a de-facto standard, is not owned by any standards body, and is not actively developed. The rest of this page gives an overview of how to use /robots.txt on your server, with some simple recipes.

How to create a /robots.txt file
Where to put it: the short answer is in the top-level directory of your web server.
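The "disallow everything" example above can be checked programmatically. A minimal sketch using Python's standard-library robots.txt parser (the user agent name "MyBot" and the example URL are made up for illustration):

```python
# Sketch: checking the "User-agent: * / Disallow: /" example above
# with Python's standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# "Disallow: /" bars every robot from every URL on the site.
print(rp.can_fetch("MyBot", "http://example.com/page.html"))  # False
```

In practice you would call rp.set_url(...) and rp.read() to fetch a live site's /robots.txt instead of parsing literal lines.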

Introduction to Load Balancing Using Node.js - Part 1 by Ross Johnson

Introduction: At Mazira, a lot of what we develop takes the form of web services. While most of these are only used internally, it is still important that they are high-performance and resilient. These services need to be ready to churn through hundreds of gigabytes of documents at a moment's notice, for example if we need to reprocess one of our document clusters.

Horizontal Scaling: When it comes to increasing the performance of websites and web services, there are only a couple of options: increase the efficiency of the code, or scale up the server infrastructure.

The Example: In order to demonstrate load balancing, we first need a sample application (Listing 1: pi-server.js). To run this example, put it into a file pi-server.js, browse to the folder in a command line and then execute:

$ node pi-server.js 8000

Now, going to localhost:8000 in a web browser produces the result shown in Figure 1 (Example Pi-Server Result).

Load Testing: The JMeter test setup is shown in Figure 2.

EDIT: Part 2
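The pi-server.js listing itself is not reproduced in this excerpt. As a rough stdlib-only Python analogue (hypothetical, not the author's Node.js code), here is a server in the same spirit: it approximates pi with the Leibniz series and serves the result over HTTP on a port given on the command line:

```python
# Hypothetical Python analogue of the article's pi-server.js example:
# approximate pi with the Leibniz series and serve it over HTTP.
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

def leibniz_pi(terms):
    """Approximate pi as 4 * (1 - 1/3 + 1/5 - 1/7 + ...)."""
    total = 0.0
    for k in range(terms):
        total += (-1) ** k / (2 * k + 1)
    return 4 * total

class PiHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Recompute on every request, so each hit costs real CPU time,
        # which is what makes the example useful for load testing.
        body = str(leibniz_pi(100_000)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve it (blocks forever), e.g. `python pi_server.py 8000`:
#   port = int(sys.argv[1]) if len(sys.argv) > 1 else 8000
#   HTTPServer(("", port), PiHandler).serve_forever()
```

The deliberate per-request CPU cost mirrors the article's point: a CPU-bound handler is exactly the workload that benefits from horizontal scaling behind a load balancer.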

Robots exclusion standard

The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention for advising cooperating web crawlers and other web robots about accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.

History: The standard was proposed by Martijn Koster,[1][2] while working for Nexor[3] in February 1994[4] on the www-talk mailing list, the main communication channel for WWW-related activities at the time. It quickly became a de facto standard that present and future web crawlers were expected to follow; most complied, including those operated by search engines such as WebCrawler, Lycos and AltaVista.

About the standard: A robots.txt file covers one origin.

CSS Graphic Buttons - CSS Debutant

Image buttons that change appearance when the mouse passes over them are widely used on web pages. For a long time, most of these graphic buttons were animated with JavaScript or, worse (because heavier), with a Java applet. With CSS "rollover" effects, lightness and simplicity are the rule for creating good-looking buttons.

Works with: all graphical browsers. Properties used: background, color, display, float, line-height, margin, padding, text-align, text-decoration, vertical-align, width.

Simple CSS button, (x)html code: Since a button is generally used as a link to another page, the selectors used in the CSS code will be a and a:hover for changing the button's appearance on hover (if such a change is wanted, of course). For a single button, the HTML code can be as follows:

<div class="bouton"><p><a href="#">Bouton</a></p></div>

Now take two images, one for the button at rest and the other for the hover state.

CSS code. Multiple CSS buttons

HBase - Installing Apache HBase (TM) on Windows using Cygwin

Introduction: Apache HBase (TM) is a distributed, column-oriented store, modeled after Google's BigTable. Apache HBase is built on top of Hadoop for its MapReduce and distributed file system implementation. All these projects are open-source and part of the Apache Software Foundation. Being distributed, large-scale platforms, the Hadoop and HBase projects mainly focus on *nix environments for production installations.

Purpose: This document explains the intricacies of running Apache HBase on Windows using Cygwin as an all-in-one single-node installation for testing and development.

Installation: Running Apache HBase on Windows requires three technologies: Java, Cygwin and SSH.

Java: HBase depends on the Java Platform, Standard Edition 6.

Cygwin: Cygwin is probably the oddest technology in this solution stack. To support installation, the setup.exe utility uses two directories on the target system. Make sure you have Administrator privileges on the target system.

HBase Configuration

Parallel Processing on the Pi (Bramble)

Parallel processing on the Raspberry Pi is possible, thanks to the ultra-portable MPICH2 (an implementation of MPI, the Message Passing Interface). I was keen to try this out as soon as I managed to get hold of two of these brilliant little computers (yes, I'm a lucky boy). Here I'm going to show how I managed to get it all working and will display the results. (Bramble was a name an ingenious Raspberry Pi forum member made up, not myself!)

There are three ways in which you can install MPICH2 (in case one doesn't seem to work for you): compiling and installing from source, my .deb package followed by the rest of the tutorial, or the Python script file. Installing from source takes a while on the little Pi when not cross-compiling.

Install - Choose Method 1, 2 or 3

1) Simply download the script with wget, then run it as root with sudo python, and follow the instructions on the screen, entering all of the necessary info.
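MPICH2 itself is a C library, so a faithful example would use MPI bindings. As a conceptual sketch only (all names here are made up, and this uses Python's stdlib multiprocessing rather than MPI), the send/gather pattern an MPI program on the bramble would use looks roughly like this:

```python
# Conceptual sketch only: MPI-style message passing emulated with the
# Python stdlib. The tutorial installs MPICH2, a real MPI implementation;
# this merely illustrates the send/receive pattern such a program uses.
from multiprocessing import Pipe, Process

def worker(conn, rank, chunk=100):
    # Each "rank" computes a partial sum and sends it back,
    # analogous to MPI_Send in an MPICH2 program.
    partial = sum(range(rank * chunk, (rank + 1) * chunk))
    conn.send((rank, partial))
    conn.close()

def gather(num_ranks=4):
    # "Rank 0" launches the workers and collects the partial sums,
    # analogous to MPI_Reduce with MPI_SUM.
    pipes, procs = [], []
    for rank in range(num_ranks):
        parent, child = Pipe()
        p = Process(target=worker, args=(child, rank))
        p.start()
        pipes.append(parent)
        procs.append(p)
    total = sum(conn.recv()[1] for conn in pipes)
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(gather(4))  # equals sum(range(400))
```

On a real bramble the ranks would live on different Pis, with MPICH2 moving the messages over the network instead of a local pipe.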

Robots.txt Tutorial

How to Create Robots.txt Files: Use our Robots.txt generator to create a robots.txt file.

Analyze Your Robots.txt File: Use our Robots.txt analyzer to analyze your robots.txt file today. Google also offers a similar tool inside of Google Webmaster Central, and shows Google crawling errors for your site.

Example Robots.txt Format

Allow indexing of everything:
User-agent: *
Disallow:
or
User-agent: *
Allow: /

Disallow indexing of everything:
User-agent: *
Disallow: /

Disallow indexing of a specific folder:
User-agent: *
Disallow: /folder/

Disallow Googlebot from indexing a folder, except for allowing the indexing of one file in that folder:
User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

Background Information on Robots.txt Files: Robots.txt files inform search engine spiders how to interact with indexing your content. When you block URLs from being indexed in Google via robots.txt, Google may still show those pages as URL-only listings in its search results.

Crawl Delay
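The folder-with-one-exception recipe above can be sanity-checked with Python's stdlib parser. One caveat (an implementation detail of the stdlib, not of the recipe): Google resolves Allow/Disallow conflicts by longest match, but urllib.robotparser applies rules in file order, so the Allow line is placed first here. The example.com URLs are placeholders:

```python
# Sketch: verifying the Googlebot folder/exception recipe with the
# stdlib parser. Note the Allow line comes first, because this parser
# applies rules in file order (first match wins).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Allow: /folder1/myfile.html",
    "Disallow: /folder1/",
])

print(rp.can_fetch("Googlebot", "http://example.com/folder1/other.html"))   # False
print(rp.can_fetch("Googlebot", "http://example.com/folder1/myfile.html"))  # True
```

Running your intended rules through a parser like this before deploying them catches ordering and typo mistakes (such as the "Disawllow" misspelling this kind of tutorial warns about) that would otherwise silently change what gets crawled.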

scrollorama

Disclaimer: This is an experimental, just-for-fun sort of project and hasn't been thoroughly tested. Design and build your site, dividing your content into blocks. Embed scrollorama.js after jQuery and initialize the plugin, passing the blocks class selector as a parameter. Target an element and animate its properties. The animation parameters you can use are: Hook into the onBlockChange event:

scrollorama.onBlockChange(function() {
    alert('You just scrolled to block#' + scrollorama.blockIndex);
});

Note: If you are not using the pinning feature, it is recommended you disable it.

Raspberry Pi Weather Station for schools

When I first joined the Raspberry Pi Foundation, over a year ago now, one of my first assignments was to build a weather station around the Raspberry Pi. Thanks to our friends at Oracle (the large US database company), the Foundation received a grant not only to design and build a Raspberry Pi weather station for schools, but also to put together a whole education programme to go with it. Oracle were keen to support a programme where kids get the opportunity to partake in cross-curricular computing and science projects that cover everything from embedded IoT, through networking protocols and databases, to big data. The goals of the project were ambitious: between us we wanted to create a weather experiment where schools could gather and access weather data from over 1000 weather stations from around the globe. If you've been on Twitter a lot you'll have noticed me teasing this since about March last year. This seemed like a good enough spread of data. Below is my second attempt.