
Crawler


LinkExaminer. If you have your own website, then I'm sure you've run into the problem of broken links - whether they are links off of your site that are no longer there, or something that you moved around internally and forgot to update on an old page. Either way, updating and keeping tabs on a large site can be a real nightmare - and that's where AnalogX LinkExaminer comes in! AnalogX LinkExaminer at its core is a link checker: it goes through each and every page (assuming you have it set to) and parses the HTML in order to extract the links on the page. While it's parsing the page, it can also perform other checks, from simple tasks like extracting the page title, to SEO analysis, to more advanced tasks like identifying pages with high similarity to other pages. Once it's done, you can go over the results in the GUI, then export to a variety of formats including CSV and Google-compatible XML sitemaps.
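The parsing step described above (walk a page's HTML, pull out its links and its title) can be sketched with Python's standard library. This is only an illustration of the general technique, not LinkExaminer's actual implementation; the names `LinkExtractor` and `extract` are hypothetical, and treating every `href` and `src` attribute as a link is an assumption made for brevity.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the page title and every href/src value found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        for name, value in attrs:
            # Assumption: any href or src attribute counts as a link.
            if name in ("href", "src") and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(html, base_url):
    """Return (title, absolute_links) for one HTML document."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

A real checker would then request each extracted URL and record the ones that fail; that network step is left out here so the sketch stays self-contained.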

Welcome to Apache Nutch™. 17 March 2014 - Apache Nutch v1.8 Released. The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.8; we advise all current users and developers of the 1.X series to upgrade to this release. Although this release includes library upgrades to Crawler Commons 0.3 and Apache Tika 1.4, it also provides over 30 bug fixes as well as 18 improvements. Please see the list of changes for a full breakdown, or see the release report. As usual in the 1.X series, this release is made available both as source and binary.

Additionally, developers can find Maven artifacts within Maven Central. 02 July 2013 - Apache Nutch v2.2.1 Released. The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v2.2.1; we advise all current users and developers of the 2.X series to upgrade to this release ASAP. 24 June 2013 - Apache Nutch v1.7 Released. The Apache Nutch PMC are extremely pleased to announce the immediate release of Apache Nutch v1.7.

Training: ScraperWiki - a tool built around a community that shares its scrapers, based on Scrapy. Search Engine Optimization Toolkit. Overview: The IIS Search Engine Optimization (SEO) Toolkit helps Web developers, hosting providers, and Web server administrators to improve their Web site's relevance in search results by recommending how to make the site content more search engine-friendly. The IIS SEO Toolkit includes the Site Analysis module, the Robots Exclusion module, and the Sitemaps and Site Indexes module, which let you perform detailed analysis and offer recommendations and editing tools for managing your Robots and Sitemaps files. Improve the volume and quality of traffic to your Web site from search engines: The Site Analysis module allows users to analyze local and external Web sites with the purpose of optimizing the site's content, structure, and URLs for search engine crawlers.

In addition, the Site Analysis module can be used to discover common problems in the site content that negatively affect the site visitor experience. Control how search engines access and display Web content. Site Analysis Features. Understanding Site Analysis Reports : IIS Search Engine Optimization Toolkit : Hosting Applications on IIS 7. IIS Site Analysis is a tool within the IIS Search Engine Optimization Toolkit that can be used to analyze Web sites with the purpose of optimizing the site's content, structure, and URLs for search engine crawlers.

In addition, the tool can be used to discover and fix common problems in site content that negatively affect the site user experience. The IIS Site Analysis tool includes a user interface that offers a comprehensive set of pre-built reports for Search Engine Optimization (SEO) and displays content-specific problems found during analysis. The IIS Site Analysis tool also lets you create custom queries on the data that was gathered during the analysis. Getting the Site Analysis Reports: To perform an analysis of your Web site, follow these steps: Launch IIS Manager. Select the Server node or a site node in the tree view on the left, and then choose the "Search Engine Optimization" feature. Open the "Site Analysis" feature. Navigating Through the Site Analysis Reports: Summary Page. How To Install and Use IIS Search Engine Optimization Toolkit (IIS 7.0). We think the IIS Toolkit is absolutely awesome! This post will provide a step-by-step guide on how to install the powerful IIS 7.0 toolkit from Microsoft, and show you some of the many cool features which can open up a whole new world for extracting information from a website (we are talking about a Xenu's Link Sleuth beater here!).

Please note: IIS 7.0 is only compatible with Windows Vista or Windows 7. The program is quite simple to install, but the process certainly isn't one of the most obvious, and when you need a helpful guide there isn't much about, so here's something to help get you started… 1. By downloading and installing the Microsoft Web Platform Installer, the setup process of IIS Toolkit becomes a lot easier, so this is a good starting point. 2. Once the Web Platform Installer has downloaded and installed, navigate over to the Microsoft SEO Toolkit page and click on 'install using web platform installer' in the download extension box in the right-hand column: 3.

Image credits: Lumaxart. IIS SEO Toolkit Secrets You Might Not Know. If you haven't installed IIS toolkit yet, then what are you waiting for? Go! Now! Install IIS toolkit! This is truly a genius tool, and the amount of data that can be extracted is immense. There are a few hidden doors to this tool, which provide you with more powerful options to manipulate the site analysis data beyond the standard reports and queries. Unlocking custom queries: IIS toolkit has a custom query section which you can use for a deep dive into more specific data. For this blog post I will be running over three different custom queries, including identifying: internal anchor text popularity, internal link count, and 304 not modified. Let's kick off with finding the different anchor texts used internally throughout a website and sorting these to display the most popular first.
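Outside the toolkit, the same idea (count the anchor texts used on internal links and list the most popular first) can be sketched in a few lines of Python. This is not the toolkit's query engine; the function name `anchor_text_popularity` and the input shape, a list of `(target_url, anchor_text)` pairs, are assumptions for illustration, as is the choice to compare anchor texts case-insensitively.

```python
from collections import Counter
from urllib.parse import urlparse


def anchor_text_popularity(links, site_host):
    """Count anchor texts on links whose target is on site_host.

    links is an iterable of (target_url, anchor_text) pairs; the result
    is a list of (anchor_text, count) sorted with the most popular first.
    """
    counts = Counter(
        text.strip().lower()  # assumption: ignore case and outer spaces
        for target, text in links
        if urlparse(target).netloc == site_host and text.strip()
    )
    return counts.most_common()
```

Links pointing at other hosts are skipped, which is one simple way to read "internal"; a site served from several hostnames would need a looser test.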

Open up the 'Query' drop down menu shown in the screen shot above, and you will notice there are 4 different queries, including a 'New Query' option allowing you to utilise custom queries. How to create a query. Blogs. In this blog we are going to write an example of how to extend the SEO Toolkit functionality. For that, we are going to pretend our company has a large Web site that includes several images, and now we are interested in making sure all of them comply with a certain standard: let's say all of them should be smaller than 1024x768 pixels and the quality of the images should be no less than 16 bits per pixel.
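The image standard just described boils down to two checks per image. The following Python sketch shows only that rule, not the toolkit's extension API; the names `image_violations`, `MAX_WIDTH`, `MAX_HEIGHT`, and `MIN_BITS_PER_PIXEL` are hypothetical, and it assumes that an image of exactly 1024x768 pixels is still acceptable.

```python
MAX_WIDTH = 1024          # pixels, from the example standard
MAX_HEIGHT = 768          # pixels, from the example standard
MIN_BITS_PER_PIXEL = 16   # minimum acceptable quality


def image_violations(width, height, bits_per_pixel):
    """Return a list of human-readable violations for one image."""
    violations = []
    # Assumption: exactly 1024x768 passes; anything larger is flagged.
    if width > MAX_WIDTH or height > MAX_HEIGHT:
        violations.append("image larger than 1024x768")
    if bits_per_pixel < MIN_BITS_PER_PIXEL:
        violations.append("pixel format below 16 bits per pixel")
    return violations
```

The width, height, and bits per pixel would come from whatever image library reads the crawled file; they are passed in as plain numbers here to keep the rule itself easy to see.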

Additionally, we would also like to be able to make custom queries that can later allow us to further analyze the contents of the images and filter based on directories and more. For this we will extend the SEO Toolkit crawling process to perform the additional processing for images. We will be adding the following new capabilities: Capture additional information from the Content. In this case we will capture information about the image; in particular, we will extend the report to add an "Image Width", an "Image Height" and an "Image Pixel Format". Flag additional violations. Enter CrawlerModule. Screaming Frog SEO Spider Tool & Crawler Software. About The Tool: The Screaming Frog SEO Spider is a fast and advanced SEO site audit tool. It can be used to crawl both small and large websites, where manually checking every page would be extremely labour intensive, and where you can easily miss a redirect, missing page title, or duplicate page issue.

You can view, analyse and filter the crawl data as it's gathered and updated in real-time in the app's UI. The SEO Spider allows you to export key onsite SEO elements (URL, page title, meta description, headings etc) to a spreadsheet, so it can easily be used as a base for SEO recommendations. Check out our demo video above. Crawl 500 URLs For Free: The 'lite' version of the tool is free to download and use. For just £199 per year you can purchase a licence, which removes the 500 URL crawl limit, allows you to save crawls, and opens up the spider's configuration options and advanced features.

FAQ & User Guide: For more guidance and tips on how to use the Screaming Frog SEO crawler – Updates.