
Open Source Software
Everyone uses web crawlers—indirectly, at least! Every time you search the Internet using a service such as Alta Vista, Excite, or Lycos, you're making use of an index that's based on the output of a web crawler. Web crawlers—also known as spiders, robots, or wanderers—are software programs that automatically traverse the Web. Search engines use crawlers to find what's on the Web; then they construct an index of the pages that were found. You'd like to build a special-purpose index—for example, one that has some understanding of the content stored in multimedia files on the Web.
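The traversal described above can be sketched as a breadth-first loop over a frontier of URLs. This is only a minimal illustration, not how any particular search engine works: the class name `MiniCrawler`, the `fetcher` callback, and the naive `href` pattern are all assumptions made for this example. A real crawler would also honor robots.txt, rate-limit its requests, and use a proper HTML parser.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiniCrawler {
    // Naive pattern for quoted href attributes (assumption: good enough for a demo,
    // not for real-world HTML). The fragment part after '#' is not captured.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute form of every href found in the page. */
    public static List<String> extractLinks(String baseUrl, String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(URI.create(baseUrl).resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not syntactically valid URIs.
            }
        }
        return links;
    }

    /**
     * Breadth-first crawl starting at startUrl. The fetcher maps a URL to its
     * HTML, or to null when the page cannot be retrieved. Returns every URL
     * that was attempted, in visiting order.
     */
    public static Set<String> crawl(String startUrl, int maxPages,
                                    Function<String, String> fetcher) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(startUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already seen via another link
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // fetch failed; nothing to extract
            }
            for (String link : extractLinks(url, html)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

Passing the fetcher in as a function keeps the traversal logic separate from the network code, so the loop can be exercised against an in-memory map of pages before it is pointed at a live site.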
Writing a Web Crawler in the Java Programming Language
How to write a multi-threaded webcrawler in Java
Just as the Internet has spread at breakneck speed, search engines have had to develop just as quickly. The whole Net would be worth only half as much without search engines and the ability they give us to find what is where. Probably the best-known search engine is...
BotSpot 2005 ®: the spot for all bots
WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically. The Crawler Workbench is a graphical user interface that lets you configure and control a customizable web crawler. Using the Crawler Workbench, you can:
WebSPHINX: A Personal, Customizable Web Crawler
Java tip: How to get a web page | Nadeau Software
Technologies: Java 5+ The starting point for building a link checker, web spider, or web page analyzer is, of course, to get the web page from the web server. Java's java.net package includes classes to manage URLs and to open web server connections. This tip shows how to use them to get a text, image, audio, or data file from a web server.
Capturing Screen in Java, Capture Screen Shot, How to Capture Screen Using Java Swing
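As a rough illustration of fetching a web page with the java.net classes mentioned in the Nadeau Software tip, the sketch below downloads a text resource. It is not the code from that tip: the class name `PageFetcher`, the 10-second timeouts, and the UTF-8 fallback are assumptions chosen for this example, and it uses Java 7+ syntax (try-with-resources). It handles text only; binary files such as images would be written out as raw bytes instead of being decoded.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class PageFetcher {
    /** Picks the charset named in a Content-Type header, falling back to UTF-8. */
    public static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8; // assumption: UTF-8 is a reasonable default
    }

    /** Reads every byte from the stream and decodes it with the given charset. */
    public static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    /** Opens a connection to the address and returns the response body as text. */
    public static String fetch(String address) throws IOException {
        URLConnection connection = URI.create(address).toURL().openConnection();
        connection.setConnectTimeout(10_000); // milliseconds; arbitrary choice
        connection.setReadTimeout(10_000);
        try (InputStream in = connection.getInputStream()) {
            return readAll(in, charsetOf(connection.getContentType()));
        }
    }
}
```

A call such as `PageFetcher.fetch("http://example.com/")` would return the page source as a string, where the address is only a placeholder.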
HTML Parser - HTML Parser
Welcome to the homepage of HTMLParser - a super-fast real-time parser for real-world HTML. What has attracted most developers to HTMLParser has been its simplicity in design, speed and ability to handle streaming real-world HTML. The two fundamental use-cases that are handled by the parser are extraction and transformation (the synthesis use-case, where HTML pages are created from scratch, is better handled by other tools closer to the source of data). While prior versions concentrated on data extraction from web pages, Version 1.4 of the HTMLParser has substantial improvements in the area of transforming web pages, with simplified tag creation and editing, and verbatim toHtml() method output. In general, to use the HTMLParser you will need to be able to write code in the Java programming language.
In this article, I guide you through the steps involved in designing a utility to download a Website. This utility downloads only text and image files, but it can easily be extended to download files of any type. At the end of the article I'll provide tips on how you can extend the utility. An absolute URL -- such as http://java.sun.com/products/jdk1.2 -- has all the components required to identify the resource on the Web. In relative URLs, the protocol and the machine name are inherited from the base URL embedded in the document (base tag) or from the URL used to retrieve the document.
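The way a relative URL inherits its missing parts from a base URL can be shown with the standard java.net.URI class. This is a small sketch rather than the article's own code: the class name `UrlResolver` and the sample paths under java.sun.com are made up for illustration.

```java
import java.net.URI;

public class UrlResolver {
    /**
     * Resolves a reference found in a document against the document's base URL.
     * Absolute references are returned unchanged; relative ones inherit the
     * scheme, host, and (where applicable) directory of the base.
     */
    public static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).normalize().toString();
    }

    public static void main(String[] args) {
        // Hypothetical base URL of the document being processed.
        String base = "http://java.sun.com/products/jdk1.2/index.html";
        System.out.println(resolve(base, "docs/guide.html"));
        System.out.println(resolve(base, "../images/logo.gif"));
        System.out.println(resolve(base, "/index.html"));
        System.out.println(resolve(base, "http://example.com/a.html"));
    }
}
```

When a page contains a base tag, its href value would be passed as `base` instead of the URL the page was retrieved from.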
Download a Website for offline browsing - JavaWorld
HTTrack Website Copier - Offline Browser
This is Vivalogo's list of best free, downloadable, open source social networking software / scripts (kinda hard to say all these words :) ). SocialEngine is social networking software powered by PHP and Zend. The script lets you easily create your own social network or online community. Includes customizable groups, photo albums, messaging, member profiles, videos, news feeds, a drag-and-drop CMS, and more. iSocial is a free social networking script platform that allows you to create your own Friendster- and Orkut-like sites.
Top 40 Free Downloadable Open Source Social Networking Software
Screen Capture Tools: 40+ Free Tools and Techniques
Screen capture, or print screen, is perhaps the most efficient way to share whatever appears on your desktop. It helps tech users like us share and communicate better with friends and peers. Major operating systems today come with basic screen capture and print screen functions, but if these can’t fulfill what you need from a screen capture, then you are probably looking for a screen capturing tool. Screen capturing tools do what the basic tools don’t. What these tools can do varies, including the ability to add sketches and text, instantly upload images online, capture audio, capture specific dimensions, and more.
Open Source Windows - Downloadpedia
The promise of open source software is best quality, flexibility and reliability. This is the updated list of the best open source software. The only way to have TRUE "Open Source Windows" is to have all equivalent native Windows programs uninstalled and removed. Most Popular Azureus - implements the BitTorrent protocol using Java and comes bundled with many features: Multiple torrent downloads, upload and download speed limiting, both globally and per torrent, advanced seeding rules, adjustable disk cache, only uses one port for all the torrents, UPnP sets the forward on your router and more…
Java is a trademark or registered trademark of Sun Microsystems, Inc. in the United States and other countries. This site is independent of Sun Microsystems, Inc.
Open Source Crawlers in Java - Heritrix
WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for Web crawlers that browse and process Web pages automatically. Web-Harvest is an open source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions.
Open Source Crawlers in Java
Back again for yet another list of nifty tools written in open source Java. So while using a proxy or an automated crawler for information, you'll need to do some interpretation and cleansing of the incoming data. A little googling reveals a few interesting projects that may help us in that area. TagSoup - a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.

