
CommonCrawl

http://commoncrawl.org/


Free People Search, People Finder and People Locator: Spokeo People Search
Spokeo is a people search engine that organizes vast quantities of white-pages listings, social information, and other people-related data from a large variety of public sources. Its mission is to help people find and connect with others more easily than ever. The public data is presented almost instantly in an integrated, coherent, and easy-to-follow format. We really like this one! We recommend you try out the Spokeo free trial to get a feel for the massive amount of information you can get on people.

Nonprofit Common Crawl Offers a Database of the Entire Web, For Free, and Could Open Up Google to New Competition
Google famously started out as little more than a more efficient algorithm for ranking Web pages. But the company also built its success on crawling the Web: using software that visits every page in order to build up a vast index of online content. A nonprofit called Common Crawl is now using its own Web crawler to make a giant copy of the Web, and it makes that copy accessible to anyone. The organization offers up over five billion Web pages, available for free, so that researchers and entrepreneurs can try things otherwise possible only for those with access to resources on the scale of Google's. "The Web represents, as far as I know, the largest accumulation of knowledge, and there's so much you can build on top," says entrepreneur Gilad Elbaz, who founded Common Crawl. "But simply doing the huge amount of work that's necessary to get at all that information is a large blocker; few organizations … have had the resources to do that."
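Common Crawl's archive can be queried without downloading it: each monthly crawl publishes a CDX index with a simple HTTP API. A minimal sketch of building and running such a query; the crawl name `CC-MAIN-2023-50` is an assumption (crawl names change monthly, so check index.commoncrawl.org for the current list):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def cdx_query_url(url_pattern, crawl="CC-MAIN-2023-50"):
    """Build a query URL for the Common Crawl CDX index API.
    `crawl` names one monthly crawl; this default is illustrative."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl}-index?{params}"

def lookup(url_pattern, crawl="CC-MAIN-2023-50"):
    """Fetch index records for matching URLs (one JSON object per line).
    Each record points at the WARC file and offset holding the page."""
    with urlopen(cdx_query_url(url_pattern, crawl)) as resp:
        return [json.loads(line) for line in resp.read().splitlines()]
```

For example, `lookup("example.com/*")` would return one record per captured page under that host, which is how researchers locate pages of interest before pulling only the relevant slices of the archive.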

Day's blog - Yahoo! Pipes Tutorial - An example using the Fetch Page module to make a web scraper
Yahoo! recently released a new Fetch Page module which dramatically increases the number of useful things that Pipes can do. With this new "pipe input" module we're no longer restricted to working with well-organised data sets in supported formats such as CSV, RSS, Atom, XML, JSON, iCal or KML. Now we can grab any HTML page we like and use the power of the Regex module to slice and dice the raw text into shape.

TRAVIC - Transit Visualization Client
This tracker provides movement visualization of transit data published by transit agencies and operators from all over the world. The movements are mostly based on static schedule data; wherever real-time data is available, it is also included in the visualization. TRAVIC is based on a master's thesis project by Patrick Brosi. For background information on how TRAVIC is done, you may check our blog. All data shown may be subject to the terms of others, usually the transport agency or operator publishing the data.
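The Fetch Page plus Regex workflow from the Pipes tutorial above (grab raw HTML, then slice it with a regular expression) can be sketched in a few lines of plain Python; the `<h2>` pattern here is an illustrative assumption, not something from the tutorial itself:

```python
import re
from urllib.request import urlopen

# Non-greedy match for heading contents; re.S lets "." span newlines.
H2_PATTERN = re.compile(r"<h2[^>]*>(.*?)</h2>", re.S)

def extract_h2_titles(html):
    """Slice <h2> headings out of raw HTML with a regex, the same
    slice-and-dice idea as feeding Fetch Page output into Regex."""
    return [m.strip() for m in H2_PATTERN.findall(html)]

def scrape_h2_titles(url):
    """Fetch any HTML page and extract its <h2> headings."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    return extract_h2_titles(html)
```

A regex is fine for quick slicing like this; for anything structural or nested, a real HTML parser is the safer choice.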

WebSummarizer
WebSummarizer is a web-based application specializing in the automatic summarization and visualization of web pages, documents and plain text. WikiSummarizer, a module of WebSummarizer, is a web-based application specializing in the automatic summarization of Wikipedia articles. An integral part of WikiSummarizer is the Wikipedia Knowledge Base, which contains summaries of over 3 million Wikipedia articles and provides about 5 million keywords for instant access, discovery, visualization and downloading.

100 Time-Saving Search Engines for Serious Scholars (Revised)
Back in 2010, we shared with you 100 awesome search engines and research resources in our post: 100 Time-Saving Search Engines for Serious Scholars. It's been an incredible resource, but now it's time for an update. Some services have moved on, others have been created, and we've found some new discoveries, too.
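Extractive summarizers of the kind WebSummarizer describes typically score sentences and keep the highest-scoring few. A minimal frequency-based sketch of that idea; this is our own illustration, not WebSummarizer's actual algorithm:

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Naive extractive summary: score each sentence by the total
    document-wide frequency of its words, then return the top-scoring
    sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Rank sentence indices by descending score.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w] for w in re.findall(r"\w+", sentences[i].lower())),
    )
    keep = sorted(ranked[:n_sentences])  # restore reading order
    return " ".join(sentences[i] for i in keep)
```

Real systems add stop-word removal, position weighting, and graph-based scoring on top of this skeleton, but the score-and-select structure is the same.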

Your company doesn't own your Internet presence
PaidContent published an absolutely absurd piece yesterday on the departure of New York Times assistant managing editor Jim Roberts, who happens to be quite popular on Twitter, the real-time microblogging platform. Jim -- a lovely fellow by any measure -- has some 75,000 followers. The premise is that the New York Times, or any other employer, could very well claim a person's following as a tool of doing business, no different from your corporate laptop or the proprietary documents it holds. Oh, hell no. Allow me to count the ways in which this is ridiculous: legally, Twitter doesn't own your tweets -- but it does own its user accounts.

Gutenberg Project
Note: we also have offline book catalogs to download and use at home. Browse by Author, Title, Language or Recently Posted. Our browse pages are ideal for viewing what's in the collection if you are as yet undecided on what you want to read. The recently posted pages list which new books were added or updated most recently.

Web mining
Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis target, web mining can be divided into three different types: Web usage mining, Web content mining and Web structure mining.
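Of the three types, Web structure mining is the easiest to sketch concretely: it operates on the hyperlink graph between pages. A toy PageRank-style score over an adjacency dict, as a textbook illustration of structure mining (not tied to any specific tool above):

```python
def pagerank(links, damping=0.85, iters=50):
    """Toy PageRank over a dict mapping page -> list of outgoing links.
    Assumes every link target is also a key; dangling pages (no
    outlinks) spread their rank uniformly over all pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p in pages:
            out = links[p] or pages  # dangling page: link to everyone
            share = damping * rank[p] / len(out)
            for q in out:
                new[q] += share
        rank = new
    return rank
```

For example, `pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})` ranks `a` above `c`, since `a` receives links while `c` only gives them; the scores always sum to 1.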

10: Journals on LexisNexis AU - Secondary Legal Materials Lesson - Subject Guides at Murdoch University
Finding a journal article on LexisNexis AU:
Step One: Go to the Databases link on the Library home page.
Step Two: Choose 'L' from the alphabetic index.
Step Three: Select LexisNexis AU.
Step Four: Open the link to LexisNexis AU.

OpenSubtitles
A collection of documents from If you use the OpenSubtitle corpus, please add a link to to your website and to your reports and publications produced with the data! I got the data under this condition!
30 languages, 361 bitexts
total number of files: 20,400
total number of tokens: 149.44M
total number of sentence fragments: 22.27M
Please cite the following article if you use any part of the corpus in your own work: Jörg Tiedemann, 2009, News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In N. Nicolov and K.
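OPUS corpora such as OpenSubtitles ship in several formats; one common one is line-aligned plain text (Moses format), with one sentence per line in each language's file. A small sketch of pairing and counting such a bitext; the format assumption is ours, so check the OPUS download page for what each corpus actually provides:

```python
def align_bitext(src_lines, tgt_lines):
    """Pair line-aligned source/target sentences, dropping pairs where
    either side is empty (a common cleanup step for Moses-format bitext)."""
    pairs = []
    for s, t in zip(src_lines, tgt_lines):
        s, t = s.strip(), t.strip()
        if s and t:
            pairs.append((s, t))
    return pairs

def token_count(pairs):
    """Total whitespace-separated tokens across both sides; the kind of
    statistic reported in the corpus description above."""
    return sum(len(s.split()) + len(t.split()) for s, t in pairs)
```

Summing these counts over every file pair in a corpus is how totals like "149.44M tokens" are produced.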
