Web Archiving

What the Web Said Yesterday. Malaysia Airlines Flight 17 took off from Amsterdam at 10:31 A.M. G.M.T. on July 17, 2014, for a twelve-hour flight to Kuala Lumpur. Not much more than three hours later, the plane, a Boeing 777, crashed in a field outside Donetsk, Ukraine. All two hundred and ninety-eight people on board were killed. The plane’s last radio contact was at 1:20 P.M. G.M.T.

The Why and What of Web Archives.

List of Web archiving initiatives. Map of Web archiving initiatives worldwide in March, 2012.

This page contains a list of Web archiving initiatives worldwide.

Common Crawl. New Crawl Data Available! We are very pleased to announce that new crawl data is now available!

The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed). We’ve made some changes to the data formats and the directory structure. Please see the details below and please share your thoughts and questions on the Common Crawl Google Group. We have switched from ARC files to WARC files to better match what the industry has standardized on. WARC files allow us to include HTTP request information in the crawl data, add metadata about requests, and cross-reference the text extracts with the specific response that they were generated from. We have switched the metadata files from JSON to WAT files.

Perma.cc.

Web Archiving Blog Roundup. The following is a guest post by Abbie Grotke, Library of Congress Web Archiving Team Lead. While organizations have been archiving the web since the mid-1990s, it’s only in the last few years that there’s been a surge in web archivists speaking out about issues they encounter, uses of archives, and innovations in tools and technologies.

Readers interested in web archiving may have noticed that my colleagues and I blog about it here whenever we get the chance: information about our Library of Congress Web Archives, collaborations we’re involved in, technical issues related to archiving, and more. Not only are our fellow members of the International Internet Preservation Consortium blogging more these days about their work, but so are others in the community. For those who might be interested in learning more about what other organizations are doing, I thought I’d do a roundup of blogs we’re reading. Internet Archive Blogs: Internet Archive’s blog covers a variety of topics about the IA.
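The Common Crawl announcement above notes that WARC files carry metadata alongside each captured response. As a rough illustration only, the following Python sketch parses a single uncompressed WARC record held in memory. The sample record, its URL and record ID, and the helper name `parse_warc_record` are invented for this example and do not come from Common Crawl's data or tooling. Real crawl files hold many records and are usually gzip-compressed, which this sketch does not handle.

```python
def parse_warc_record(raw: bytes) -> tuple[str, dict[str, str], bytes]:
    """Split one uncompressed WARC record into version, named fields, and content block.

    Hypothetical helper for illustration; field names are kept exactly as written.
    """
    # The record header ends at the first blank line (CRLF CRLF).
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line terminating the WARC header")

    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record: {version!r}")

    # Each remaining header line is "Name: value"; split on the first colon only,
    # because values such as URLs and dates contain colons themselves.
    fields: dict[str, str] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()

    # Content-Length gives the size in bytes of the content block that follows.
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]


# A made-up "response" record wrapping a tiny HTTP response (assumed sample data).
http_block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2013-06-01T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: " + str(len(http_block)).encode("ascii") + b"\r\n"
    b"\r\n" + http_block + b"\r\n\r\n"
)

version, fields, block = parse_warc_record(sample)
print(version, fields["WARC-Type"], fields["WARC-Target-URI"], len(block))
```

The named fields (`WARC-Type`, `WARC-Target-URI`, `WARC-Date`) are the kind of per-record metadata the announcement refers to; the content block holds the captured HTTP message itself.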

On the value of registering a domain name in the trade and companies register. Decree No. 2012-928 of July 31, 2012, which entered into force on September 1, 2012 (1), now allows a merchant who is a natural person “to also declare the domain name of his or her website” (2) and legal entities “to also declare the domain name of their website or websites” (3) in the Registre du commerce et des sociétés (RCS).

Web Archiving Bucket.

Archiving and Recovering Database-driven Websites.

Designing Preservable Websites, Redux. The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group at the Library of Congress.

As much as we can do to preserve archived websites once we have them, the challenges we encounter are always already determined by how those websites were originally constructed. In the interest of giving us and others the best possible chance of preserving your online content, I wanted to follow on an excellent blog post by Robin Davis of the Smithsonian Institution Archives on the topic of designing preservable websites.

Five Tips for Designing Preservable Websites. A few of the Smithsonian’s many websites.

Here at the Smithsonian Institution Archives, we take pride in preserving the Institution’s history, including its sizable web presence. While various offices at the Smithsonian create and back up the contents of their websites, the Archives also crawls each website using Heritrix, an open-source tool created by the Internet Archive, to capture content in an archival format.

The Internet finally has its museum. Freely accessible 24 hours a day, The Big Internet Museum works exactly like a traditional museum, with curators, permanent collections, temporary exhibitions, several departments and, more surprisingly, a shop (under construction for now).

The museum's archives are presented as a timeline from 1969 to the present day.

Information Semantic Integration through Communities of Intelligence onLine.

Thoughts and projects about web archiving, data preservation and more…

PageFreezer.com Website Archiving & Social Media Archiving.

Automated screenshots and snapshots of web pages.

Web archiving, presentation at the Collège de France.

LIBER 2012. Dr Ulrich Schäfer, German Research Center for Artificial Intelligence (DFKI) (Germany). The Searchbench - Combining Sentence-Semantic, Full Text and Bibliographic Search in Digital Libraries. There is search beyond full-text and metadata search!

We describe a novel approach to searching in digital libraries. It is based on sentence-wise syntactic and semantic analysis of native digital or scanned PDF documents. To showcase the usefulness of our approach, we have applied it to a collection of 22,500 open access research papers in language technology. Because the technology is domain-independent, it could also be applied to newspaper texts, technical documentation or other scientific publications. The Searchbench user interface is a web application running in every modern, JavaScript-enabled web browser. Optionally, negative statements can be excluded from the search results, e.g. for the former example, ‘method does not reduce noise’ would be eliminated.

[m]ignify – infinite data sourcing. Mignify is the concrete expression and achievement of one of our major dreams: to associate the magnitude of the Web with the potential of mining it.

Mignify is an open and neutral platform for infinite data sourcing. It is the place to crawl, extract and compute data at large scale. It was set up in response to concrete needs expressed by research centers in Europe and by Web intelligence and Web mining companies, which have received the project well. The platform, still in beta, has been improved through continuous development and has been used in several pilots, including a prestigious European search engine and several technology companies.

Follow us on Twitter @mignify.

iPRES 2012. You can use your mobile device at iPres2012 to track Schedules, Events and Speakers. Monday (Day 1) Workshop 1: Spanning the boundaries of digital training and education. Workshop 2: Towards practical emulation tools and strategies - state of the research meets real-world requirement.

Archives: the eternal preservation of the Internet. Who preserves the treasures of the Internet, and how? Archivists and libraries are looking for strategies, because file formats and playback devices have ever shorter lifespans.

IIPC

WARC.

Legal deposit of the web.

BnF study on the Internet archives. This forward-looking study of the perceptions and expectations of potential users has the merit of presenting the principle of the archives and how they work, and of highlighting a number of pertinent questions. For many professionals (monitoring, e-reputation, communication, marketing), the web is a field of research. Starting from this observation, what sense does it make to consider archiving the web? Is it possible?

Preservation.

D-Lib Magazine: special issue on web archiving and its use by researchers.

About the Project: DataUp. In today’s technologically advanced world, the data generated by researchers is increasingly born digital and subject to intensive transformation and analyses before publication.

The various file formats, software, and hardware required to succeed in the modern research landscape can become daunting, especially since education about digital data management has not kept pace with these technological advancements. There is a significant gap between the data management skills needed by modern researchers and their current abilities; the gap is more noticeable given the current increase in funder requirements for data management plans and requests by journal publishers to make supporting data publicly available alongside traditional research articles.

The DataUp project was born out of this need for seamless integration of data management into the researchers’ current workflows.

The three faces of the digital archive. The other day I was in Rennes for a doctoral school seminar devoted to the question of the archive, in the excellent company of Nicolas Thély. I tried to show what place the new dynasties of the archive occupy today, and how they are building new memory industries, relying mainly on technologies of remembrance.

HUMMEL AND KEMP: Today's digital documents are tomorrow's dinosaurs. Shortly after Thanksgiving, President Obama ordered federal agencies to “improve the management of federal records” and embrace a “digital-based records-keeping system.” If adopted, these changes would be the most significant in record keeping since President Truman was in office. While the move from filing cabinets and paper to a form of digital preservation is important and necessary, it is much more complex than it sounds. The initiative, if it is not done right, could unintentionally be a hugely wasteful spending exercise, and vital information, records and assets could be lost forever.

Think about the daunting nature of digitizing the federal universe.

Archiveteam.

Internet Archaeology.

ArchiveReady.com: Website Archivability Testing Tool.