
Google: The largest linguistic corpus of all time

When I was a student in the late 1970s, I would never have dared to imagine, even in my wildest dreams, that the scientific community would one day have the means to analyze digitized text corpora of several hundred billion words. At the time, I marveled at the Brown Corpus, which contained the extraordinary quantity of one million words of American English and which, after being used to compile the American Heritage Dictionary, had been made fairly widely available to researchers. Despite its size, which now seems paltry, this corpus made possible an impressive number of studies and contributed greatly to the rise of language technologies... I was lucky enough to have access to the study before publication, and it left me somewhat dizzy... And what about French? Today I feel the fascination that astronomers must have felt when they first pointed Hubble toward an unexplored corner of the universe.

Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space. [Figure: Global geocoded tone of all Summary of World Broadcasts content, January 1979–April 2011, mentioning “Bin Laden”. Credit: UIC.] Computational analysis of large text archives can yield novel insights into the functioning of society, recent literature has suggested, including predicting future economic events, says Kalev Leetaru, Assistant Director for Text and Digital Media Analytics at the Institute for Computing in the Humanities, Arts, and Social Science at the University of Illinois and Center Affiliate of the National Center for Supercomputing Applications. The emerging field of “Culturomics” seeks to explore broad cultural trends through the computerized analysis of vast digital book archives, offering novel insights into the functioning of human society. Books represent the “digested history” of humanity, written with the benefit of hindsight. [Figure: Global geocoded tone of all New York Times content, 2005.]

Culturomics. Website by The Cultural Observatory at Harvard, directed by Erez Lieberman Aiden and Jean-Baptiste Michel. In 500 Billion Words, a New Window on Culture: The digital storehouse, which comprises words and short phrases as well as a year-by-year count of how often they appear, represents the first time a data set of this magnitude and searching tools are at the disposal of Ph.D.’s, middle school students and anyone else who likes to spend time in front of a small screen. It consists of the 500 billion words contained in books published between 1500 and 2008 in English, French, Spanish, German, Chinese and Russian. The intended audience is scholarly, but a simple online tool allows anyone with a computer to plug in a string of up to five words and see a graph that charts the phrase’s use over time, a diversion that can quickly become as addictive as the habit-forming game Angry Birds. With a click you can see that “women,” in comparison with “men,” is rarely mentioned until the early 1970s, when feminism gained a foothold. The lines eventually cross paths about 1986. The data set can be downloaded, and users can build their own search tools.
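The excerpt above describes a data set of phrases of up to five words with a year-by-year count of how often each appears. A minimal sketch of how such a table could be built from raw text follows. The `(year, text)` input format, the function names, and the two toy "books" are assumptions made up for illustration; they are not the layout or contents of the real data set.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a space-joined phrase."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def yearly_ngram_counts(books, max_n=5):
    """Map each phrase of 1..max_n words to a Counter of {year: occurrences}.

    `books` is an iterable of (year, text) pairs. This input format is a
    hypothetical simplification chosen for the example.
    """
    counts = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()  # naive whitespace tokenization
        for n in range(1, max_n + 1):
            for phrase in ngrams(tokens, n):
                counts[phrase][year] += 1
    return counts


# Toy input, invented for the example.
books = [
    (1970, "men and women"),
    (1986, "women and men and women"),
]
counts = yearly_ngram_counts(books)
print(dict(counts["women"]))
print(dict(counts["men and women"]))
```

Plotting `counts[phrase]` against the year gives the kind of usage-over-time curve the online tool draws. A real pipeline would also need proper tokenization and normalization by the total number of words published each year.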

Culturomics research uses quarter-century of media coverage to forecast human behavior. "Culturomics" is an emerging field of study into human culture that relies on the collection and analysis of large amounts of data. A previous culturomic research effort used Google's culturomic tool to examine a dataset made up of the text of about 5.2 million books to quantify cultural trends across seven languages and three centuries. Now a new research project has used a supercomputer to examine a dataset made up of a quarter-century of worldwide news coverage to forecast and visualize human behavior. The research used the large shared-memory supercomputer called Nautilus, which is part of the National Institute for Computational Sciences (NICS) network of advanced computing resources at Oak Ridge National Laboratory (ORNL) and boasts 1,024 cores and 4 terabytes of global shared memory. Tone: Leetaru says that examining the tone of a news story is one of the most important aspects of his version of culturomics and the most reliable metric for conflict. Location, location, location

When Google Books helps us understand our cultural genome. For once, we are going to say something nice about Google in this reading of the week, by way of an article by Ed Yong published on the Discover Magazine website in December 2010. The title of the article: “The cultural genome: Google Books reveals traces of fame, censorship and changing languages”. “Just as a fossil tells us things about the evolution of life on Earth,” explains Ed Yong, “the words written in books tell the story of humanity. They carry a story, not only through the sentences they form but also through how often they occur.” Fortunately, Yong continues, this is exactly what Google has been doing since 2004 with Google Books. Fifteen million books have been digitized so far, or 12% of all the books published to date. Now, a few results of this work: 1. Image: The growth in the variety of words and the difficulty dictionaries have in keeping up with it. 2.

MemeTracker: tracking news phrases over the web. Our adventures in culturomics, by Peter Aldhous, Jim Giles and MacGregor Campbell, reporters. Here in New Scientist's San Francisco bureau we can't resist an invitation to participate in an entirely new field of research. We soon thought we'd made a real culturomic discovery: nanotechnology has been around since 1899. (Note: to see the clear peaks you need to set the "smoothing" value to zero.) Then we saw the same pattern for searches relating to the internet and cutting-edge biology. Was the world blessed with some spookily prescient authors around the dawn of the twentieth century? But why do glitches cluster around 1899 and 1905? "For your nanotech example, the book has 1905 as publication date," Norvig says. The OCR software, it turns out, has problems with older typefaces and Latin words. So what did we learn, other than that Google's research chief has a sense of humour and that you need to be wary of "dirty" data when embarking on a new avenue of research?

Bluefin Mines Social Media To Improve TV Analytics. Googlefight! by Avraham Roos. At first sight, Googlefight seems like a total waste of time and (because of the fighting) even completely uneducational. When you type in two entries, Googlefight searches the Internet (using Google) for these two words or phrases and returns a frequency count for each. Why is this useful and how can we use it in class? Secondly, it could be used as a spell checker. A third possible use is to give students a lexical set, preferably taken out of a text, and ask them to guess which is more commonly used. I have been asked so many times: "Teacher, do people REALLY use perfect tenses or is that only something taught in class?" Another possible use could be to give students two lists, one with adjectives and one with nouns. Last but not least, you could use the site just for fun. Enjoy! Avraham Roos

Googlewhack. A Googlewhack is a type of contest for finding a Google search query consisting of exactly two words, without quotation marks, that returns exactly one hit. A Googlewhack must consist of two actual words found in a dictionary. A Googlewhack is considered legitimate if both of the searched-for words appear in the result page. Published Googlewhacks are short-lived: once one is published on a web site, the number of hits becomes at least two, one for the original hit and one for the publishing site.[1] History: The term Googlewhack first appeared on the web at UnBlinking on 8 January 2002;[2] the term was coined by Gary Stock. Participants discovered the sporadic "cleaner girl" bug in Google's search algorithm, where "results 1-1 of thousands" were returned for two relatively common words[3] such as Anxiousness Scheduler[4] or Italianate Tablesides.[5] Googlewhack went offline in November 2009 after Google stopped providing definition links.

English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU. Now we show the letter frequencies by position within word. That is, the frequencies for just the first letter in each word, just the second letter, and so on. We also show frequencies for positions relative to the end of the word: "-1" means the last letter, "-2" means the second to last, and so on. We can see that the frequencies vary quite a bit; for example, "e" is uncommon as the first letter (4 times less frequent than elsewhere); similarly "n" is 3 times less common as the first letter than it is overall. The letter "e" makes a comeback as the most common last letter (and also very common at 3rd and 5th letter places). [Table: frequency of each letter (e t a o i n s r h l d c u m f p g w y b v k x j q z) by position, counted from the start of the word (up to 7) and from the end (-7 to -1).] Two-Letter Sequence (Bigram) Counts: Now we turn to sequences of letters: consecutive letters anywhere within a word. [Table columns: BI, COUNT, PERCENT, bar graph. First row: TH, 100.3 B (3.56%).] N-Letter Sequences (N-grams): What are the most common n-letter sequences (called "n-grams") for various values of n? Closing Thoughts
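The two counts described above, letters by position from either end of a word and consecutive letter pairs within a word, can be sketched in a few lines. This is an illustrative sketch only, not the code behind the article's tables: the function names and the four-word toy list are assumptions, and each word here counts once, whereas a real corpus study would weight words by how often they occur.

```python
from collections import Counter, defaultdict


def letter_position_counts(words):
    """Count letters by position: 1, 2, ... from the start; -1, -2, ... from the end."""
    by_position = defaultdict(Counter)
    for word in words:
        word = word.lower()
        for i, letter in enumerate(word):
            by_position[i + 1][letter] += 1          # 1-based position from the start
            by_position[i - len(word)][letter] += 1  # negative position from the end
    return by_position


def bigram_counts(words):
    """Count pairs of consecutive letters inside each word (never across words)."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for first, second in zip(word, word[1:]):
            counts[first + second] += 1
    return counts


# Toy word list, invented for the example.
words = ["the", "then", "there", "other"]
positions = letter_position_counts(words)
bigrams = bigram_counts(words)
print(positions[1].most_common())   # letters seen in first position
print(positions[-1].most_common())  # letters seen in last position
print(bigrams["th"], bigrams["er"])
```

Dividing each count by the total for its position (or by the total number of bigrams) turns these counts into the percentages shown in tables of this kind.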

Ideas Illustrated: Visualizing English Word Origins. I have been reading a book on the development of the English language recently and I’ve become fascinated with the idea of word etymology, the study of words and their origins. It’s no secret that English is a great borrower of foreign words but I’m not enough of an expert to really understand what that means for my day-to-day use of the language. Simply reading about word history didn’t help me, so I decided that I really needed to see some examples. Using Douglas Harper’s online dictionary of etymology, I paired up words from various passages I found online with entries in the dictionary. For each word, I pulled out the first listed language of origin and then re-constructed the text with some additional HTML infrastructure. The results look like this: The quick brown fox jumps over the lazy dog. This simple sentence is constructed of eight distinct words and one word suffix. A second example shows more variety: What follows are five excerpts taken from a spectrum of written sources.
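The procedure described above (look up each word, take its language of origin, rebuild the text with extra HTML) might be sketched as follows. The author's actual markup and lookup code are not shown in the excerpt, so everything here is an assumption: the `tag_origins` name, the `<span class="...">` markup, the `unknown` fallback, and the placeholder label `lang-a`, which stands in for a real language name taken from a dictionary entry.

```python
import html
import re


def tag_origins(text, origin_of):
    """Wrap each word in a <span> whose class is its language-of-origin label.

    `origin_of` is a hypothetical {lowercase word: label} mapping supplied by
    the caller; words missing from it get the class "unknown".
    """
    parts = []
    # Alternate between runs of ASCII letters (words) and everything else.
    for match in re.finditer(r"([A-Za-z]+)|([^A-Za-z]+)", text):
        word, other = match.groups()
        if word is not None:
            label = origin_of.get(word.lower(), "unknown")
            parts.append(f'<span class="{label}">{word}</span>')
        else:
            parts.append(html.escape(other))  # keep punctuation and spaces safe for HTML
    return "".join(parts)


# Placeholder label, not real etymological data.
origins = {"quick": "lang-a"}
print(tag_origins("Quick fox.", origins))
```

A stylesheet can then assign one color per class so that the mix of origins in a passage is visible at a glance. Handling suffixes separately, as the post mentions, would need a further splitting step that this sketch omits.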