English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU Now we show the letter frequencies by position within word. That is, the frequencies for just the first letter in each word, just the second letter, and so on. We also show frequencies for positions relative to the end of the word: "-1" means the last letter, "-2" means the second to last, and so on. We can see that the frequencies vary quite a bit; for example, "e" is uncommon as the first letter (4 times less frequent than elsewhere); similarly "n" is 3 times less common as the first letter than it is overall. [Figure: frequency of each letter (e t a o i n s r h l d c u m f p g w y b v k x j q z) by position within the word, with columns for positions 2 through 7 and -7 through -1.] Two-Letter Sequence (Bigram) Counts Now we turn to sequences of letters: consecutive letters anywhere within a word. [Table columns: bigram, count, percent, bar graph. First row: TH, 100.3 B, 3.56%.] Below is a table of all 26 × 26 = 676 bigrams; in each cell the orange bar is proportional to the frequency, and if you hover you can see the exact counts and percentage.
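The two kinds of count described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the original article: the function names, the ASCII-only tokenizer, and the one-sentence sample text are all assumptions made for the example, and a real analysis would run over a much larger corpus.

```python
from collections import Counter
import re


def words_of(text):
    """Lowercase the text and return its runs of ASCII letters (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def positional_counts(words, position):
    """Count letters at one position: 1 is the first letter, -1 is the last."""
    if position == 0:
        raise ValueError("position must be a non-zero integer")
    # Convert the 1-based position to a Python index; negatives already match.
    index = position - 1 if position > 0 else position
    counts = Counter()
    for word in words:
        # Skip words too short to have a letter at this position.
        if -len(word) <= index < len(word):
            counts[word[index]] += 1
    return counts


def bigram_counts(words):
    """Count pairs of consecutive letters within each word, never across words."""
    counts = Counter()
    for word in words:
        for first, second in zip(word, word[1:]):
            counts[first + second] += 1
    return counts


# Illustrative sample text only; it is not the corpus used in the article.
sample = "the quick brown fox jumps over the lazy dog"
words = words_of(sample)
first_letters = positional_counts(words, 1)
last_letters = positional_counts(words, -1)
bigrams = bigram_counts(words)
```

Dividing each count by the total for its position (or by the total number of bigrams) would turn these counts into percentages like the ones quoted in the text.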

Ideas Illustrated » Blog Archive » Visualizing English Word Origins I have been reading a book on the development of the English language recently and I’ve become fascinated with the idea of word etymology — the study of words and their origins. It’s no secret that English is a great borrower of foreign words but I’m not enough of an expert to really understand what that means for my day-to-day use of the language. Simply reading about word history didn’t help me, so I decided that I really needed to see some examples. Using Douglas Harper’s online dictionary of etymology, I paired up words from various passages I found online with entries in the dictionary. The results look like this: The quick brown fox jumps over the lazy dog. This simple sentence is constructed of eight distinct words and one word suffix. A second example shows more variety: Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony. What follows are five excerpts taken from a spectrum of written sources. Passage #1: American Literature

Search engine data visualisations | Search insights I’ve decided I need a single place to put all of the search engine data visuals that I’ve been working on. The visuals are made up of thousands of actual queries put into search engines by UK users over the course of a year. This gives us an idea of ‘search demand’, which can/may/should equal actual, offline demand for a topic. Feel free to republish; however, please link to this blog and also to James Webb, who helped to create them. They can be downloaded as PDFs at the bottom of this page. Click the links below to open the visuals in PDF format for better quality printing / viewing: Overall, Gardening, Health, Science, Nature, History, Questions.

Googlewhack A Googlewhack is a type of contest for finding a Google search query consisting of exactly two words without quotation marks, that returns exactly one hit. A Googlewhack must consist of two actual words found in a dictionary. A Googlewhack is considered legitimate if both of the searched-for words appear in the result page. Published googlewhacks are short-lived, since when published to a web site, the new number of hits will become at least two, one to the original hit found, and one to the publishing site.[1] History The term Googlewhack first appeared on the web at UnBlinking on 8 January 2002;[2] the term was coined by Gary Stock. Participants discovered the sporadic "cleaner girl" bug in Google's search algorithm where "results 1-1 of thousands" were returned for two relatively common words[3] such as Anxiousness Scheduler[4] or Italianate Tablesides.[5] Googlewhack went offline in November 2009 after Google stopped providing definition links.

Googlefight! by Avraham Roos At first sight, googlefight seems like a total waste of time and (because of the fighting) even completely uneducational. But think again. What you are looking at is actually one of the largest free web-based corpora. And it is quite a big corpus if you realise that search engines index about 300 million pages. When you type in two entries, Googlefight searches the Internet (using Google) for these two words/ phrases and returns a frequency count for each. Why is this useful and how can we use it in class? Secondly it could be used as a spell checker. A third possible use is to give students a lexical set preferably taken out of a text and ask them to guess which is more commonly used. I have been asked so many times: "Teacher, do people REALLY use perfect tenses or is that only something taught in class?" Another possible use could be to give students two lists, one with adjectives and one with nouns. Enjoy! Avraham Roos

Bluefin Mines Social Media To Improve TV Analytics The most-used words in advertising slogans created in 2012 Welcome to the Observatory of advertising slogans. In these pages we present the rankings that emerge from the daily census of slogans used in France, carried out by Souslelogo over the past year. We have kept only the rankings with the least statistical bias, so that the results remain relevant. These data show how brands expressed themselves in France through their slogans (claims or brand signatures). You may reproduce these results and use them as you wish. We simply ask that you cite the source each time you reproduce a ranking or part of one: © Souslelogo 2014. If you would like to explore our database and run custom sorts, just contact us; we will consider your request carefully and let you know its feasibility, cost and turnaround time.

Our adventures in culturomics Peter Aldhous, Jim Giles and MacGregor Campbell, reporters Here in New Scientist's San Francisco bureau we can't resist an invitation to participate in an entirely new field of research. We soon thought we'd made a real culturomic discovery: nanotechnology has been around since 1899: (Note, to see the clear peaks you need to set the "smoothing" value to zero.) Then we saw the same pattern for searches relating to the internet and cutting-edge biology. Was the world blessed with some spookily prescient authors around the dawn of the twentieth century? But why do glitches cluster around 1899 and 1905? "For your nanotech example, the book has 1905 as publication date," Norvig says. The OCR software, it turns out, has problems with older typefaces and Latin words. So what did we learn, other than that Google's research chief has a sense of humour and that you need to be wary of "dirty" data when embarking on a new avenue of research?

MemeTracker: tracking news phrases over the web When Google Books helps us understand our cultural genome For once, we are going to say something good about Google in this reading of the week, by way of an article published on the Discover Magazine site in December 2010, written by Ed Yong. The title of the article: “The cultural genome: Google Books reveals traces of fame, censorship and changing languages”. “In the same way that a fossil tells us things about the evolution of life on Earth,” Ed Yong explains, “the words written in books tell the history of humanity.” Fortunately, Yong continues, this is exactly what Google has been doing since 2004 with Google Books. 15 million books have been digitized to date, or 12% of all the books ever published. The team worked on a third of the total corpus: 5 million books published in English, French, Spanish, German, Chinese, Russian and Hebrew, dating back to the 16th century. [Image: the evolution of what we eat.]