Corpus of Historical American English (COHA)

Shakespeare corpus
The source texts came from the Online Library of Liberty. Their original source is the OUP edition of 1916. You get 37 plays, plus all the speeches of all the characters, i.e. you get the whole play Hamlet, plus, separately, all the speeches of Prince Hamlet, all the speeches of Horatio, etc. There is also a list of the plays and their dates. All the files are saved in 16-bit Unicode. The plays are in the root of 3 folders (comedies, historical, tragedies) as appropriate. Mike Scott
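Because the files are saved in 16-bit Unicode, opening them with a default 8-bit encoding will usually produce garbage. The sketch below shows one way to read them in Python, assuming that "16-bit Unicode" means UTF-16 with a byte-order mark and that the plays are stored as `.txt` files; both points, and the function names, are assumptions made for illustration and are not stated by the corpus description.

```python
from pathlib import Path

# Folder names as given in the corpus description.
GENRE_FOLDERS = ("comedies", "historical", "tragedies")


def read_utf16(path):
    """Read one corpus file, assuming UTF-16 (the BOM decides the byte order)."""
    return Path(path).read_text(encoding="utf-16")


def list_plays(corpus_root):
    """Map each genre folder that exists to the .txt files directly in its root.

    The .txt extension is an assumption; adjust the glob pattern if the
    files use a different one.
    """
    root = Path(corpus_root)
    return {
        genre: sorted(p.name for p in (root / genre).glob("*.txt"))
        for genre in GENRE_FOLDERS
        if (root / genre).is_dir()
    }
```

Only files in the root of each genre folder are listed, so the per-character speech files are skipped if they live in subfolders.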

Online Etymology Dictionary

BASE (British Academic Spoken English) and BASE Plus Collections
Overview of BASE: The British Academic Spoken English (BASE) project took place at the Universities of Warwick and Reading between 2000 and 2005, under the directorship of Hilary Nesi (Warwick), with Paul Thompson (Reading). Natalie Snodgrass and Sarah Creer were employed as research assistants and Tim Kelly was video producer of the project. Lou Burnard (Oxford University) and Adam Kilgarriff (Lexicography MasterClass Ltd) acted as consultants. The BASE Corpus consists of 160 lectures and 40 seminars recorded in a variety of departments (video-recorded at the University of Warwick and audio-recorded at the University of Reading). It contains 1,644,942 tokens in total (lectures and seminars). The corpus has been deposited in the Oxford Text Archive and is catalogued by the Arts and Humanities Data Service.
Overview of BASE Plus: BASE Plus is a larger collection of British Academic Spoken English data held at the Centre for Applied Linguistics.

Corpora, Collections, Data Archives
1. British National Corpus (BNC) [100 million words; 1990s British English, spoken and written]: There are many different web sites giving free but limited access to the corpus. Access is limited due to copyright: you cannot expand the concordance context to read more of the surrounding text, and you cannot read the entire source texts (only snippets). BNCweb: a user-friendly, free interface (limited features without a paid licence). JustTheWord: the most accessible site for students from non-English-speaking backgrounds, and the most pedagogically useful, because it immediately gives you a list of collocations for your search word or phrase instead of concordances. Results are categorized by POS-based patterns and by approximate sense clusters, and bar graphs indicate how common each combination is. Results are based on an 80-million-word subset of the BNC.
2. Corpus of Contemporary American English (COCA) [450 million words; 20 million words of American English each year from 1990-2012].
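The "concordance context" mentioned for the BNC interfaces is the window of text shown on either side of each hit. As a rough illustration of what such a keyword-in-context (KWIC) display involves, here is a minimal sketch for any plain text you are free to process locally. The function name `kwic` and the character-based window are choices made for this example; the BNC sites themselves may work quite differently.

```python
import re


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for every whole-word match.

    Each line shows up to `width` characters of left context (right-aligned),
    the match in square brackets, and up to `width` characters of right context.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines
```

Calling `kwic(some_text, "corpus", width=40)` returns one line per hit. Raising `width` corresponds to expanding the context, which is exactly what the free BNC interfaces restrict.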

MICASE Online Home Page
Michigan Corpus of Academic Spoken English. Welcome to our new interface to the online, searchable part of our collection of transcripts of academic speech events recorded at the University of Michigan. There are currently 152 transcripts (totaling 1,848,364 words) available at this site.
Browse MICASE: Browse the corpus according to specified speaker and speech attributes, returning quick file references.
Search MICASE: Search the corpus for words or phrases in specified contexts, returning concordance results with references to files, full utterances, and speakers.
For additional information, see one of the following: Explanation of Transcription Conventions | Speech Event and Speaker Categories | Search Tips | General Information about MICASE

Enron Email Dataset
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. I get a number of questions about this corpus each week, which I am unable to answer, mostly because they deal with preparation issues and such that I just don't know about. I am distributing this dataset as a resource for researchers who are interested in improving current email tools, or understanding how email is currently used. The March 2, 2004 and August 21, 2009 versions of the dataset are no longer being distributed.
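For readers who want to work with the messages programmatically, the following sketch parses a single message with Python's standard `email` package. It assumes each message is stored as plain text with standard headers (`From`, `To`, `Subject`) followed by a blank line and the body; that layout, and the helper name `parse_message`, are assumptions for illustration and are not taken from the dataset documentation.

```python
from email import message_from_string
from email.policy import default


def parse_message(raw):
    """Parse one raw message string into a small dictionary.

    Missing headers come back as empty strings; the body is the plain-text
    part if one exists, otherwise an empty string.
    """
    msg = message_from_string(raw, policy=default)
    body_part = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg.get("From", "")),
        "to": str(msg.get("To", "")),
        "subject": str(msg.get("Subject", "")),
        "body": body_part.get_content() if body_part is not None else "",
    }
```

Given the integrity problems mentioned above, it is probably wise to treat every field as possibly missing or malformed, which is why the sketch falls back to empty strings.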

The IViE Corpus
The audio files and associated materials available through this site constitute the revised and updated version of the IViE corpus. The original IViE website is still available. The IViE corpus contains recordings of nine urban dialects of English spoken in the British Isles. Recordings of male and female speakers were made in London, Cambridge, Cardiff, Liverpool, Bradford, Leeds, Newcastle, Belfast in Northern Ireland and Dublin in the Republic of Ireland. Three of our speaker groups are from ethnic minorities: we have recorded bilingual Punjabi/English speakers, bilingual Welsh/English speakers and speakers of Caribbean descent.

