Corpora

The IViE Corpus. The audio files and associated materials available through this site constitute the revised and updated version of the IViE corpus. The original IViE website is still available. The IViE corpus contains recordings of nine urban dialects of English spoken in the British Isles. Recordings of male and female speakers were made in London, Cambridge, Cardiff, Liverpool, Bradford, Leeds, Newcastle, Belfast in Northern Ireland, and Dublin in the Republic of Ireland.

Three of our speaker groups are from ethnic minorities: we have recorded bilingual Punjabi/English speakers, bilingual Welsh/English speakers and speakers of Caribbean descent.

International Corpus of English (ICE) Homepage @ ICE-corpora.net.

MICASE Online Home Page. Michigan Corpus of Academic Spoken English. Welcome to our NEW interface to the on-line, searchable part of our collection of transcripts of academic speech events recorded at the University of Michigan.

There are currently 152 transcripts (totaling 1,848,364 words) available at this site.

Browse MICASE: Browse the corpus according to specified speaker and speech attributes, returning quick file references.

Search MICASE: Search the corpus for words or phrases in specified contexts, returning concordance results with references to files, full utterances, and speakers.

For additional information, see one of the following: Explanation of Transcription Conventions | Speech Event and Speaker Categories | Search Tips | General Information about MICASE.

We want to hear from you! We would like to know who is using this on-line corpus, how you found or heard about this site, and what you think of it.

Enron Email Dataset. This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders.

The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them (not me) that the dataset is available. The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees".
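As a rough illustration of working with a folder-organized email collection like the one described above, the sketch below parses one plain-text message with Python's standard `email` module and tallies message files per folder. The directory layout (`<root>/<user>/<folder>/<message file>`), the sample message, and the helper names `parse_message` and `count_messages_per_folder` are assumptions made for illustration; they are not details confirmed by the dataset page.

```python
import email
from collections import Counter
from pathlib import Path


def parse_message(raw_text):
    """Return the sender, subject, and body of one plain-text email message."""
    msg = email.message_from_string(raw_text)
    return {
        "from": msg.get("From", ""),
        "subject": msg.get("Subject", ""),
        # For a non-multipart message, get_payload() returns the body string.
        "body": msg.get_payload(),
    }


def count_messages_per_folder(root):
    """Count files under root, keyed by each file's folder path relative to root.

    Assumes (hypothetically) that every file under root is one message.
    """
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            counts[path.relative_to(root).parent.as_posix()] += 1
    return counts


# Hypothetical sample message, not taken from the dataset.
SAMPLE = "From: alice@example.com\nSubject: Meeting\n\nSee you at ten.\n"
parsed = parse_message(SAMPLE)
```

Since the dataset is described as having no attachments, the sketch treats each message as a single text body rather than walking MIME parts.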

I get a number of questions about this corpus each week, which I am unable to answer, mostly because they deal with preparation issues and such that I just don't know about.

Corpora, Collections, Data Archives.

1. British National Corpus (BNC) [100m wds; 1990s British English, spoken & written]: There are many different web sites giving free (but limited) access to the corpus. Access is limited due to copyright: i.e. you cannot expand the concordance context to read more of the surrounding text, & you cannot read the entire source texts (only snippets).

BNCweb: User-friendly, free interface (limited features, if no paid licence).

JustTheWord: The most accessible site for students from non-English-speaking backgrounds (& most pedagogically useful) because it straightaway gives you a list of collocations for your search word/phrase, instead of concordances; results are categorized by POS-based patterns & by approximate sense clusters, & graph bars give an indication of how common each combination is.

Results are based on an 80K-word subset of the BNC.

2. Corpus of Contemporary American English (COCA): [450 m wds; 20 m wds of American Eng each year from 1990-2012.]

Collocations. Corpora. Concordancers.

Corpora List.

Corpus Survey. [For an updated and expanded version of my survey, see Xiao, Richard (2008) "Well-known and influential corpora", in A. Lüdeling and M. Kytö (eds) Corpus Linguistics: An International Handbook [Volume 1]. Berlin: Mouton de Gruyter. 383-457]

1. Introduction
4. Brown, Frown, Pre-LOB, Kolhapur
6.4 The Dictionary of Old English Corpus in Electronic Form
6.5 Early English Books Online
6.6 The Corpus of Early English Correspondence
6.10 A Corpus of Late Eighteenth-Century Prose
6.11 A Corpus of Late Modern English Prose

Corpus Linguistics Online (语料库语言学在线).

British National Corpus.

Shakespeare corpus. The source texts came from the Online Library of Liberty. Their original source is the OUP edition of 1916. You get 37 plays, plus all the speeches of all the characters, i.e. you get the whole play Hamlet, plus separately all the speeches of Prince Hamlet, all the speeches of Horatio, etc. There is also a list of the plays and their dates.

All the files are saved in 16-bit Unicode. All stage directions are in angle brackets. They are further separated by pseudo-XML angle-bracketed tags. The plays are in the root of 3 folders (comedies, historical, tragedies) as appropriate. Mike Scott, mike (at) lexically.net.

CORPORA: 45-450 million words each. British National Corpus (BYU-BNC); Google Books: American English; Corpus of Contemporary American English (COCA); TIME Magazine Corpus of American English; Corpus of Historical American English (COHA).

BASE (British Academic Spoken English) and BASE Plus Collections.

Overview of BASE. The British Academic Spoken English (BASE) project took place at the Universities of Warwick and Reading between 2000 and 2005, under the directorship of Hilary Nesi (Warwick), with Paul Thompson (Reading).

Natalie Snodgrass and Sarah Creer were employed as research assistants and Tim Kelly was video producer of the project. Lou Burnard (Oxford University) and Adam Kilgarriff (Lexicography MasterClass Ltd) acted as consultants. The BASE Corpus consists of 160 lectures and 40 seminars recorded in a variety of departments (video-recorded at the University of Warwick and audio-recorded at the University of Reading). The corpus has been deposited in the Oxford Text Archive and is catalogued by the Arts and Humanities Data Service.

Funding. The early stages of corpus development were assisted by funding from the Universities of Warwick and Reading, BALEAP, EURALEX, and The British Academy (2000-2001, Grant reference: SG 30284).

Overview of BASE Plus