Corpus Sites

Lang Learner Corpora

Teaching w Corpora. Spanish. Gendered Language in Teaching Evaluations. WordSmith main page. Windows software for finding word patterns. Published by Lexical Analysis Software and Oxford University Press since 1996. Concord ... for finding all instances of a word or phrase.

WordSmith main page

KeyWords ... helps find salient words in a text or set of texts. WordList ... lists the words in your text(s) in alphabetical and frequency order. ALC search. WebCorp: The Web as Corpus. WebCorp Live lets you access the Web as a corpus - a large collection of texts from which examples of real language use can be extracted.
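The WordList description above (words listed in alphabetical and in frequency order) can be illustrated with a minimal sketch. This is not WordSmith's implementation; the function name `word_list`, the tokenizing pattern, and the sample text are assumptions made for illustration only.

```python
import re
from collections import Counter

def word_list(text):
    """Return two lists of (word, count) pairs: alphabetical and by frequency.

    Hypothetical helper: tokenization is a simple lowercase letter match,
    which is an assumption and not how any particular tool tokenizes.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    alphabetical = sorted(counts.items())
    # Highest count first; ties are broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_frequency

sample = "The cat sat on the mat. The mat was flat."
alpha, freq = word_list(sample)
for word, count in freq:
    print(f"{word}\t{count}")
```

Real word-list tools also handle hyphens, numbers, and non-English alphabets, which this sketch ignores.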

WebCorp: The Web as Corpus

Have you tried WebCorp LSE? Our large-scale search engine with more search options, part-of-speech tags and quantitative analyses. Enter the word or phrase you wish to search for in this box. A case-insensitive search will match both upper- and lower-case variants of the search terms. Span sets the number of words or characters displayed as the left and right contexts of the search term. KWiCFinder Web Concordancer & Online Research Tool.
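The two search options described above, case-insensitive matching and a span of left and right context, are the core of a key-word-in-context (KWIC) display. The following is a minimal sketch under stated assumptions: it is not WebCorp's or KWiCFinder's code, the function `kwic` and its parameters are hypothetical, it handles single-word search terms only, and it measures span in words (not characters).

```python
import re

def kwic(text, term, span=3, ignore_case=True):
    """Return (left context, match, right context) tuples for each hit.

    Hypothetical helper: `span` is the number of words shown on each side.
    """
    tokens = re.findall(r"\w+", text)
    target = term.lower() if ignore_case else term
    lines = []
    for i, tok in enumerate(tokens):
        candidate = tok.lower() if ignore_case else tok
        if candidate == target:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines

sample = "Corpus tools help. A corpus is a collection of texts."
hits = kwic(sample, "corpus", span=2)
for left, match, right in hits:
    # Right-align the left context so the search term lines up in a column.
    print(f"{left:>12} [{match}] {right}")
```

With `ignore_case=False`, the same call matches only the lower-case occurrence, which mirrors the case-sensitivity option described above.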

ELAN - The Language Archive. Free CLAWS WWW tagger. Using CLAN. Warning: After installing a new version of CLAN for use with old data, you will need to get a new version of the MOR grammar and run MOR, POST, and CHECK again on your old data to make sure they work with the newer format.

Using CLAN

Alternatively, you may wish to continue using old versions of CLAN with old versions of corpora. However, CHILDES data on the web are always updated to run with new versions of CLAN. For Windows: CLANWin is for Windows XP, Vista, 7, 2000. Windows 95, 98, or ME are no longer supported. Also, you will need to install QuickTime and Unicode fonts. Japanese installation instructions are here 日本. For Macintosh: CLAN is for Mac OS X users from 10.6 and up. If you need Unicode fonts, you can get them from here. Unix Installation: For Unix users, we are distributing the source code for CLAN.

Laurence Anthony's AntConc. Older Versions: All previous releases of AntConc can be found at the following link.

Laurence Anthony's AntConc

<.exe> files are for Windows. <.zip> files are for Macintosh OS X. <.tar.gz> files are for Linux. All previous releases Development Version Screenshot Viewer. Santa Barbara Corpus of Spoken American English. Parts 1-4 of the Santa Barbara Corpus of Spoken American English (SBCSAE) are now available, for a total of approximately 249,000 words.

Santa Barbara Corpus of Spoken American English

The Santa Barbara Corpus includes transcriptions, audio, and timestamps which correlate transcription and audio at the level of individual intonation units. Access: All transcriptions in the Santa Barbara Corpus parts 1-4 can be downloaded for free by clicking here. Metadata is available here. To access individual conversations and other discourse segments in the Santa Barbara Corpus, you may select the audio file and transcription you wish to download by consulting the Contents and Summaries. To download the audio files in WAV (recommended) or MP3 format, do the following: Select the transcription you want (e.g. Alternatively, you can do the following: Select a transcription (e.g. Part 1: LDC Catalog No. Open Language Archives Community.

MICUSP Simple interface (BETA) ELISA - English Language Interview Corpus as a Second-Language Learning Application. The ELISA corpus is being developed at the University of Tuebingen (Dept of Applied English Linguistics, AEL) and the University of Surrey (Dept of Languages and Translation Studies, LTS) as a resource for language learning and teaching, and interpreter training.

ELISA - English Language Interview Corpus as a Second-Language Learning Application

It contains interviews with native speakers of English. They talk about their professional career (e.g. in tourism, politics, the media or environmental education). We are very grateful to all speakers for their kind contributions. This demo website contains selected materials from the ELISA corpus. more information, acknowledgements, availability and copyright). ELFA Project – University of Helsinki. On this page you can find: See also: Description of the ELFA corpus project The ELFA corpus was completed in 2008 and its development work is ongoing.

ELFA Project – University of Helsinki

Corpus of Historical American English (COHA) Business Letter Corpus - KWIC Concordancer: Japanese Business ppl w errors. LOCNESS / ICLE - 'non-expert' teen/young adult essays. CHILDES - Child Language Data Exchange System - 'non-expert'. COLT: The Bergen Corpus Of London Teenage Language. VOICE - Project - 'Lingua Franca corpus' In the early 21st century, English in the world finds itself in an “unstable equilibrium”: On the one hand, the majority of the world's English users are not native speakers of the language, but use it as an additional language, as a convenient means for communicative interactions that cannot be conducted in their mother tongues.

VOICE - Project - 'Lingua Franca corpus'

On the other hand, linguistic descriptions have as yet predominantly been focusing on English as it is spoken and written by its native speakers. Open Data for Language Research and Education. TalkBank - Corpus w Audio/Video. Geoffrey Sampson: SUSANNE Scheme - Parsed Corpus. Geoffrey Sampson The Need for Grammatical Taxonomy Since the 1990s, the exciting growth-area in linguistics has been corpus linguistics: studying how English and other languages are used in real life, through analysis of large electronic samples – “corpora” – of spoken or written usage.

Geoffrey Sampson: SUSANNE Scheme - Parsed Corpus

In 2004, together with my colleague Diana McCarthy I edited an anthology of papers illustrating the diverse strengths of modern corpus linguistics. Many findings of corpus linguistics shed new light on the nature of language as a human ability. But corpus analysis is crucial also for enabling computers to process human language. (To get a sense of the massive variety of annotation practices which have emerged from the lack, in the past, of any explicit public taxonomy that researchers could choose to standardize on, see the catalogue compiled by the Linguistic Data Consortium.) Hong Kong Corpus of Spoken English - basic search 'speech corpora' or 'world englishes' ARCHER (The University of Manchester): British & American English 1650-1990.

Compleat Lexical Tutor. International Corpus of English (ICE) Homepage @ ICE-corpora.net. Corpus of Contemporary American English (COCA) [bnc] British National Corpus. Brown Corpus : Nelson Francis and Henry Kucera. Michigan Corpus of Academic Spoken English. Helsinki Corpus (HC): AD850-1710 OLD ENGLISH. The Helsinki Corpus of English Texts is a structured multi-genre diachronic corpus, which includes periodically organized text samples from Old, Middle and Early Modern English.

Helsinki Corpus (HC): AD850-1710 OLD ENGLISH

Each sample is preceded by a list of parameter codes giving information on the text and its author. The Corpus is useful particularly in the study of the change of linguistic features in long diachrony.