What is a corpus?
In principle, any collection of more than one text can be called a corpus (corpus is Latin for “body”, so a corpus is any body of text). In modern linguistics, however, the term usually carries more specific connotations than this simple definition. The following list gives the four main characteristics of the modern corpus.
- Sampling and representativeness
- Finite size
- Machine-readable form
- A standard reference
(Tony McEnery & Andrew Wilson)
Features of Corpora
- Texts prepared for use, usually with search tools
- Authentic material
- Speech or writing
- Annotation:
  - tagging
  - parsing
  - prosody
  - other
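To make "search tools" and "tagging" a bit more concrete, here is a minimal sketch in Python. It is not how any real corpus tool works internally. The two sample sentences, the function names (`tokenize`, `concordance`, `words_with_tag`) and the tag labels (DET, NOUN, VERB) are all assumptions made up for illustration. The first part shows a keyword-in-context search, the second shows tagging as simple (word, tag) pairs.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def words_with_tag(tagged_tokens, tag):
    """Return the words whose annotation matches the given tag."""
    return [word for word, t in tagged_tokens if t == tag]

# Hypothetical two-text mini corpus, invented for illustration only.
texts = [
    "A corpus is a body of text.",
    "The corpus was tagged and parsed by hand.",
]

# Search each text separately so the context never crosses a text boundary.
hits = [line for text in texts for line in concordance(tokenize(text), "corpus")]
for line in hits:
    print(line)

# Hand-assigned tags (an assumed, simplified tagset, not a real corpus tagset).
tagged = [("the", "DET"), ("corpus", "NOUN"), ("was", "VERB"), ("tagged", "VERB")]
print(words_with_tag(tagged, "NOUN"))
```

The search prints every hit for "corpus" with a few words of context on each side, which is roughly what a concordancer shows. Once the texts carry annotation, the same kind of search can be done by tag instead of by word form.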
One example of an English-language corpus is the one group D presented in class: the BNC (British National Corpus), a huge corpus of 100 million words.
http://folk.uio.no/hhasselg/Metode/corpora.htm