Blog Post VI: English language corpora

What is a corpus? 

In principle, any collection of more than one text can be called a corpus (corpus being Latin for “body”, hence a corpus is any body of text). But when the term “corpus” is used in modern linguistics, it usually carries more specific connotations than this simple definition. The following list gives the four main characteristics of the modern corpus.

  • Sampling and representativeness
  • Finite size
  • Machine-readable form
  • A standard reference

(Tony McEnery & Andrew Wilson)

Features of Corpora

  • Texts prepared for use, usually with search tools
  • Authentic material
  • Speech or writing
  • Annotation:
    • tagging
    • parsing
    • prosody
    • other
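
To make the "search tools" feature above a little more concrete, here is a minimal sketch of a keyword-in-context (KWIC) search, the kind of concordance lookup that corpus tools typically offer. The function name `kwic`, the `width` parameter and the sample sentence are made up for illustration; this is not how the BNC or any particular corpus tool is implemented.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: up to `width` words either side of each hit."""
    # Very rough tokenisation: runs of letters and apostrophes (an assumption,
    # real corpora use far more careful tokenisers).
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append(f"{left} [{tok}] {right}".strip())
    return hits

# Hypothetical two-sentence "corpus", just to show the output format.
sample = "A corpus is a body of text. A modern corpus is machine-readable."
for line in kwic(sample, "corpus"):
    print(line)
```

A real corpus would also carry the annotation listed above (for example part-of-speech tags), so a search could be restricted to, say, only the noun uses of a word.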

One example of an English language corpus is the one group D presented in class: the BNC (British National Corpus), a huge corpus of 100 million words.

http://folk.uio.no/hhasselg/Metode/corpora.htm

 
