Blog Post VI: English language corpora

What is a corpus? 

In principle, any collection of more than one text can be called a corpus (corpus being Latin for “body”, hence a corpus is any body of text). But the term “corpus”, when used in the context of modern linguistics, tends most frequently to have more specific connotations than this simple definition. The following list describes the four main characteristics of the modern corpus.

  • Sampling and representativeness
  • Finite size
  • Machine-readable form
  • A standard reference

(Tony McEnery & Andrew Wilson)

Features of Corpora

  • Texts prepared for use, usually with search tools
  • Authentic material
  • Speech or writing
  • Annotation:
    • tagging
    • parsing
    • prosody
    • other
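
To make the ideas of "search tools" and "tagging" a little more concrete, here is a minimal sketch in Python. It assumes a tiny hand-made corpus in which every word carries a part-of-speech tag; the texts, the tag names (DET, NOUN, VERB, PREP) and the function names `concordance` and `words_with_tag` are invented for illustration and do not come from any real corpus or tool.

```python
from typing import List, Tuple

# One token is a (word, tag) pair. The tags are illustrative only,
# not taken from any real tagset.
Token = Tuple[str, str]

# A tiny hand-made "corpus": more than one text, finite, machine-readable.
CORPUS: List[List[Token]] = [
    [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
     ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")],
    [("A", "DET"), ("dog", "NOUN"), ("chased", "VERB"),
     ("the", "DET"), ("cat", "NOUN")],
]


def concordance(corpus: List[List[Token]], keyword: str, window: int = 2) -> List[str]:
    """Return one keyword-in-context line per occurrence of keyword."""
    lines = []
    for text in corpus:
        words = [word for word, _tag in text]
        for i, word in enumerate(words):
            if word.lower() == keyword.lower():
                left = " ".join(words[max(0, i - window):i])
                right = " ".join(words[i + 1:i + 1 + window])
                lines.append(f"{left} [{word}] {right}".strip())
    return lines


def words_with_tag(corpus: List[List[Token]], tag: str) -> List[str]:
    """Use the annotation: collect every word that carries the given tag."""
    return [word for text in corpus for word, t in text if t == tag]


if __name__ == "__main__":
    for line in concordance(CORPUS, "cat"):
        print(line)
    print(words_with_tag(CORPUS, "VERB"))
```

The first function only needs the plain words, while the second one works only because the texts were annotated beforehand, which is the practical benefit of tagging.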

One example of an English language corpus is the one that group D presented in class: the BNC (British National Corpus), a huge corpus of 100 million words.

http://folk.uio.no/hhasselg/Metode/corpora.htm

 

Blog Post V: Corpus linguistics: TEI

TEI (Text Encoding Initiative) is the project of an international consortium for text markup. The initiative grew out of various professional associations in the field of the humanities.

TEI’s aim is to promote the use of rigorous and productive tags for any kind of text, though its most direct contribution is to texts of cultural and scientific value. Its recommendations are gathered in a compendium known as TEI P3, or the TEI Guidelines for Electronic Text Encoding and Interchange.

One project related to TEI is Multext:

 Multext encompasses a series of projects whose goals are to develop standards and specifications for the encoding and processing of linguistic corpora, and to develop tools, corpora and linguistic resources embodying these standards. Multext is developing tools, corpora, and linguistic resources for a wide variety of languages, including Bambara, Bulgarian, Catalan, Czech, Dutch, English, Estonian, French, German, Hungarian, Italian, Kikongo, Occitan, Romanian, Slovenian, Spanish, Swedish and Swahili. All Multext results are made freely and publicly available for non-commercial, non-military purposes.

TEI’s work is organised around four committees that share the responsibility of drawing up the guidelines.

*The Committee on Text Documentation is in charge of defining the tags that identify texts (origin, location, class, category, etc.).

*The Committee on Text Representation is concerned with describing texts physically and logically. The logical description covers matters such as structure (chapters, sections, etc.), typography, layout, notes, appendices and various references.

*The Committee on Text Analysis and Interpretation deals with the development of tags that allow the literary description of the languages of a text, as well as questions of intertextuality and indexing.

*The Committee on Metalinguistic Issues deals with the technical problems of the syntax used in the markup.
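
As a rough illustration of what such tags look like in practice, the sketch below parses a simplified, TEI-like fragment with Python's standard library. The fragment is invented for this post: its element names are loosely modelled on TEI, it has not been validated against any TEI schema, and it is written as XML only so that `xml.etree.ElementTree` can read it. The header part identifies the text (documentation) and the body part marks its chapters, paragraphs and notes (logical description). The function name `describe` is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, TEI-like sample invented for illustration.
# It is NOT validated against any TEI schema.
SAMPLE = """
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text</title></titleStmt>
      <sourceDesc><p>Invented for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter" n="1">
        <head>First chapter</head>
        <p>Opening paragraph.</p>
        <p>Second paragraph.<note>A marginal note.</note></p>
      </div>
      <div type="chapter" n="2">
        <head>Second chapter</head>
        <p>Closing paragraph.</p>
      </div>
    </body>
  </text>
</TEI>
"""


def describe(xml_text: str):
    """Return the title, a per-chapter summary and the number of notes."""
    root = ET.fromstring(xml_text.strip())
    # Documentation: the header says what the text is.
    title = root.findtext("teiHeader/fileDesc/titleStmt/title")
    # Logical structure: chapters, their headings and their paragraphs.
    chapters = [
        (div.get("n"), div.findtext("head"), len(div.findall("p")))
        for div in root.findall("text/body/div")
    ]
    # Notes can sit anywhere inside the text, so search the whole tree.
    notes = len(list(root.iter("note")))
    return title, chapters, notes


if __name__ == "__main__":
    print(describe(SAMPLE))
```

Because the structure is marked explicitly, a program can answer questions such as "how many paragraphs does chapter 1 have?" without guessing from the layout of the page.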

The questions resolved by these TEI committees give an idea of the degree of complexity and precision that the initiative is aiming for. The application of these guidelines points directly to the idea raised at the beginning: that of a radical transformation in the dissemination of, and access to, knowledge.

http://paginaspersonales.deusto.es/abaitua/konzeptu/corpus.htm