BLOG POST IV – Corpus

Atkins, Clear and Ostler (1992: 1) distinguish four fundamental types of what they generically call text collections:

* Archive: a repository of texts in electronic format in which the texts are not related or coordinated in any way, for example the Oxford Text Archive.

* Electronic text library (ETL): a collection of texts in electronic format that share a standardized format and follow certain conventions regarding content, but without rigorous selection constraints.

* Corpus: a subset of an ETL, built according to explicit selection criteria and for a specific purpose, for example the Cobuild Corpus or the Longman/Lancaster Corpus.

* Subcorpus: a portion of a corpus, either a static component of a larger or more complex corpus, or a selection made dynamically "on-line".

In EAGLES (1996b) we find a more specific typology of corpora, which distinguishes the following types:

* Reference corpus: built to be a representative sample of the most important varieties of a language, as well as of its structures and general vocabulary, so that it offers the widest possible information about the language and can serve as a basis for compiling grammars, dictionaries and reference works. The British National Corpus, the Bank of English and CREA are examples of reference corpora.

* Monitor corpus: a new type of corpus made possible by the enormous advances of recent years in both the storage capacity for electronic text and its processing. In the first monitor corpus model, Clear (1987) proposed a corpus of constant size, to which new material was continually added while equivalent amounts of old material were removed, giving the linguist the chance to observe recent changes in language use. As computer capacity grew, the idea of a continuous flow of text took shape, and at present it is not considered necessary to limit the size of the corpus, provided that it grows with a composition that can be considered equivalent to that of earlier and later stages.
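
The constant-size model attributed to Clear (1987) can be pictured as a sliding window over time. The following is only a minimal sketch in Python under assumptions of my own: the class name `MonitorCorpus`, the `max_tokens` budget and the whitespace tokenization are all hypothetical and are not taken from Clear or EAGLES. Whole texts are evicted, so the size stays at or below the budget rather than exactly constant.

```python
from collections import deque


class MonitorCorpus:
    """Toy monitor corpus: adding new text evicts the oldest texts (hypothetical design)."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens  # assumed size budget, in tokens
        self.texts = deque()          # (label, tokens) pairs, oldest first
        self.size = 0                 # current number of tokens

    def add(self, label, text):
        tokens = text.split()         # naive whitespace tokenization
        self.texts.append((label, tokens))
        self.size += len(tokens)
        # Drop the oldest whole texts until the corpus fits the budget again.
        while self.size > self.max_tokens and len(self.texts) > 1:
            _, old_tokens = self.texts.popleft()
            self.size -= len(old_tokens)

    def labels(self):
        return [label for label, _ in self.texts]


corpus = MonitorCorpus(max_tokens=6)
corpus.add("1985", "the old text here")
corpus.add("1986", "newer words")
corpus.add("1987", "latest usage sample")
print(corpus.labels(), corpus.size)
```

Removing the `max_tokens` limit would correspond to the later view described above, in which the corpus simply keeps growing as long as its composition stays comparable over time.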

http://elies.rediris.es/elies18/232.html

BLOG POST III – CLUVI

The CLUVI (Linguistic Corpus of the University of Vigo) is an open set of parallel textual corpora of specialized registers of contemporary Galician, developed by the SLI (Computational Linguistics Group of the University of Vigo) and publicly available on its website since September 2003. The CLUVI Corpus contains over 22 million words, and its main components are the TECTRA Corpus of English-Galician literary texts, the FEGA Corpus of French-Galician literary texts, the LEGA Corpus of Galician-Spanish legal texts, the UNESCO Corpus of English-Galician-French-Spanish scientific and technical popularization texts, the LOGALIZA Corpus of English-Galician software localization, and the CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information. The public search and browsing tool designed by the SLI is available at http://sli.uvigo.es/CLUVI/.

This web application supports both simple and very complex searches for isolated words or sequences of words, and shows the multilingual equivalents of the terms in context, as found in real, referenced translations. The search terms can belong to any of the languages of the translation, but it is also possible to carry out true multilingual searches, that is, to search simultaneously for one term in each of the languages of the translation. The number of aligned works and language pairs available on the website increases regularly, since the CLUVI is an active, ongoing academic research project. At the moment, the CLUVI Parallel Corpus webpage allows searching five major corpora (TECTRA, FEGA, LEGA, UNESCO and LOGALIZA), as well as other minor parallel corpora now in progress. The CLUVI interface also allows browsing the TURIGAL Corpus of Portuguese-English tourism texts and the Legebiduna Corpus of Basque-Spanish administrative texts, developed by the DEL group at the University of Deusto.
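
To make the idea of a "true multilingual search" concrete, here is a small Python sketch. It is not the CLUVI implementation: the `SEGMENTS` data are invented toy sentences, and the `search` function and the language codes used as dictionary keys are assumptions made for illustration. Each aligned segment is a dictionary with one sentence per language, and a query names one term per language; a segment matches only if every term is found in its own language.

```python
import re

# Invented toy data: each dict is one aligned translation unit.
SEGMENTS = [
    {"en": "The cat sleeps.", "gl": "O gato dorme.", "es": "El gato duerme."},
    {"en": "The house is big.", "gl": "A casa é grande.", "es": "La casa es grande."},
    {"en": "The cat is in the house.", "gl": "O gato está na casa.", "es": "El gato está en la casa."},
]


def search(segments, query):
    """Return the segments where every (language, term) pair matches as a whole word."""
    hits = []
    for segment in segments:
        if all(
            lang in segment
            and re.search(rf"\b{re.escape(term)}\b", segment[lang], re.IGNORECASE)
            for lang, term in query.items()
        ):
            hits.append(segment)
    return hits


# Search in one language and read off the equivalents in the others.
mono = search(SEGMENTS, {"en": "cat"})
# Multilingual search: one term per language must match in the same segment.
multi = search(SEGMENTS, {"en": "house", "gl": "gato"})

for segment in multi:
    print(segment["en"], "|", segment["gl"], "|", segment["es"])
```

Because the hits are whole aligned segments, a search in one language returns the equivalents in context in all the others, which is the behaviour the paragraph above describes.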
