Processing and querying large web corpora with the COW14 architecture

In this paper, I present the COW14 tool chain, which comprises a web corpus creation tool called texrex, wrappers for existing linguistic annotation tools, and an online query application called Colibri2. Through detailed descriptions of the implementation and systematic evaluations of the software's performance on different types of systems, I show that the COW14 architecture is capable of handling the creation of corpora of up to at least 100 billion tokens. I also introduce our running demo system, which currently serves corpora of up to roughly 20 billion tokens in Dutch, English, French, German, Spanish, and Swedish.

Metadata
Author:	Roland Schäfer
URN:	urn:nbn:de:bsz:mh39-38367
Parent Title (English):	Proceedings of the 3rd Workshop on Challenges in the Management of Large Corpora (CMLC-3), Lancaster, 20 July 2015
Publisher:	Institut für Deutsche Sprache
Place of publication:	Mannheim
Editor:	Piotr Bański, Hanno Biber, Evelyn Breiteneder, Marc Kupietz, Harald Lüngen, Andreas Witt
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2015
Date of Publication (online):	2015/07/02
Publication state:	Published version
Review state:	Peer review
Tag:	Corpus annotation; Corpus query language; Corpus technology; Large corpora; Web corpus
GND Keyword:	Annotation; Datenbanksystem; Korpus <Linguistik>
First Page:	28
Last Page:	34
DDC classes:	400 Language / 410 Linguistics
Open Access?:	yes
Linguistics-Classification:	Corpus linguistics
Conferences, Workshops:	CMLC-3 / 3rd Workshop on Challenges in the Management of Large Corpora
Licence (German):	Creative Commons - Attribution-NonCommercial-NoDerivs 3.0 Germany
