Ziggurat: A new data model and indexing format for large annotated text corpora

The IMS Open Corpus Workbench (CWB) software currently uses a simple tabular data model with proven limitations. We outline and justify the need for a new data model to underlie the next major version of CWB. This data model, dubbed Ziggurat, defines a series of types of data layer to represent different structures and relations within an annotated corpus; each such layer may contain variables of different types. Ziggurat will allow us to gradually extend and enhance CWB’s existing CQP syntax for corpus queries, and will also make possible more radical departures, not only from the current version of CWB but also from other contemporary corpus-analysis software.

Metadata
Author:	Stefan Evert, Andrew Hardie
URN:	urn:nbn:de:bsz:mh39-38335
Parent Title (English):	Proceedings of the 3rd Workshop on Challenges in the Management of Large Corpora (CMLC-3), Lancaster, 20 July 2015
Publisher:	Institut für Deutsche Sprache
Place of publication:	Mannheim
Editor:	Piotr Bański, Hanno Biber, Evelyn Breiteneder, Marc Kupietz, Harald Lüngen, Andreas Witt
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2015
Date of Publication (online):	2015/07/02
Publication state:	Published version
Review state:	Peer-reviewed
Tag:	Corpus annotation; Corpus linguistics; Corpus query language; Corpus technology; Large corpora
GND Keyword:	Annotation; Datenbanksystem; Korpus <Linguistik>
First Page:	21
Last Page:	27
DDC classes:	400 Language / 410 Linguistics
Open Access?:	yes
Linguistics-Classification:	Corpus linguistics
Conferences, Workshops:	CMLC-3 / 3rd Workshop on Challenges in the Management of Large Corpora
Licence:	Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Germany (CC BY-NC-ND 3.0 DE)
