Refine
Year of publication
- 2015 (6)
Document Type
Language
- English (6)
Has Fulltext
- yes (6)
Keywords
- Corpus annotation (6) (remove)
Publicationstate
Reviewstate
- Peer-Review (6)
Publisher
With an increasing amount of text data available it is possible to automatically extract a variety of information about language. One way to obtain knowledge about subtle relations and analogies between words is to observe words which are used in the same context. Recently, Mikolov et al. proposed a method to efficiently compute Euclidean word representations which seem to capture subtle relations and analogies between words in the English language. We demonstrate that this method also captures analogies in the German language. Furthermore, we show that we can transfer information extracted from large non-annotated corpora into small annotated corpora, which are then, in turn, used for training NLP systems.
The IMS Open Corpus Workbench (CWB) software currently uses a simple tabular data model with proven limitations. We outline and justify the need for a new data model to underlie the next major version of CWB. This data model, dubbed Ziggurat, defines a series of types of data layer to represent different structures and relations within an annotated corpus; each such layer may contain variables of different types. Ziggurat will allow us to gradually extend and enhance CWB’s existing CQP-syntax for corpus queries, and also make possible more radical departures relative not only to the current version of CWB but also to other contemporary corpus-analysis software.
The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, if we can exploit the data properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as partof- speech tagging, lemmatisation, chunking, and dependency parsing facilitate precise querying of linguistic properties and can be used to extend word alignment to sub-sentential groups. Such highly interconnected data is stored in a relational database to allow for efficient retrieval and linguistic data mining, which may include the statistics-based selection of good example sentences. The varying information needs of contrastive linguists require a flexible linguistic query language for ad hoc searches. Such queries in the format of generalised treebank query languages will be automatically translated into SQL queries.
In this paper, I present the COW14 tool chain, which comprises a web corpus creation tool called texrex, wrappers for existing linguistic annotation tools as well as an online query software called Colibri2. By detailed descriptions of the implementation and systematic evaluations of the performance of the software on different types of systems, I show that the COW14 architecture is capable of handling the creation of corpora of up to at least 100 billion tokens. I also introduce our running demo system which currently serves corpora of up to roughly 20 billion tokens in Dutch, English, French, German, Spanish, and Swedish
Contents:
1. Michal Křen: Recent Developments in the Czech National Corpus, S. 1
2. Dan Tufiş, Verginica Barbu Mititelu, Elena Irimia, Stefan Dumitrescu, Tiberiu Boros, Horia Nicolai Teodorescu: CoRoLa Starts Blooming – An update on the Reference Corpus of Contemporary Romanian Language, S. 5
3. Sebastian Buschjäger, Lukas Pfahler, Katharina Morik: Discovering Subtle Word Relations in Large German Corpora, S. 11
4. Johannes Graën, Simon Clematide: Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora, S. 15
5. Stefan Evert, Andrew Hardie: Ziggurat: A new data model and indexing format for large annotated text corpora, S. 21
6. Roland Schäfer: Processing and querying large web corpora with the COW14 architecture, S. 28
7. Jochen Tiepmar: Release of the MySQL-based implementation of the CTS protocol, S. 35
The Czech National Corpus (CNC) is a longterm project striving for extensive and continuous mapping of the Czech language. This effort results mostly in compilation, maintenance and providing free public access to a range of various corpora with the aim to offer a diverse, representative, and high-quality data for empirical research mainly in linguistics. Since 2012, the CNC is officially recognized as a research infrastructure funded by the Czech Ministry of Education, Youth and Sports which has caused a recent shift towards user service-oriented operation of the project. All project-related resources are now integrated into the CNC research portal at http://www.korpus.cz/. Currently, the CNC has an established and growing user community of more than 4,500 active users in the Czech Republic and abroad who put almost 1,900 queries per day using one of the user interfaces. The paper discusses the main CNC objectives for each particular domain, aiming at an overview of the current situation supplemented by an outline of future plans.