Corpus Linguistics
This article reports on ongoing work on a new version of the Component Metadata Infrastructure (CMDI), the metadata framework central to the CLARIN infrastructure. Version 1.2 introduces a number of important changes based on experience gathered over the last five years of intensive use of CMDI by the digital humanities community; it addresses problems encountered and also introduces new functionality. Besides consolidating the structure of the model and improving schema sanity, it introduces new means of lifecycle management aimed at combating the observed proliferation of components. A new mechanism for using external vocabularies will contribute to more consistent use of controlled values, and cues for tools will allow improved presentation of metadata records to human users. The feature set has been frozen and approved, and the infrastructure is now entering a transition phase in which all tools and data need to be migrated to the new version.
Using the Google Ngram Corpora for six different languages (including two varieties of English), a large-scale time series analysis is conducted. It is demonstrated that diachronic changes in the parameters of the Zipf–Mandelbrot law (and in the parameter of the Zipf law, all estimated by maximum likelihood) can be used to quantify and visualize important aspects of linguistic change (as represented in the Google Ngram Corpora). The analysis also reveals important cross-linguistic differences. It is argued that the Zipf–Mandelbrot parameters can serve as a first indicator of diachronic linguistic change, but that more thorough analyses should draw on the full spectrum of lexical, syntactic and stylometric measures to understand the factors that actually drive those changes.
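To make the estimation step more concrete, the following is a minimal sketch of a maximum likelihood fit of the Zipf–Mandelbrot law to rank-ordered frequencies. It assumes the common parameterization p(r) proportional to (r + beta)^(-alpha), truncated at the number of observed types, with plain Zipf as the special case beta = 0. The function names, the search ranges and the simple grid refinement are illustrative assumptions; the abstract does not state which optimizer or truncation the authors used.

```python
import math


def neg_log_likelihood(freqs, alpha, beta):
    """Negative log-likelihood of rank-ordered frequencies under
    p(r) = (r + beta)^(-alpha) / sum_k (k + beta)^(-alpha), k = 1..N."""
    log_shifted = [math.log(r + beta) for r in range(1, len(freqs) + 1)]
    log_norm = math.log(sum(math.exp(-alpha * t) for t in log_shifted))
    weighted = sum(f * t for f, t in zip(freqs, log_shifted))
    return alpha * weighted + sum(freqs) * log_norm


def fit_zipf_mandelbrot(freqs, rounds=8, steps=20):
    """Crude MLE by repeated grid refinement (hypothetical helper,
    a stand-in for a proper numerical optimizer)."""
    freqs = sorted(freqs, reverse=True)  # rank 1 = most frequent type
    a_lo, a_hi = 0.5, 2.5    # assumed search range for alpha
    b_lo, b_hi = 0.0, 10.0   # assumed search range for beta
    best = (a_lo, b_lo)
    for _ in range(rounds):
        a_step = (a_hi - a_lo) / steps
        b_step = (b_hi - b_lo) / steps
        grid = [(a_lo + i * a_step, b_lo + j * b_step)
                for i in range(steps + 1) for j in range(steps + 1)]
        best = min(grid, key=lambda ab: neg_log_likelihood(freqs, ab[0], ab[1]))
        # Zoom in around the current best point; keep beta non-negative.
        a_lo, a_hi = max(best[0] - 4 * a_step, 1e-6), best[0] + 4 * a_step
        b_lo, b_hi = max(best[1] - 4 * b_step, 0.0), best[1] + 4 * b_step
    return best


if __name__ == "__main__":
    # Toy type frequencies for one time slice (made-up numbers).
    counts = [1200, 640, 410, 300, 230, 190, 150, 130, 110, 95]
    alpha, beta = fit_zipf_mandelbrot(counts)
    print(f"alpha={alpha:.3f}, beta={beta:.3f}")
```

Running such a fit separately for each year of a corpus would yield one (alpha, beta) pair per year, which is the kind of parameter time series the abstract describes. For real corpus sizes a gradient-based optimizer would likely be preferable to the grid search used here.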
Contents:
1. Michal Křen: Recent Developments in the Czech National Corpus, p. 1
2. Dan Tufiş, Verginica Barbu Mititelu, Elena Irimia, Stefan Dumitrescu, Tiberiu Boros, Horia Nicolai Teodorescu: CoRoLa Starts Blooming – An update on the Reference Corpus of Contemporary Romanian Language, p. 5
3. Sebastian Buschjäger, Lukas Pfahler, Katharina Morik: Discovering Subtle Word Relations in Large German Corpora, p. 11
4. Johannes Graën, Simon Clematide: Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora, p. 15
5. Stefan Evert, Andrew Hardie: Ziggurat: A new data model and indexing format for large annotated text corpora, p. 21
6. Roland Schäfer: Processing and querying large web corpora with the COW14 architecture, p. 28
7. Jochen Tiepmar: Release of the MySQL-based implementation of the CTS protocol, p. 35