Corpus Linguistics
This paper describes DeReKo (Deutsches Referenzkorpus), the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS) in Mannheim, and the rationale behind its development. We discuss its design, its legal background, how to access it, the available metadata, linguistic annotation layers, underlying standards, ongoing developments, and aspects of using the archive for empirical linguistic research. The paper focuses on the advantages of DeReKo's design as a primordial sample from which virtual corpora can be drawn for the specific purposes of individual studies. Both concepts, primordial sample and virtual corpus, are explained and illustrated in detail. Furthermore, we describe how DeReKo deals with the fact that all its texts are subject to third parties' intellectual property rights, and how it addresses the issue of replicability, which is particularly challenging given DeReKo's dynamic growth and the possibility of constructing an open-ended number of virtual corpora from it.
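The idea of drawing a virtual corpus from a primordial sample can be illustrated with a minimal sketch. It is not DeReKo's actual interface: the `Text` record, its metadata fields, the `draw_virtual_corpus` helper, and all sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Text:
    """One archive text with illustrative (hypothetical) metadata fields."""
    text_id: str
    year: int
    country: str
    text_type: str
    tokens: int


def draw_virtual_corpus(archive: Iterable[Text],
                        criteria: Callable[[Text], bool]) -> List[Text]:
    """Select the texts that satisfy study-specific criteria.

    The result is sorted by text ID so that the same criteria applied to
    the same archive state always produce the same corpus definition.
    """
    return sorted((t for t in archive if criteria(t)), key=lambda t: t.text_id)


# Toy archive; every value here is invented for illustration.
archive = [
    Text("A01", 1998, "DE", "newspaper", 1200),
    Text("A02", 2005, "AT", "newspaper", 800),
    Text("A03", 2005, "DE", "fiction", 5000),
    Text("A04", 2010, "DE", "newspaper", 950),
]

# A virtual corpus for one hypothetical study: German newspapers since 2000.
german_news_since_2000 = draw_virtual_corpus(
    archive,
    lambda t: t.country == "DE" and t.text_type == "newspaper" and t.year >= 2000,
)
virtual_ids = [t.text_id for t in german_news_since_2000]
virtual_size = sum(t.tokens for t in german_news_since_2000)
```

Under these assumptions, storing the list of selected text IDs alongside a study is one simple way to keep a virtual corpus reproducible while the underlying archive continues to grow.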
Language resources are often compiled for the purpose of variational analysis, such as studying differences between genres, registers, and disciplines, regional and diachronic variation, or the influence of gender and cultural context. The sheer number of potentially interesting contrastive pairs can quickly become overwhelming because the possible combinations grow combinatorially. In this paper, we present an approach that combines well-understood visualization techniques (heatmaps and word clouds) with intuitive exploration paradigms (drill-down and side-by-side comparison) to facilitate the analysis of language variation in such highly combinatorial situations. Heatmaps assist in analyzing the overall pattern of variation in a corpus, and word clouds allow for inspecting variation at the level of words.
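As a rough sketch of the data behind such a heatmap, the following computes a pairwise distance matrix between subcorpora from their word frequencies. The abstract does not name a distance measure; Jensen-Shannon divergence is used here purely as an assumption, and the function names and toy subcorpora are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def relative_freqs(tokens: List[str]) -> Dict[str, float]:
    """Relative frequency of each word form in one subcorpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    div = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = (pw + qw) / 2  # mixture distribution
        if pw > 0:
            div += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            div += 0.5 * qw * math.log2(qw / mw)
    return div


def divergence_matrix(
    subcorpora: Dict[str, List[str]]
) -> Tuple[List[str], List[List[float]]]:
    """Return sorted subcorpus names and the pairwise divergence matrix."""
    names = sorted(subcorpora)
    dists = {n: relative_freqs(subcorpora[n]) for n in names}
    matrix = [[js_divergence(dists[a], dists[b]) for b in names] for a in names]
    return names, matrix


# Toy subcorpora, invented for illustration only.
subcorpora = {
    "news": "the minister said the vote".split(),
    "science": "the model predicts the result".split(),
    "fiction": "she smiled and walked away".split(),
}
names, matrix = divergence_matrix(subcorpora)
```

Each cell of `matrix` could be rendered as one heatmap tile. The per-word terms summed inside `js_divergence` indicate which words contribute most to a given contrast, which is the kind of information a word cloud would display after drilling down into a cell.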
Data Mining with Shallow vs. Linguistic Features to Study Diversification of Scientific Registers
(2014)
We present a methodology for analyzing the linguistic evolution of scientific registers with data mining techniques, comparing the insights gained from shallow vs. linguistic features. The focus is on selected scientific disciplines at the boundary with computer science (computational linguistics, bioinformatics, digital construction, microelectronics). The data basis is the English Scientific Text Corpus (SCITEX), which covers a time range of roughly thirty years (1970/80s to early 2000s) (Degaetano-Ortlieb et al., 2013; Teich and Fankhauser, 2010). In particular, we investigate the diversification of scientific registers over time. Our theoretical basis is Systemic Functional Linguistics (SFL) and its specific incarnation of register theory (Halliday and Hasan, 1985). In terms of methods, we combine corpus-based feature extraction with data mining techniques.
Newspapers became extremely popular in Germany during the 18th and 19th centuries, and thus increasingly influential for modern German. However, due to the lack of digitized historical newspaper corpora for German, this influence could not be analyzed systematically. In this paper, we introduce the Mannheim Corpus of Digital Newspapers and Magazines, which in its current release comprises 21 newspapers and magazines from the 18th and 19th centuries. With over 4.1 million tokens in about 650 volumes, it currently constitutes the largest historical corpus dedicated to newspapers in German. We briefly discuss the prospects of the corpus for analyzing the evolution of news as a genre in its own right and the influence of contextual parameters such as region and register on the language of news. We then focus on one historically influential aspect of newspapers: their role in disseminating foreign words in German. Our preliminary quantitative results indicate that newspapers use foreign words significantly more frequently than other genres, in particular belles lettres.
The TEI has served for many years as a mature annotation format for corpora of different types, including linguistically annotated data. Although it is based on the consensus of a large community, it does not have the legal status of a standard. During the last decade, efforts have been undertaken to develop definitive de jure standards for linguistic data that not only act as a normative basis for the exchange of language corpora but also address recent advancements in technology, such as web-based standards, and the use of large and multiply annotated corpora.
In this article, we provide an overview of the process of international standardization and discuss some of the international standards currently being developed under the auspices of ISO/TC 37, a technical committee called “Terminology and other Language and Content Resources”. We then discuss the relationship between the TEI Guidelines and these specifications with respect to their formal model, notation format, and annotation model. The paper concludes with recommendations for dealing with language corpora.