Korpuslinguistik und interdisziplinäre Perspektiven auf Sprache | Corpus Linguistics and Interdisciplinary Perspectives on Language | CLIP
Inducing linguistic networks from historical corpora. Towards a new method in historical semantics
(2013)
In this paper, we experiment with exploring linguistic networks as a new method in historical semantics. Our starting point is a long-term historical corpus (i.e., the Patrologia Latina) which we analyse regarding the conceptual stability of a key concept in medieval literature (i.e., virtus). Most analyses in historical semantics explore small data sets by focusing on narrow contexts of lexical usages, but we propose a more comprehensive method based on lexical networks that represent the underlying documents as a whole. We demonstrate both the topological stability of document-based lexical networks and their usefulness in providing empirical evidence in historical semantics.
We present LatinISE, a Latin corpus for the Sketch Engine. LatinISE consists of Latin works comprising a total of 13 million words, covering the time span from the 2nd century BC to the 21st century AD. LatinISE is provided with rich metadata mark-up, including author, title, genre, era, date and century, as well as book, section, paragraph and line of verses. We have automatically annotated LatinISE with lemma and part-of-speech information, enabling users to search the corpus with a number of criteria, ranging from lemma, part-of-speech, context, to subcorpora defined chronologically or by genre. We also illustrate word sketches, one-page summaries of a word’s corpus-based collocational behaviour. Our future plan is to produce word sketches for Latin words by adding richer morphological and syntactic annotation to the corpus.
The project Referenzkorpus Altdeutsch (‘Old German Reference Corpus’) aims to establish a deeply-annotated text corpus of all extant Old German texts. In order to minimize manual work, an important target was to automate the retrieval of as much data as possible from existing sources. Whilst the texts themselves were already available in a digital form, the annotation data could to a large extent be found within a set of glossaries associated with each text. After digitizing these, the information contained in them could be automatically and semi-automatically linked to the texts. Subsequent manual editing focuses on any remaining gaps and misattributions, rejecting inapplicable alternatives and adjusting details to the annotation standards of the project. Throughout the process, various problems have been encountered that require special attention to find particular solutions.
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a static lexicon indexed by orthographic form. Canonicalization approaches seek to address these issues by assigning an extant equivalent to each word of the input text and deferring application analysis to these canonical forms. Quantitative evaluation of canonicalization techniques in terms of precision and recall requires reference to a ground-truth corpus in which the canonical form for each corpus token has been manually verified, but such manually annotated corpora are difficult to come by and in general both costly and time-consuming to create. In this paper, we describe a method for bootstrapping a ground-truth canonicalization corpus with minimal manual annotation effort by means of automatic alignment of historical texts with current editions of the same texts, coupled with a two-phase manual review process.
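To make the alignment idea in this abstract concrete, the following is a minimal sketch, not the authors' method: it aligns a historical token sequence with a modern edition using Python's standard `difflib.SequenceMatcher` and flags non-identical pairs for manual review. The helper name `align_tokens`, the status labels, the whitespace tokenization and the two sample sentences are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def align_tokens(historical, modern):
    """Pair each historical token with a candidate canonical form.

    Hypothetical helper: a generic token alignment, not the procedure
    described in the paper.
    """
    matcher = SequenceMatcher(a=historical, b=modern, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical spelling in both editions: nothing to review.
            pairs.extend((tok, tok, "identical") for tok in historical[i1:i2])
        elif tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of tokens on both sides: pair them one-to-one
            # and hand the pairs to a human reviewer.
            pairs.extend(
                (h, m, "review")
                for h, m in zip(historical[i1:i2], modern[j1:j2])
            )
        else:
            # Deletions and unequal replacements get no candidate form.
            pairs.extend((tok, None, "unaligned") for tok in historical[i1:i2])
    return pairs

# Invented sample sentences (assumption), tokenized on whitespace.
historical = "die Thür war offen vnd er gieng hinein".split()
modern = "die Tür war offen und er ging hinein".split()
pairs = align_tokens(historical, modern)
```

In this sketch only the pairs marked "review" or "unaligned" would need a human decision, which is the sense in which alignment can reduce manual annotation effort.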
The Kassel Corpus of Clause Linking is part of a larger project on the grammar of New High German led by Vilmos Ágel. The project takes as its starting point the assumption that equal consideration of both oral and written language is essential in order to understand developments in grammar during the New High German period. The Kassel Corpus of Clause Linking includes four texts from the period 1650-1700 and four texts from 1850-1900, with three texts from each period exemplifying oracy (immediacy) and one text literacy (distance). The corpus was annotated for grammatical features which are relevant for clause linking, such as predicates, subjects and connectors. A major issue in the process of annotation was to identify correlations between single grammatical features and types of clause linking, such as coordination, subordination and ellipsis. In the paper we will explain the principles which formed the basis for compiling and annotating the corpus, and illustrate how correlations between single grammatical features and types of clause linking may be established. We will also provide an example (non-integrative ellipsis) of how the annotated corpus reveals differences between immediacy and distance, as well as showing historical developments.
Anglo-Latin and Old English. A case for integrated bilingual corpus studies of Anglo-Saxon registers
(2013)
This article describes Anglo-Latin and Old English as two codes correlated in Anglo-Saxon England with the same cultural elite. Introducing a taxonomy of Anglo-Saxon registers, it claims that Anglo-Latin material can supplement our knowledge of early Old English lexis. A corpus of Medieval Latin from Anglo-Saxon Sources is advocated as a new electronic resource to facilitate bilingual studies in this field.
Although the alternation between was and were has been extensively recorded in modern varieties of British and overseas English, there is comparatively little information about the distribution of was and were in older varieties of speech. This has been largely due both to the scarcity of old regional material, and the consequent lack of diachronic dialect corpora. In light of this, this paper looks at some of the Lancashire texts included in the Salamanca Corpus. It examines the evidence provided by literary representations of the dialect with regard to past tense BE forms. Though largely neglected for linguistic investigation, literary samples of Lancashire English may go some way towards casting light on the forms of BE in the county between 1700 and 1900. Our aim is thus twofold: firstly, to contribute to previous research into past tense BE forms in Lancashire by adding historical data that have not been thus far considered, and secondly, to illustrate the linguistic possibilities of the corpus, arguing that it may serve as a complementary missing link to expand the database of English diachronic dialectology.
This paper examines genitive variation in English, using two methodological approaches. In the manual approach, we extract genitive variants from the parsed subcorpora of the text category J (academic writing) in the B-Brown (1931), the Brown (1961) and the Frown (1991/2) corpora. Focussing on the syntactic parameter, we illustrate how the principle of end-weight gains ground from 1930 to 1990. The automatic approach implements the constraints of the manual approach, confirms the findings of the manual approach and is used to scale to British English. Methodologically, we show how to automatically sift out irrelevant corpus examples whose identification would normally need human intervention - in particular, apparent examples of the two main English genitive constructions which are not in genuine alternation.
This paper draws attention to the newly available corpus resources for the study of Older Scots, and to the application of the lexical bundles method (Biber et al. 1999) in historical specialized discourse. The discussion concentrates on the method adopted from present-day corpus research, which illuminates historical questions which have so far proved unanswerable, e.g. which multi-word elements in text are stable and repetitive. I applied lexical bundles to legal and administrative texts written in Scots, to observe the degree of formulaicity in early specialized discourse. The results of the study show that the Scottish documents contain highly formulaic long lexical bundles (8-grams and 7-grams) when juxtaposed with other specialized discourse texts, such as the Bible. Similarly, shorter bundles helped to identify the impressive degree of formulaicity in comparison to speech-based legal genres, such as trials and depositions (Culpeper/Kytö 2010).
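As a brief illustration of the lexical bundles method named in this abstract: a lexical bundle is a recurrent word n-gram, so a basic extraction amounts to counting n-grams and keeping those that reach a frequency threshold. The sketch below is not the author's procedure; the helper name `lexical_bundles`, the whitespace tokenization, the threshold of two occurrences and the sample string are assumptions chosen for illustration.

```python
from collections import Counter

def lexical_bundles(text, n, min_freq=2):
    """Return word n-grams occurring at least min_freq times.

    Hypothetical helper; real studies usually also normalize frequencies
    per million words and require occurrence across several texts.
    """
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    ngrams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(ngrams)
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}

# Invented sample text (assumption), repeating a formulaic opening.
sample = (
    "be it kend till all men be it kend till all men "
    "be thir present letters"
)
bundles = lexical_bundles(sample, n=4)
```

Raising `n` to 7 or 8 on a larger corpus would surface the long bundles the abstract refers to; on a sample this small only the 4-grams repeat.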
Dictionaries and collections of proverbs, idioms, or phrasemes usually provide synchronic information with only little evidence of actual use. While various extensive dictionaries and collections are available for German, a comprehensive description of structural and semantic changes of phrasemes over time is still lacking. Our article highlights some issues and challenges, and presents a semiautomatic corpus-based approach for the diachronic investigation of phraseme development. We argue for a combination of dictionary exploration and corpus-based methods, to provide reliable information about the diachronic development of German phrasemes.