Korpuslinguistik und interdisziplinäre Perspektiven auf Sprache | Corpus Linguistics and Interdisciplinary Perspectives on Language | CLIP
This paper examines genitive variation in English, using two methodological approaches. In the manual approach, we extract genitive variants from the parsed subcorpora of the text category J (academic writing) in the B-Brown (1931), the Brown (1961) and the Frown (1991/2) corpora. Focussing on the syntactic parameter, we illustrate how the principle of end-weight gains ground from 1930 to 1990. The automatic approach implements the constraints of the manual approach, confirms its findings, and is used to scale the analysis to British English. Methodologically, we show how to automatically sift out irrelevant corpus examples whose identification would normally need human intervention - in particular, apparent examples of the two main English genitive constructions which are not in genuine alternation.
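As a rough illustration of the end-weight principle mentioned in this abstract, the following minimal sketch compares the length of possessor and possessum and reports which genitive variant would place the heavier constituent last. The function names and the use of word count as a weight measure are assumptions made for illustration; they are not taken from the paper.

```python
def weight(phrase: str) -> int:
    """Approximate syntactic weight as the number of whitespace-separated words (an assumption)."""
    return len(phrase.split())


def end_weight_prefers(possessor: str, possessum: str) -> str:
    """Return the variant that puts the heavier constituent last.

    s-genitive:  possessor precedes possessum ("the king's crown")
    of-genitive: possessum precedes possessor ("the crown of the king")
    """
    possessor_weight = weight(possessor)
    possessum_weight = weight(possessum)
    if possessor_weight > possessum_weight:
        return "of-genitive"
    if possessor_weight < possessum_weight:
        return "s-genitive"
    return "no preference"


if __name__ == "__main__":
    print(end_weight_prefers("the man who lives next door", "house"))
    print(end_weight_prefers("John", "collection of rare old books"))
```

Word count is only one possible proxy for weight; a study of this kind could equally use syllables or syntactic nodes.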
The majority of language corpora available to date that cover the Late Modern English period (1700-1900) contain samples of writing by the classically educated layers of society. It is this kind of data that the ‘standard’ history of the English language has been based on. The labouring poor formed the greater part of the population (60-70%) during the Late Modern English period and, though many of them could not write (compulsory elementary schooling was only introduced in 1870), by about 1800 so many could write something that they formed the majority of those normally called ‘literate’. A unique insight into the language use of the labouring poor has been provided through the laws for poor relief, which gave paupers the opportunity to apply for relief from parish funds during the period 1795-1834. For the last 18 years Tony Fairman has collected poor relief application letters from archives of English County Record Offices. This paper describes the data and the compilation principles of the letter collection, as well as the challenges involved in the conversion of the letter collection into a searchable corpus.
Research infrastructures for the Humanities can help to share digital resources and content services. In particular, they can help researchers in the Digital Humanities to save time and effort when developing software to deal with specific research issues. Web services and web applications can be used to build a research infrastructure for sharing data and algorithms. However, the development of such infrastructures and their key software components is a software engineering task that increasingly also poses interesting and challenging research problems for Computer Science.
The Georgian Dialect Corpus (GDC) covers a significant segment of the spoken language of Georgia. It is conceived as a sub-corpus of the Georgian National Corpus and is designed for wide interdisciplinary research. Since 2006, the project has been funded by the Shota Rustaveli National Science Foundation.
With its structure, the GDC represents a wide spectrum of regional, temporal and stylistic variations of the Georgian linguistic reality. It contains texts from all Georgian dialects (including the dialects spread in Iran, Turkey, and Azerbaijan); intensive work on a corpus of Laz texts is underway.
Currently, we are working on the elaboration of a morphological annotation concept. In this process, the first step is lemmatization. While automatic lemmatization is an easily solvable problem in corpora of standard languages with exhaustive morphological descriptions, it is a rather difficult task in a dialect corpus containing a comprehensive collection of texts from up to twenty dialects. Therefore it is undertaken manually in most dialect corpora. In our concept, we apply a lexicographical data pool and a standard language parser within a semi-automatic annotation process. The lemmatization process is then based on the standard form, with dialect lemmata and standard lemmata being “deemed equal”. Implementing this presupposes the manual lemmatization of a certain amount of dialect texts.
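The semi-automatic workflow described in this abstract might be sketched roughly as follows: a dialect form is first mapped to its standard form via a lexicographical data pool, the standard form is then lemmatized, and anything that cannot be resolved is queued for manual lemmatization. All names and table entries here are hypothetical placeholders, and the dictionary lookup merely stands in for a real standard-language parser; none of this reflects the actual GDC implementation.

```python
from typing import List, Optional

# Hypothetical data pool: dialect form -> standard form (placeholder entries, not GDC data).
DIALECT_TO_STANDARD = {"dialform1": "stdform1"}

# Hypothetical stand-in for the standard-language parser: standard form -> lemma.
STANDARD_LEMMAS = {"stdform1": "lemma1"}


def parse_standard(form: str) -> Optional[str]:
    """Return the lemma of a standard form, or None if the form is unknown."""
    return STANDARD_LEMMAS.get(form)


def lemmatize(token: str, manual_queue: List[str]) -> Optional[str]:
    """Lemmatize a dialect token via its standard form.

    Tokens that cannot be resolved automatically are appended to
    manual_queue so that an annotator can lemmatize them by hand.
    """
    standard_form = DIALECT_TO_STANDARD.get(token, token)
    lemma = parse_standard(standard_form)
    if lemma is None:
        manual_queue.append(token)
    return lemma


if __name__ == "__main__":
    queue: List[str] = []
    for tok in ["dialform1", "stdform1", "unknownform"]:
        print(tok, "->", lemmatize(tok, queue))
    print("needs manual lemmatization:", queue)
```

In such a setup, each manually lemmatized token could be added back to the data pool, so that the share of automatically resolved tokens grows over time.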
The aim of the present paper is to present and discuss a work in progress that involves:
- the creation of online editions of historical documents of a metalinguistic nature, which function both as publications and corpora, allowing for the comparison of manuscript images with the diplomatic edition and providing tools for analysis;
- the application and development of tools that can easily be manipulated by users and adapted to different kinds of historical texts.
The project is still in its first phase, which involves inventorying the metalinguistic texts held by the Evora Public Library (BPE). A survey of the texts of this nature identified in the library's various catalogues has been carried out. So far, 43 manuscripts and 200 printed texts of metalinguistic interest, all from the reserved catalogues of the BPE, have been identified. In the old reading room catalogue, a further 313 works were identified, while the modern catalogue is yet to be studied. Once the inventory is complete, a catalogue identifying and describing (bibliographical description) the works of a metalinguistic nature held by the BPE will be organized and published online. Digital processing of the texts will begin after these tasks have been completed.
This paper investigates the formal, syntactic and discourse-pragmatic properties of clauses in which the division into a predicational base, or topic of the utterance, and a comment on this topic, fails to apply. Since Kuroda (1972), this property has been claimed to hold for sentences representing the so-called thetic type of judgment, as opposed to categorical sentences, which display a bipartite division into an aboutness topic and a comment added to this topic. Cross-linguistically, this property is linked to a number of universal formal properties such as the use of semantically empty copula or existential verbs, indefinite subject expressions, and verb-subject order (VS). We compare the properties of thetic judgments in Old Germanic and Old Romance and examine the similarities between prototypical thetic sentences involving novel referents and VS-clauses with given postverbal subjects. Our claim is that the latter are instances of theticity in the broader sense of this notion.
The relative order of dative and accusative objects in older German is less free than it is today. The reason for this could be that speakers of the direct predecessor of Old High German organized the referents according to the Thematic Hierarchy. If one applies a Case Hierarchy Nom>Acc>Dat to this, the order Nom - Dat - Acc falls out. It becomes apparent that the status of the Thematic Hierarchy is not a factor governing underlying word order, but a factor inducing scrambling. Arguments from binding theory, whose validity is discussed, indicate that the underlying order is ‘accusative before dative’.
Multi-faceted alignment. Toward automatic detection of textual similarity in Gospel-derived texts
(2015)
Ancient Germanic Bible-derived texts stand in as test material for producing computational means for automatically determining where textual contamination and linguistic interference have influenced the translation process. This paper reports on the results of research efforts that produced a text corpus; a method for decomposing the texts involved into smaller, more directly comparable thematically-related chunks; a database of relationships between these chunks; and a user-interface allowing for searches based on various referential criteria. Finally, the state of the product at the end of the project is discussed, namely as it was handed over to another researcher who has extended it to automatically find semantic and syntactic similarities within comparable chunks.
The project Referenzkorpus Altdeutsch (‘Old German Reference Corpus’) aims to establish a deeply-annotated text corpus of all extant Old German texts. As the automated part-of-speech and morphological pre-annotation is amended by hand, a quality control system for the results seems a desirable objective. To this end, standardized inflectional forms, generated using the morphological information, are compared with the attested word forms. Their creation is described by way of example for the Old High German part of the corpus. As is shown, in a few cases, some features of the attested word forms are also required in order to determine as exactly as possible the shape of the inflected lemma form to be created.
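The quality-control idea in this abstract, generating a standardized inflected form from the annotation and comparing it with the attested form, could look roughly like the sketch below. The ending table, the `Token` fields and the strict string comparison are simplifying assumptions made for illustration (a real comparison would have to tolerate spelling variation); they do not describe the project's actual system.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical, heavily simplified ending table: (inflection class, case, number) -> ending.
ENDINGS: Dict[Tuple[str, str, str], str] = {
    ("a-stem", "nom", "sg"): "",
    ("a-stem", "gen", "sg"): "es",
    ("a-stem", "dat", "sg"): "e",
}


@dataclass
class Token:
    attested: str    # word form as found in the text
    stem: str        # stem of the annotated lemma
    infl_class: str  # annotated inflection class
    case: str        # annotated case
    number: str      # annotated number


def standardized_form(tok: Token) -> str:
    """Build the expected inflected form from the morphological annotation."""
    return tok.stem + ENDINGS[(tok.infl_class, tok.case, tok.number)]


def flag_mismatches(tokens: List[Token]) -> List[Token]:
    """Return tokens whose attested form differs from the generated form."""
    return [t for t in tokens if standardized_form(t) != t.attested]


if __name__ == "__main__":
    sample = [
        Token("tages", "tag", "a-stem", "gen", "sg"),  # annotation consistent with form
        Token("tage", "tag", "a-stem", "gen", "sg"),   # looks like a dative: flagged
    ]
    for t in flag_mismatches(sample):
        print("check annotation:", t.attested, "expected", standardized_form(t))
```

A flagged token does not necessarily mean the annotation is wrong; it only marks a place where a human annotator should take a second look.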
The availability of electronic corpora of historical stages of languages has been welcomed as possibly attenuating the inherent problem of diachronic linguistics, i.e. that we only have access to what has chanced to come down to us - the problem which was memorably named by Labov (1992) as one of “Bad Data”. However, such corpora can only give us access to an increased amount of historical material, and this can essentially still only be a partial and possibly distorted picture of the actual language at a particular period of history. Corpora can be improved by taking a more representative sample of extant texts if these are available (as they are in significant number for periods after the invention of printing). But, as examples from the recently compiled GerManC corpus of seventeenth and eighteenth century German show, the evidence from such corpora can still fail to yield definitive answers to our questions about earlier stages of a language. The data still require expert interpretation, and it is important to be realistic about what can legitimately be expected from an electronic historical corpus.