Refine
Year of publication
- 2015 (52) (remove)
Document Type
- Conference Proceeding (17)
- Part of a Book (15)
- Article (9)
- Book (4)
- Other (3)
- Working Paper (3)
- Master's Thesis (1)
Keywords
- Korpus <Linguistik> (52) (remove)
Publicationstate
- Veröffentlichungsversion (25)
- Zweitveröffentlichung (6)
- Preprint (1)
Reviewstate
Publisher
- Institut für Deutsche Sprache (13)
- Narr (6)
- De Gruyter (2)
- German Society for Computational Linguistics & Language Technology (GSCL) (2)
- Gesellschaft für Sprachtechnologie and Computerlinguistik e.V. (2)
- Lang (2)
- Narr Francke Attempto (2)
- Association for Computational Linguistics (1)
- Association for Computational Linguistics ( ACL ); Curran Associates, Inc. (1)
- Dictionary Society of North America (1)
The project Referenzkorpus Altdeutsch (‘Old German Reference Corpus’) aims to es- tablish a deeply-annotated text corpus of all extant Old German texts. As the automated part-of-speech and morphological pre-annotation is amended by hand, a quality control system for the results seems a desirable objective. To this end, standardized inflectional forms, generated using the morphological information, are compared with the attested word forms. Their creation is described by way of example for the Old High German part of the corpus. As is shown, in a few cases, some features of the attested word forms are also required in order to determine as exactly as possible the shape of the inflected lemma form to be created.
The availability of electronic corpora of historical stages of languages has been wel- comed as possibly attenuating the inherent problem of diachronic linguistics, i.e. that we only have access to what has chanced to come down to us - the problem which was memorably named by Labov (1992) as one of “Bad Data”. However, such corpora can only give us access to an increased amount ot historical material and this can essentially still only be a partial and possibly distorted picture of the actual language at a particular period of history. Corpora can be improved by taking a more representative sample of extant texts if these are available (as they are in significant number for periods after the invention of printing). But, as examples from the recently compiled GerManC corpus of seventeenth and eighteenth century German show, the evidence from such corpora can still fail to yield definitive answers to our questions about earlier stages of a language. The data still require expert interpretation, and it is important to be realistic about what can legitimately be expected from an electronic historical corpus.
Multi-faceted alignment. Toward automatic detection of textual similarity in Gospel-derived texts
(2015)
Ancient Germanic Bible-derived texts stand in as test material for producing computational means for automatically determining where textual contamination and linguistic interference have influenced the translation process. This paper reports on the results of research efforts that produced a text corpus; a method for decomposing the texts involved into smaller, more directly comparable thematically-related chunks; a database of relationships between these chunks; and a user-interface allowing for searches based on various referential criteria. Finally, the state of the product at the end of the project is discussed, namely as it was handed over to another researcher who has extended it to automatically find semantic and syntactic similarities within comparable chunks.
In this paper we present some preliminary considerations concerning the possibility of automatic parsing an annotated corpus for N-N compounds. This should in prin- ciple be possible at least for relational and stereotype compounds, if the lemmatization of the corpus connects the lemmata with lexical entries as described in Höhle (1982). These lexical entries then supply the necessary information about the argument structure of a relational noun or about the stereotypical purpose associated with the noun’s referent which can be used to establish a relation between the first and the head constituent of the compound.
The relative order of dative and accusative objects in older German is less free than it is today. The reason for this could be that speakers of the direct predecessor of Old High German organized the referents according to the Thematic Hierarchy. If one applies a Case Hierarchy Nom>Acc>Dat to this, the order Nom - Dat - Acc falls out. It becomes apparent that the status of the Thematic Hierarchy is not a factor governing underlying word order, but a factor inducing scrambling. Arguments from binding theory, whose validity is discussed, indicate that the underlying order is ‘accusative before dative’
This paper investigates evidence for linguistic coherence in new urban dialects that evolved in multiethnic and multilingual urban neighbourhoods. We propose a view of coherence as an interpretation of empirical observations rather than something that would be ‘‘out there in the data’’, and argue that this interpretation should be based on evidence of systematic links between linguistic phenomena, as established by patterns of covariation between phenomena that can be shown to be related at linguistic levels. In a case study, we present results from qualitative and quantitative analyses for a set of phenomena that have been described for Kiezdeutsch, a new dialect from multilingual urban Germany. Qualitative analyses point to linguistic relationships between different phenomena and between pragmatic and linguistic levels. Quantitative analyses, based on corpus data from KiDKo (www.kiezdeutschkorpus.de), point to systematic advantages for the Kiezdeutsch data from a multiethnic and multilingual context provided by the main corpus (KiDKo/Mu), compared to complementary corpus data from a mostly monoethnic and monolingual (German) context (KiDKo/Mo). Taken together, this indicates patterns of covariation that support an interpretation of coherence for this new dialect: our findings point to an interconnected linguistic system, rather than to a mere accumulation of individual features. In addition to this internal coherence, the data also points to external coherence: Kiezdeutsch is not disconnected on the outside either, but fully integrated within the general domain of German, an integration that defies a distinction of ‘‘autochthonous’’ and ‘‘allochthonous’’ German, not only at the level of speakers, but also at the level of linguistic systems.
This paper aims at showing how quantitative corpus linguistic analysis can inform qualitative analysis of digital media discourse with respect to the mediality of language in use. Using the example of protest discourse in Twitter, in the field of anti-Islamic ‘Pegida’ demonstrations, a three-step method of collecting, reducing and interpreting salient data is proposed. Each step is aligned with operative medial features of the microblog: hashtags, retweets and @-interactions. The exemplary analysis reveals the importance of discussions of attendance numbers in protest discourse and the asymmetry between administrative (i.e. the police) and non-administrative discourse agents. Furthermore, it exemplifies how frequency analysis and sequence analysis can be combined for research in media linguistics.
Feedback utterances are among the most frequent in dialogue. Feedback is also a crucial aspect of all linguistic theories that take social interaction involving language into account. However, determining communicative functions is a notoriously difficult task both for human interpreters and systems. It involves an interpretative process that integrates various sources of information. Existing work on communicative function classification comes from either dialogue act tagging where it is generally coarse grained concerning the feed- back phenomena or it is token-based and does not address the variety of forms that feed- back utterances can take. This paper introduces an annotation framework, the dataset and the related annotation campaign (involving 7 raters to annotate nearly 6000 utterances). We present its evaluation not merely in terms of inter-rater agreement but also in terms of usability of the resulting reference dataset both from a linguistic research perspective and from a more applicative viewpoint.
Feedback utterances are among the most frequent in dialogue. Feedback is also a crucial aspect of linguistic theories that take social interaction, involving language, into account. This paper introduces the corpora and datasets of a project scrutinizing this kind of feedback utterances in French. We present the genesis of the corpora (for a total of about 16 hours of transcribed and phone force-aligned speech) involved in the project. We introduce the resulting datasets and discuss how they are being used in on-going work with focus on the form-function relationship of conversational feedback. All the corpora created and the datasets produced in the framework of this project will be made available for research purposes.
Ph@ttSessionz and Deutsch heute are two large German speech databases. They were created for different purposes: Ph@ttSessionz to test Internet-based recordings and to adapt speech recognizers to the voices of adolescent speakers, Deutsch heute to document regional variation of German. The databases differ in their recording technique, the selection of recording locations and speakers, elicitation mode, and data processing.
In this paper, we outline how the recordings were performed, how the data was processed and annotated, and how the two databases were imported into a single relational database system. We present acoustical measurements on the digit items of both databases. Our results confirm that the elicitation technique affects the speech produced, that f0 is quite comparable despite different recording procedures, and that large speech technology databases with suitable metadata may well be used for the analysis of regional variation of speech.