Ph@ttSessionz and Deutsch heute are two large German speech databases. They were created for different purposes: Ph@ttSessionz to test Internet-based recordings and to adapt speech recognizers to the voices of adolescent speakers, Deutsch heute to document regional variation of German. The databases differ in their recording technique, the selection of recording locations and speakers, elicitation mode, and data processing.
In this paper, we outline how the recordings were performed, how the data was processed and annotated, and how the two databases were imported into a single relational database system. We present acoustical measurements on the digit items of both databases. Our results confirm that the elicitation technique affects the speech produced, that f0 is quite comparable despite different recording procedures, and that large speech technology databases with suitable metadata may well be used for the analysis of regional variation of speech.
In this paper, a method for measuring synchronic corpus (dis-)similarity put forward by Kilgarriff (2001) is adapted and extended to identify trends and correlated changes in diachronic text data, using the Corpus of Historical American English (Davies 2010a) and the Google Ngram Corpora (Michel et al. 2010a). This paper shows that this fully data-driven method, which extracts word types that have undergone the most pronounced change in frequency in a given period of time, is computationally very cheap and that it allows interpretations of diachronic trends that are both intuitively plausible and motivated from the perspective of information theory. Furthermore, it demonstrates that the method is able to identify correlated linguistic changes and diachronic shifts that can be linked to historical events. Finally, it can help to improve diachronic POS tagging and complement existing NLP approaches. This indicates that the approach can facilitate an improved understanding of diachronic processes in language change.
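The idea of extracting the word types whose frequency changed most between two periods can be illustrated with a minimal sketch. It assumes a per-word chi-square contribution as the change score, which is one common reading of chi-square-based corpus comparison and not necessarily the exact formulation used in the paper. The function names and the toy data are hypothetical.

```python
from collections import Counter


def chi_square_contributions(counts_a, counts_b):
    """Return each word's contribution to the chi-square statistic of a
    2 x V table (two periods, V word types). Assumes both periods are
    non-empty. This scoring is an illustrative assumption."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand_total = total_a + total_b
    scores = {}
    for word in set(counts_a) | set(counts_b):
        observed_a = counts_a.get(word, 0)
        observed_b = counts_b.get(word, 0)
        row_total = observed_a + observed_b
        # Expected counts if the word were equally likely in both periods.
        expected_a = row_total * total_a / grand_total
        expected_b = row_total * total_b / grand_total
        scores[word] = ((observed_a - expected_a) ** 2 / expected_a
                        + (observed_b - expected_b) ** 2 / expected_b)
    return scores


def top_changes(tokens_a, tokens_b, n=3):
    """Rank word types by how strongly their frequency differs."""
    scores = chi_square_contributions(Counter(tokens_a), Counter(tokens_b))
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical toy data standing in for two time slices of a corpus.
period_1 = "the horse and the carriage and the horse".split()
period_2 = "the car and the engine and the car".split()
print(top_changes(period_1, period_2, n=2))
```

A single pass over two frequency lists is enough, which is in line with the abstract's remark that the method is computationally cheap.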
Feedback utterances are among the most frequent in dialogue. Feedback is also a crucial aspect of linguistic theories that take social interaction involving language into account. This paper introduces the corpora and datasets of a project scrutinizing this kind of feedback utterance in French. We present the genesis of the corpora involved in the project (about 16 hours of transcribed and phone force-aligned speech in total). We introduce the resulting datasets and discuss how they are being used in ongoing work with a focus on the form-function relationship of conversational feedback. All the corpora created and the datasets produced in the framework of this project will be made available for research purposes.
This paper discusses computational linguistic methods for the semi-automatic analysis of modality interdependencies (MID; the combination of complex resources such as speaking, writing, and visualizing) in professional cross-situational interaction settings. The overall purpose of the approach is to develop models, methods, and a framework for the description and analysis of MID forms and functions. The paper describes work in progress: the development of an annotation framework that allows annotating different data and file formats at various levels, relating annotation levels and entries independently of the given file format, and visualizing patterns.
Feedback utterances are among the most frequent in dialogue. Feedback is also a crucial aspect of all linguistic theories that take social interaction involving language into account. However, determining communicative functions is a notoriously difficult task both for human interpreters and systems. It involves an interpretative process that integrates various sources of information. Existing work on communicative function classification either comes from dialogue act tagging, where it is generally coarse-grained concerning the feedback phenomena, or is token-based and does not address the variety of forms that feedback utterances can take. This paper introduces an annotation framework, the dataset and the related annotation campaign (involving 7 raters to annotate nearly 6000 utterances). We present its evaluation not merely in terms of inter-rater agreement but also in terms of usability of the resulting reference dataset both from a linguistic research perspective and from a more applicative viewpoint.
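Inter-rater agreement among several raters is often summarized with a chance-corrected coefficient. The abstract does not say which coefficient was used, so the following is only a sketch that assumes Fleiss' kappa, with a hypothetical rating table.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table where table[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same
    number of ratings, and at least two raters are required. Undefined
    (division by zero) if all ratings fall into a single category."""
    n_items = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(table[0])

    # Per-item agreement: share of rater pairs that agree on the item.
    per_item = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical example: 4 utterances, 7 raters, 3 function labels.
ratings = [[7, 0, 0], [0, 7, 0], [4, 3, 0], [2, 2, 3]]
print(round(fleiss_kappa(ratings), 3))
```

As the abstract stresses, such a coefficient covers only one aspect of evaluation; the usability of the resulting reference dataset is assessed separately.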
The project Referenzkorpus Altdeutsch (‘Old German Reference Corpus’) aims to establish a deeply annotated text corpus of all extant Old German texts. As the automated part-of-speech and morphological pre-annotation is amended by hand, a quality control system for the results seems a desirable objective. To this end, standardized inflectional forms, generated using the morphological information, are compared with the attested word forms. Their creation is described by way of example for the Old High German part of the corpus. As is shown, in a few cases, some features of the attested word forms are also required in order to determine as exactly as possible the shape of the inflected lemma form to be created.
Usenet is a large online resource containing user-generated messages (news articles) organised in discussion groups (newsgroups) which deal with a wide variety of different topics. We describe the download, conversion, and annotation of a comprehensive German news corpus for integration in DeReKo, the German Reference Corpus hosted at the Institut für Deutsche Sprache in Mannheim.
The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, if we can exploit the data properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, lemmatisation, chunking, and dependency parsing facilitates precise querying of linguistic properties and can be used to extend word alignment to sub-sentential groups. Such highly interconnected data is stored in a relational database to allow for efficient retrieval and linguistic data mining, which may include the statistics-based selection of good example sentences. The varying information needs of contrastive linguists require a flexible linguistic query language for ad hoc searches. Such queries in the format of generalised treebank query languages will be automatically translated into SQL queries.
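The step from a linguistic query to SQL over aligned, annotated tokens can be sketched as follows. The schema (`token`, `alignment`), the attribute names, and the function `to_sql` are hypothetical and stand in for whatever database layout and query language the project actually uses. The sketch only handles simple attribute constraints on a source token.

```python
import sqlite3

# Attributes a query may constrain; whitelisting keeps column names safe.
ALLOWED = {"lang", "form", "lemma", "pos"}


def to_sql(constraints):
    """Translate attribute constraints on a source token into a
    parameterized SQL query that returns its aligned target tokens."""
    for key in constraints:
        if key not in ALLOWED:
            raise ValueError(f"unknown attribute: {key}")
    where = " AND ".join(f"s.{key} = ?" for key in constraints) or "1 = 1"
    sql = (
        "SELECT s.form, t.lang, t.form FROM token AS s "
        "JOIN alignment AS a ON a.src_id = s.id "
        "JOIN token AS t ON t.id = a.tgt_id "
        f"WHERE {where} ORDER BY t.lang, t.form"
    )
    return sql, list(constraints.values())


def demo_db():
    """Build a tiny in-memory database with made-up aligned tokens."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE token (id INTEGER PRIMARY KEY, lang TEXT, "
        "form TEXT, lemma TEXT, pos TEXT);"
        "CREATE TABLE alignment (src_id INTEGER, tgt_id INTEGER);"
    )
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?, ?)", [
        (1, "en", "houses", "house", "NOUN"),
        (2, "de", "Häuser", "Haus", "NOUN"),
        (3, "fr", "maisons", "maison", "NOUN"),
        (4, "en", "runs", "run", "VERB"),
        (5, "de", "läuft", "laufen", "VERB"),
    ])
    conn.executemany("INSERT INTO alignment VALUES (?, ?)",
                     [(1, 2), (1, 3), (4, 5)])
    return conn


conn = demo_db()
sql, params = to_sql({"lang": "en", "pos": "NOUN"})
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Generating parameterized SQL rather than interpolating values keeps user-supplied query terms out of the statement text, which matters when the queries are written ad hoc by end users.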
This paper investigates evidence for linguistic coherence in new urban dialects that evolved in multiethnic and multilingual urban neighbourhoods. We propose a view of coherence as an interpretation of empirical observations rather than something that would be “out there in the data”, and argue that this interpretation should be based on evidence of systematic links between linguistic phenomena, as established by patterns of covariation between phenomena that can be shown to be related at linguistic levels. In a case study, we present results from qualitative and quantitative analyses for a set of phenomena that have been described for Kiezdeutsch, a new dialect from multilingual urban Germany. Qualitative analyses point to linguistic relationships between different phenomena and between pragmatic and linguistic levels. Quantitative analyses, based on corpus data from KiDKo (www.kiezdeutschkorpus.de), point to systematic advantages for the Kiezdeutsch data from a multiethnic and multilingual context provided by the main corpus (KiDKo/Mu), compared to complementary corpus data from a mostly monoethnic and monolingual (German) context (KiDKo/Mo). Taken together, this indicates patterns of covariation that support an interpretation of coherence for this new dialect: our findings point to an interconnected linguistic system, rather than to a mere accumulation of individual features. In addition to this internal coherence, the data also points to external coherence: Kiezdeutsch is not disconnected on the outside either, but fully integrated within the general domain of German, an integration that defies a distinction of “autochthonous” and “allochthonous” German, not only at the level of speakers, but also at the level of linguistic systems.