Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus
(2021)
Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these large raw corpora are either available only for English or not available to the general public due to copyright issues. Nevertheless, there are some examples of freely available multilingual corpora for training Deep Learning NLP models, such as the OSCAR and Paracrawl corpora. However, they have quality issues, especially for low-resource languages. Moreover, recreating or updating these corpora is very complex. In this work, we try to reproduce and improve the goclassy pipeline used to create the OSCAR corpus. We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data. Also, unlike OSCAR, the metadata information is at the document level. We release our pipeline under an open source license and publish the corpus under a research-only license.
The changes caused by the growing automatisation of processes in the lexicographer's workstation and in lexicographic work, together with the ensuing needs of lexicographers and their demands for adequately targeted software, have not been discussed sufficiently in meta-lexicographic research. The aim of this paper is therefore to fill this gap, with a focus on academic non-commercial lexicography. After an introduction to the general functionalities of specific dictionary writing software, we use a real-life example to discuss the lexicographic working environment, the new specific demands on lexicographic software, and different tools. The final aim is to propose some recommendations for how to structure the lexicographic working environment to meet specific project requirements.
This paper outlines the generation process of a specific computational linguistic representation termed the Multilingual Time Map, conceptually a multi-tape finite-state transducer encoding linguistic data at different levels of granularity. The first component acquires phonological data from syllable-labeled speech data, the second component defines feature profiles, the third component generates feature hierarchies and augments the acquired data with the defined feature profiles, and the fourth component displays the Multilingual Time Map as a graph.
Preface
(2019)
Preface
(2020)
This study investigates cross-language differences in pitch range and variation in four languages from two language groups: English and German (Germanic) and Bulgarian and Polish (Slavic). The analysis is based on large multi-speaker corpora (48 speakers for Polish, 60 for each of the other three languages). Linear mixed models were computed that include various distributional measures of pitch level, span and variation, revealing characteristic differences across languages and between language groups. A classification experiment based on the relevant parameter measures (span, kurtosis and skewness values for pitch distributions for each speaker) succeeded in separating the language groups.
This study presents the results of a large-scale comparison of various measures of pitch range and pitch variation in two Slavic (Bulgarian and Polish) and two Germanic (German and British English) languages. The productions of twenty-two speakers per language (eleven male and eleven female) in two different tasks (read passages and number sets) are compared. Significant differences between the language groups are found: German and English speakers use lower pitch maxima, narrower pitch span, and generally less variable pitch than Bulgarian and Polish speakers. These findings support the hypothesis that linguistic communities tend to be characterized by particular pitch profiles.
Based on specific linguistic landmarks in the speech signal, this study investigates pitch level and pitch span differences in English, German, Bulgarian and Polish. The analysis is based on 22 speakers per language (11 males and 11 females). Linear mixed models were computed that include various linguistic measures of pitch level and span, revealing characteristic differences across languages and between language groups. Pitch level was significantly higher for the female speakers in the Slavic group than for those in the Germanic group. The male speakers showed slightly different results, with only the Polish speakers displaying significantly higher mean values for pitch level than the German males. Overall, the results show that the Slavic speakers tend to have a wider pitch span than the German speakers. For the linguistic measure, however, namely the span between the initial peaks and the non-prominent valleys, we find a difference only between Polish and German speakers. We found a flatter intonation contour in German than in Polish, Bulgarian and English male and female speakers, as well as differences in the frequency of the landmarks between languages. Concerning “speaker liveliness”, we found that the speakers from the Slavic group are significantly livelier than the speakers from the Germanic group.
This paper aims to show how connectives can be used as linguistic means of realizing two kinds of conversational activities, namely intersubjective procedures and conversation-organizing procedures. A speaker draws on intersubjective procedures in order to establish, in cooperation with the interlocutor, a shared background of knowledge (common ground). Through conversation-organizing procedures, the speaker intervenes in the topical structure of the interaction. The paper examines how these two conversational procedures are realized, taking the communicative genre of the autobiographical interview as an example. In my view, this genre is particularly well suited to such an analysis, because it is characterized by a relatively sharp separation of conversational roles, which makes the course of the interaction easier to follow. Two parties take part in an autobiographical interview: the interviewee, who is regarded as the holder of knowledge, and the interviewer, whose role as moderator of the conversation is to facilitate the transfer of knowledge. The interviewer thus faces a twofold task: balancing out the initial asymmetry of knowledge while also being responsible for organizing the conversation. In what follows, the German conjunction und ("and") is used to illustrate how the use of connectives can contribute to accomplishing these two communicative tasks.
Prominence has been widely studied on the word level and the syllable level. An extensive study comparing the two approaches is missing in the literature. This study investigates how word and syllable prominence relate to each other in German. We find that perceptual ratings based on the word level are more extreme than those based on the syllable level. The correlations between word prominence and acoustic features are greater than the correlations between syllable prominence and acoustic features.