Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus
(2021)
Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these large raw corpora are either available only for English or not available to the general public due to copyright issues. Nevertheless, there are some examples of freely available multilingual corpora for training Deep Learning NLP models, such as the OSCAR and Paracrawl corpora. However, they have quality issues, especially for low-resource languages. Moreover, recreating or updating these corpora is very complex. In this work, we try to reproduce and improve the goclassy pipeline used to create the OSCAR corpus. We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data. Also, unlike OSCAR, the metadata information is at the document level. We release our pipeline under an open source license and publish the corpus under a research-only license.
Preface
(2019)
Preface
(2020)
This study investigates cross-language differences in pitch range and variation in four languages from two language groups: English and German (Germanic) and Bulgarian and Polish (Slavic). The analysis is based on large multi-speaker corpora (48 speakers for Polish, 60 for each of the other three languages). Linear mixed models were computed that include various distributional measures of pitch level, span and variation, revealing characteristic differences across languages and between language groups. A classification experiment based on the relevant parameter measures (span, kurtosis and skewness values for pitch distributions for each speaker) succeeded in separating the language groups.
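The distributional measures named above (level, span, skewness and kurtosis of a speaker's pitch distribution) can be illustrated with a minimal sketch. The abstract does not give the exact definitions used in the study, so everything here is an assumption: the function name `pitch_profile` is hypothetical, pitch is converted to semitones relative to 1 Hz, span is taken as maximum minus minimum, and skewness and excess kurtosis are computed from population moments.

```python
import math
from statistics import fmean


def pitch_profile(f0_hz):
    """Summarise one speaker's pitch distribution (illustrative definitions only)."""
    # Assumption: express F0 in semitones relative to 1 Hz, skipping unvoiced frames (0 Hz).
    st = [12.0 * math.log2(f) for f in f0_hz if f > 0]
    if len(st) < 2:
        raise ValueError("need at least two voiced F0 values")

    n = len(st)
    mean = fmean(st)
    # Central moments of the semitone distribution (population form, divided by n).
    m2 = sum((x - mean) ** 2 for x in st) / n
    m3 = sum((x - mean) ** 3 for x in st) / n
    m4 = sum((x - mean) ** 4 for x in st) / n
    if m2 == 0:
        raise ValueError("pitch values are constant; shape measures are undefined")

    return {
        "level_st": mean,                # pitch level: mean in semitones
        "span_st": max(st) - min(st),    # pitch span: max minus min in semitones
        "skewness": m3 / m2 ** 1.5,      # asymmetry of the distribution
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis (0 for a normal distribution)
    }


if __name__ == "__main__":
    # Hypothetical F0 track in Hz; 0.0 marks an unvoiced frame.
    print(pitch_profile([110.0, 0.0, 125.0, 140.0, 180.0, 130.0]))
```

Per-speaker values of this kind could then serve as predictors in a mixed model or as features for a classifier, as in the experiment described in the abstract; the study's actual feature extraction may differ.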
This study presents the results of a large-scale comparison of various measures of pitch range and pitch variation in two Slavic (Bulgarian and Polish) and two Germanic (German and British English) languages. The productions of twenty-two speakers per language (eleven male and eleven female) in two different tasks (read passages and number sets) are compared. Significant differences between the language groups are found: German and English speakers use lower pitch maxima, narrower pitch span, and generally less variable pitch than Bulgarian and Polish speakers. These findings support the hypothesis that linguistic communities tend to be characterized by particular pitch profiles.
Based on specific linguistic landmarks in the speech signal, this study investigates pitch level and pitch span differences in English, German, Bulgarian and Polish. The analysis is based on 22 speakers per language (11 males and 11 females). Linear mixed models were computed that include various linguistic measures of pitch level and span, revealing characteristic differences across languages and between language groups. Pitch level was significantly higher for the female speakers in the Slavic group than for those in the Germanic group. The male speakers showed slightly different results, with only the Polish speakers displaying significantly higher mean values for pitch level than the German males. Overall, the results show that the Slavic speakers tend to have a wider pitch span than the German speakers. For the linguistic measure, however, namely the span between the initial peaks and the non-prominent valleys, we find a difference only between Polish and German speakers. We found a flatter intonation contour for German speakers than for Polish, Bulgarian and English speakers of both sexes, as well as differences between languages in the frequency of the landmarks. Concerning “speaker liveliness”, we found that the speakers from the Slavic group are significantly livelier than the speakers from the Germanic group.
Prominence has been widely studied on the word level and the syllable level. An extensive study comparing the two approaches is missing in the literature. This study investigates how word and syllable prominence relate to each other in German. We find that perceptual ratings based on the word level are more extreme than those based on the syllable level. The correlations between word prominence and acoustic features are greater than the correlations between syllable prominence and acoustic features.
In our study we use the experimental framework of priming to manipulate our subjects’ expectations of syllable prominence in sentences with a well-defined syntactic and phonological structure. The results show that it is possible to prime prominence patterns and that priming leads to significant differences in the judgment of syllable prominence.
Streefkerk defines prominence as the perceptually outstanding parts in spoken language. An optimal rating scale for syllable prominence has not yet been found. This paper evaluates a 4-point, an 11-point, a 31-point, and a continuous scale for the rating of syllable prominence and gives support for scales using a higher number of levels. Priming effects found by Arnold et al. could only be replicated using the 31-point scale.
The classification of verbs in Levin's (1993) English Verb Classes and Alternations: A Preliminary Investigation, on the basis of both intuitive semantic grouping and their participation in valence alternations, is often used by the NLP community as evidence of the semantic similarity of verbs (Jing & McKeown 1998; Lapata & Brew 1999; Kohl et al. 1998). In this paper, we compare the Levin classification with the work of the FrameNet project (Fillmore & Baker 2001), where words (not just verbs) are grouped according to the conceptual structures (frames) that underlie them and their combinatorial patterns are inductively derived from corpus evidence. This means that verbs grouped together in FrameNet (FN) might be semantically similar but have different (or no) alternations, and that verbs which share the same alternation might be represented in two different semantic frames.