Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus
(2021)
Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these large raw corpora are either available only for English or not available to the general public due to copyright issues. Nevertheless, there are some examples of freely available multilingual corpora for training Deep Learning NLP models, such as the OSCAR and Paracrawl corpora. However, they have quality issues, especially for low-resource languages. Moreover, recreating or updating these corpora is very complex. In this work, we try to reproduce and improve the goclassy pipeline used to create the OSCAR corpus. We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data. Also, unlike OSCAR, the metadata information is at the document level. We release our pipeline under an open source license and publish the corpus under a research-only license.
The changes caused by the growing automatisation of processes in the lexicographer's workstation and in lexicographic work, together with the ensuing needs of lexicographers and their demands for adequately targeted software, have not been discussed sufficiently in meta-lexicographic research. The aim of this paper is therefore to fill this gap, with a focus on academic non-commercial lexicography. After an introduction to the general functionalities of specific dictionary writing software, we use a real-life example to discuss the lexicographic working environment, the new specific demands on lexicographic software, and different tools. The final aim is to propose some recommendations for how to structure the lexicographic working environment to meet specific project requirements.
This paper outlines the generation process of a specific computational linguistic representation termed the Multilingual Time Map, conceptually a multi-tape finite-state transducer encoding linguistic data at different levels of granularity. The first component acquires phonological data from syllable-labeled speech data, the second component defines feature profiles, the third component generates feature hierarchies and augments the acquired data with the defined feature profiles, and the fourth component displays the Multilingual Time Map as a graph.
Preface
(2019)
Preface
(2020)
The automatic recognition of idioms poses a challenging problem for NLP applications. Whereas native speakers can intuitively handle multiword expressions whose compositional meanings are hard to trace back to individual word semantics, there is still ample scope for improvement regarding computational approaches. We assume that idiomatic constructions can be characterized by gradual intensities of semantic non-compositionality, formal fixedness, and unusual usage context, and introduce a number of measures for these characteristics, comprising count-based and predictive collocation measures together with measures of context (un)similarity. We evaluate our approach on a manually labelled gold standard, derived from a corpus of German pop lyrics. To this end, we apply a Random Forest classifier to analyze the individual contribution of features for automatically detecting idioms, and study the trade-off between recall and precision. Finally, we evaluate the classifier on an independent dataset of idioms extracted from a list of Wikipedia idioms, achieving state-of-the-art accuracy.
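As a rough illustration of what a count-based collocation measure can look like, the following sketch computes pointwise mutual information (PMI) for adjacent word pairs. PMI is an assumption chosen for illustration; the abstract does not name the specific measures used, and the function name `pmi_scores` is hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return PMI (base 2) for every adjacent word pair in a token list.

    Illustrative only: PMI stands in for the unspecified count-based
    collocation measures mentioned in the abstract.
    """
    if len(tokens) < 2:
        return {}
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi            # relative frequency of the pair
        p_w1 = unigrams[w1] / n_uni      # relative frequency of each word
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    toy = "a b a b c d".split()
    for pair, score in sorted(pmi_scores(toy).items()):
        print(pair, round(score, 3))
```

A high PMI indicates that two words co-occur more often than their individual frequencies would predict, which is one possible signal of formal fixedness. In a setup like the one described, such scores would be one feature among several passed to a classifier.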
In order to differentiate between figurative and literal usage of verb-noun combinations for the shared task on the disambiguation of German Verbal Idioms issued for KONVENS 2021, we apply and extend an approach originally developed for detecting idioms in a dataset consisting of random n-gram samples. The classification is done by implementing a rather shallow, statistics-based pipeline without intensive preprocessing or examination at the morphosyntactic and semantic levels. We describe the overall approach and the differences between the original dataset and the dataset of the KONVENS task, provide experimental classification results, and analyse the individual contributions of our feature sets.
This study investigates cross-language differences in pitch range and variation in four languages from two language groups: English and German (Germanic) and Bulgarian and Polish (Slavic). The analysis is based on large multi-speaker corpora (48 speakers for Polish, 60 for each of the other three languages). Linear mixed models were computed that include various distributional measures of pitch level, span and variation, revealing characteristic differences across languages and between language groups. A classification experiment based on the relevant parameter measures (span, kurtosis and skewness values for pitch distributions for each speaker) succeeded in separating the language groups.
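The per-speaker distributional measures named above (span, skewness, kurtosis) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's procedure: span is taken here as maximum minus minimum in Hz, level as the median, and the moments as population moments; the abstract does not specify these definitions, and `pitch_profile` is a hypothetical name.

```python
import statistics


def pitch_profile(f0_values):
    """Summarise one speaker's F0 samples (in Hz) with simple distribution measures.

    Assumptions (not taken from the study): level = median, span = max - min,
    skewness and excess kurtosis computed from population moments.
    """
    mean = statistics.fmean(f0_values)
    sd = statistics.pstdev(f0_values)
    if sd == 0:
        raise ValueError("F0 values must vary to compute skewness and kurtosis")
    n = len(f0_values)
    skewness = sum(((x - mean) / sd) ** 3 for x in f0_values) / n
    kurtosis = sum(((x - mean) / sd) ** 4 for x in f0_values) / n - 3.0
    return {
        "level": statistics.median(f0_values),
        "span": max(f0_values) - min(f0_values),
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


if __name__ == "__main__":
    print(pitch_profile([100.0, 110.0, 120.0, 130.0, 140.0]))
```

Profiles of this kind, computed once per speaker, would form the feature vectors for a classification experiment such as the one described.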
This study presents the results of a large-scale comparison of various measures of pitch range and pitch variation in two Slavic (Bulgarian and Polish) and two Germanic (German and British English) languages. The productions of twenty-two speakers per language (eleven male and eleven female) in two different tasks (read passages and number sets) are compared. Significant differences between the language groups are found: German and English speakers use lower pitch maxima, narrower pitch span, and generally less variable pitch than Bulgarian and Polish speakers. These findings support the hypothesis that linguistic communities tend to be characterized by particular pitch profiles.
Based on specific linguistic landmarks in the speech signal, this study investigates pitch level and pitch span differences in English, German, Bulgarian and Polish. The analysis is based on 22 speakers per language (11 males and 11 females). Linear mixed models were computed that include various linguistic measures of pitch level and span, revealing characteristic differences across languages and between language groups. Pitch level was significantly higher for the female speakers in the Slavic group than for those in the Germanic group. The male speakers showed slightly different results, with only the Polish speakers displaying significantly higher mean values for pitch level than the German males. Overall, the results show that the Slavic speakers tend to have a wider pitch span than the German speakers. However, for the linguistic measure of span between the initial peaks and the non-prominent valleys, we find a difference only between Polish and German speakers. We found a flatter intonation contour in German than in Polish, Bulgarian and English for both male and female speakers, as well as differences in the frequency of the landmarks between languages. Concerning "speaker liveliness", we found that the speakers from the Slavic group are significantly livelier than the speakers from the Germanic group.