Corpus Linguistics
Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus
(2021)
Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these large raw corpora are either available only for English or not available to the general public due to copyright issues. Nevertheless, there are some examples of freely available multilingual corpora for training Deep Learning NLP models, such as the OSCAR and Paracrawl corpora. However, they have quality issues, especially for low-resource languages. Moreover, recreating or updating these corpora is very complex. In this work, we try to reproduce and improve the goclassy pipeline used to create the OSCAR corpus. We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data. Also, unlike OSCAR, the metadata information is at the document level. We release our pipeline under an open source license and publish the corpus under a research-only license.
COSMAS. Ein Computersystem für den Zugriff auf Textkorpora. Version R.1.3-1. Benutzerhandbuch
(1994)
Now that large amounts of data and computing capacity are available to researchers, linguistics too is increasingly working in a data-driven way. Data-driven research does not start from a hypothesis but looks for statistically conspicuous patterns in the data. In this approach, language is often treated in a highly simplified way, as a linear sequence of words. This study shows for the first time how additionally including syntactic annotations helps to capture linguistic structures of German more accurately.
The application example is a comparison of the academic language of linguistics and of literary studies. The two subjects are often grouped together as subdisciplines of German studies. Their scholarly practice, however, differs systematically with regard to research data, methods, and research interests, and this is also reflected in their academic language.
This paper presents the QUEST project and describes concepts and tools that are being developed within its framework. The goal of the project is to establish quality criteria and curation criteria for annotated audiovisual language data. Building on existing resources developed earlier by the participating institutions, QUEST develops tools that can be used to facilitate and verify adherence to these criteria. An important focus of the project is making these tools accessible to researchers without a substantial technical background and helping them produce high-quality data. The main tools we intend to provide are the depositors’ questionnaire and automatic quality assurance, both developed as web applications. They are accompanied by a knowledge base, which will contain recommendations and descriptions of best practices established in the course of the project. Conceptually, we split linguistic data into three resource classes (data deposits, collections and corpora). The class of a resource defines the strictness of the quality assurance it should undergo. This division is introduced so that overly strict quality criteria do not prevent researchers from depositing their data.
This paper presents the QUEST project and describes concepts and tools that are being developed within its framework. The goal of the project is to establish quality criteria and curation criteria for annotated audiovisual language data. Building on existing resources developed earlier by the participating institutions, QUEST also develops tools that can be used to facilitate and verify adherence to these criteria. An important focus of the project is making these tools accessible to researchers without a substantial technical background and helping them produce high-quality data. The main tools we intend to provide are a questionnaire and automatic quality assurance for depositors of language resources, both developed as web applications. They are accompanied by a knowledge base, which will contain recommendations and descriptions of best practices established in the course of the project. Conceptually, we consider three main data maturity levels in order to decide on a suitable level of strictness for the quality assurance. This division has been introduced so that a set of ideal quality criteria does not prevent researchers from depositing or even assessing their (legacy) data. The tools described in the paper are work in progress and are expected to be released by the end of the QUEST project in 2022.
The CMDI Explorer
(2020)
We present the CMDI Explorer, a tool that empowers users to easily explore the contents of complex CMDI records and to process selected parts of them with little effort. The tool allows users, for instance, to analyse virtual collections represented by CMDI records, and to send collection items to other CLARIN services such as the Switchboard for subsequent processing. The CMDI Explorer hence adds functionality that many users felt was lacking from the CLARIN tool space.
CMDI Explorer
(2021)
We present CMDI Explorer, a tool that empowers users to easily explore the contents of complex CMDI records and to process selected parts of them with little effort. The tool allows users, for instance, to analyse virtual collections represented by CMDI records, and to send collection items to other CLARIN services such as the Switchboard for subsequent processing. CMDI Explorer hence adds functionality that many users felt was lacking from the CLARIN tool space.
This paper addresses the long-term archiving of large corpora. It focuses on three aspects specific to language resources, namely (1) the removal of resources for legal reasons, (2) the versioning of (unchanged) objects in constantly growing resources, especially where objects can be part of multiple releases but also part of different collections, and (3) the conversion of data to new formats for digital preservation. The paper explains why language resources may have to be changed and why formats may need to be converted. As a solution, it suggests the use of an intermediate proxy object called a signpost. The approach is exemplified with the corpora of the Leibniz Institute for the German Language in Mannheim, namely the German Reference Corpus (DeReKo) and the Archive for Spoken German (AGD).