Digitale Sprachwissenschaft
Metadata is a topic of significant interest across all consortia, serving as a crucial link between them. When properly addressed, it enables consortia to share their specific needs and experiences, fostering collaboration and knowledge exchange.
This report summarizes the outcomes of the first NFDI Metadata Workshop, which took place on January 14-15, 2025, in Dresden, and was organized by the Taskforce Metadata. The workshop marked the beginning of a series of NFDI-wide metadata discussions aimed at developing joint recommendations for metadata schemas for datasets and the use of re3data as a central registry for repositories.
The negotiation of pronominal address on talk pages of the German, French, and Italian Wikipedia.
(2025)
The adequate use of social deixis is highly dependent on the situation and context and has therefore always been at the center of linguistic pragmatics. So far, principles of pronominal address have mainly been modelled with a focus on oral, co-present interaction. The use of pronominal address in computer-mediated communication (CMC) with its translocal and partially anonymous contexts is still a research gap.
This paper asks, from a contrastive perspective, how the appropriate use of address pronouns is negotiated on talk pages of the German, French, and Italian Wikipedia. The talk pages of Wikipedia share features of CMC genres such as a dialogic structure and an informal writing style with non-standard language. The study analyses both types of Wikipedia talk pages, using data from the multilingual corpora of the Leibniz Institute for the German Language: article talk pages, where authors negotiate online encyclopedic content, and user talk pages, where the contributions of individual authors are discussed.
Based on corpus data, it can be shown that the transition from the formal form (in German: Sie, in French: vous, in Italian: Lei) to the informal form (in German: du, in French: tu, in Italian: tu) is not always unidirectional in CMC. Of the two analysed Wikipedia subcorpora, i.e., the article talk pages on the one hand and the user talk pages on the other, more discussion about forms of address takes place on the user talk pages, with the informal you variant being discussed more frequently than the formal you variant. Aspects of pronominal address among speakers of German, French and Italian are characterized by instability and uncertainty, especially in CMC. Moreover, it can be shown that Wikipedia authors discuss, among other things, the reasoning behind their preference for a certain form as well as the notion of “distance”, in which informal variants signal a familiarity that not all users perceive as desirable.
This paper addresses the extraction of lexical information for machine-readable lexica from multilevel annotations. Relations between these levels of annotation are used for subclassification of lexical entries. A method for relating annotation units is presented, based on a temporal calculus. Relating the annotation units manually is error-prone, time-consuming, and tends to be inconsistent; a method for accomplishing this task automatically is therefore presented and evaluated using German, Japanese and Anyi (W. Africa) corpora.
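As a rough illustration of what relating annotation units by a temporal calculus can look like, the following minimal sketch classifies pairs of time-aligned units from two annotation tiers by a coarse, Allen-style interval relation. It is not the method of the paper: the `Unit` class, the tier names, the relation labels, and the time values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One annotation unit on a tier (hypothetical structure)."""
    tier: str
    label: str
    start: float  # seconds
    end: float    # seconds


def relation(a: Unit, b: Unit) -> str:
    """Return a coarse temporal relation of unit a to unit b."""
    if a.end <= b.start:
        return "before"
    if b.end <= a.start:
        return "after"
    if a.start == b.start and a.end == b.end:
        return "equal"
    if a.start >= b.start and a.end <= b.end:
        return "during"
    if b.start >= a.start and b.end <= a.end:
        return "contains"
    return "overlaps"


def relate_tiers(xs, ys):
    """Pair up all units of two tiers that share some stretch of time."""
    pairs = []
    for x in xs:
        for y in ys:
            rel = relation(x, y)
            if rel not in ("before", "after"):
                pairs.append((x.label, rel, y.label))
    return pairs


# Invented example data: one word aligned with two syllables.
words = [Unit("word", "guten", 0.0, 0.4)]
syllables = [Unit("syll", "gu", 0.0, 0.2), Unit("syll", "ten", 0.2, 0.4)]
```

Calling `relate_tiers(words, syllables)` on the invented data pairs the word with both of its syllables via the `contains` relation; such pairs could then feed the subclassification of a lexical entry.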
Concordancing is one of the oldest corpus analysis tools, especially for written corpora. In NLP, concordancing appears in the training of speech-recognition systems. Additionally, comparative studies of different languages result in parallel corpora. Concordancing for these corpora in an NLP context is a new approach. We propose to combine these fields of interest in a multi-purpose concordance for Spoken Language Data, opening the opportunity of combining corpus-linguistic and NLP methods and resulting in a broader empirical basis for NLP research. Theoretical models for audio concordances are discussed. Principles of the structure and design of a parallel audio concordance are given, with coding by means of XML to ensure reusability and flexibility, and with time stamps for referencing from annotations to the signal.
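The idea of XML-coded annotations that point into the signal via time stamps can be sketched as follows. This is only an illustrative sketch under assumptions: the element name `w`, the attributes `start`, `end` and `signal`, the file name, and the sample data are invented here and do not reproduce the format proposed in the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation file: each token carries time stamps (in seconds)
# that reference the audio signal named in the root element.
SAMPLE = """<annotation signal="rec01.wav">
  <w start="0.00" end="0.20">we</w>
  <w start="0.20" end="0.45">see</w>
  <w start="0.45" end="0.55">the</w>
  <w start="0.55" end="0.95">house</w>
  <w start="0.95" end="1.10">and</w>
  <w start="1.10" end="1.20">the</w>
  <w start="1.20" end="1.70">garden</w>
</annotation>"""


def concordance(xml_text, keyword, width=2):
    """Return keyword-in-context hits together with their time stamps."""
    root = ET.fromstring(xml_text)
    tokens = [
        (w.text, float(w.get("start")), float(w.get("end")))
        for w in root.iter("w")
    ]
    hits = []
    for i, (text, start, end) in enumerate(tokens):
        if text.lower() != keyword.lower():
            continue
        left = " ".join(t for t, _, _ in tokens[max(0, i - width):i])
        right = " ".join(t for t, _, _ in tokens[i + 1:i + 1 + width])
        hits.append({
            "signal": root.get("signal"),
            "left": left,
            "key": text,
            "right": right,
            "start": start,
            "end": end,
        })
    return hits
```

Each hit carries the left and right context of the keyword plus the `start` and `end` times, so a tool could play back the matching stretch of the recording alongside the text line.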
Exploring digitally-mediated communication with corpora. Methods, analyses, and corpus construction
(2025)
Specialized corpora of the language of Computer-mediated Communication and Social Media are increasingly vital for the analysis of the "unparalleled and rapidly evolving diversity in terms of speakers and settings" in digital contexts, as well as of "language evolution seen through the lens of user-generated content, which gives access to a number of variants, socio- and idiolects" (Barbaresi 2019: 29–30).
This volume brings together corpus-based, language-centered research on CMC and social media in linguistics, philologies, communication sciences, media, and social sciences with research questions from the fields of corpus and computational linguistics, language technology, text technology, and machine learning. It features research in which computational methods and tools are used for language-centered empirical analysis of CMC and social media phenomena as well as research on building, processing, annotating, representing, and exploiting CMC and social media corpora, including their integration in digital research infrastructures.
Currently no standardised gesture annotation systems are available. As a contribution towards solving this problem, this paper presents CoGesT, a machine-processable and human-usable computational model for the annotation of a subset of conversational gestures. Its empirical and formal properties are detailed, and application areas are discussed.
Sprache und Diskurs
(2022)
Authenticity is a category of reflection and of great interest to linguistics, because language is "the central medium in the communicative production of authenticity", which is itself always situated between the poles of "truthfulness and staging" (Felder / Gardt 2016; 2020). Linguistics therefore has a particular task in this context. Authenticity is produced linguistically and discursively and is the outcome of negotiation; acts of authentication are the realisations of that negotiation. On this premise, the subject is of eminent social relevance. This has been made clear by the studies that have presented and described the concept of authenticity in art, science, and business, in communication and the public sphere, and with regard to semantic structures.
Despite being an official language of several countries in Central and Western Europe, German is not formally recognised as the official language of the Federal Republic of Germany. However, in certain situations the use of the German language, including the spelling rules, is subject to state regulation (by acts of Federal Parliament or by administrative decisions). This article presents the content of this regulation, its scope, and the historical context in which it was adopted.
This paper quantifies textual patterns relating to gendered assumptions in an unusual text, an entire “women’s encyclopedia” from 1830s Germany, which at 10 volumes and 1,461,000 word tokens was of comparable size to contemporary general encyclopedias, but was written and marketed for a female audience. We perform experiments on classifying the gender of biographical entries and on querying a specific textual feature, calendar dates, with context from 19th- and 20th-century comparison encyclopedias from the EncycNet corpus.