OPUS 4 | Korpuslinguistik

Korpuslinguistik

265 search hits

Corpus size strongly matters when analysing word frequency distributions (2024)

Koplenig, Alexander

In a previous study, Aceves and Evans present a large-scale quantitative information-theoretic analysis of parallel corpus data in ~1,000 languages to show that there are apparently strong associations between the way languages encode information into words and patterns of communication, e.g. the configuration of semantic information. During the peer review process, one reviewer raised the question of the extent to which the presented results depend on different corpus sizes (see the Peer Review File). This is a very important question given that most, if not all, of the quantities associated with word frequency distributions vary systematically with corpus size. While Aceves and Evans claim that corpus size does not affect the results presented, I challenge this view by presenting reanalyses of the data that clearly suggest that it does.
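The dependence of frequency-based quantities on corpus size can be illustrated with a minimal sketch. It does not reproduce the reanalysis described in the abstract: it assumes synthetic tokens drawn from a Zipf-like distribution (the helper `zipf_sample`, the vocabulary size, and the sample sizes are all hypothetical choices) and shows only that a simple measure, the type-token ratio, shifts systematically as more tokens are included.

```python
import random


def zipf_sample(n_tokens, vocab_size=5000, seed=0):
    """Draw synthetic token IDs with probability proportional to 1/rank.

    Hypothetical stand-in for real corpus data, for illustration only.
    """
    rng = random.Random(seed)
    ranks = range(1, vocab_size + 1)
    weights = [1.0 / r for r in ranks]
    return rng.choices(ranks, weights=weights, k=n_tokens)


def type_token_ratio(tokens):
    """Number of distinct types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)


tokens = zipf_sample(20000)
sizes = [500, 2000, 8000, 20000]

# Compute the same measure on growing prefixes of one and the same sample.
ttr_by_size = {n: type_token_ratio(tokens[:n]) for n in sizes}

for n, ttr in ttr_by_size.items():
    print(f"{n:>6} tokens  TTR = {ttr:.3f}")
```

In this toy setting the ratio falls as the sample grows, which is why comparing such measures across corpora of unequal size needs some form of size control.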

To drop or not to drop? Predicting the omission of the infinitival marker in a Swedish future construction (2024)

Berdicevskis, Aleksandrs ; Coussé, Evie ; Koplenig, Alexander ; Adesam, Yvonne

We investigate the optional omission of the infinitival marker in a Swedish future tense construction. During the last two decades the frequency of omission has been rapidly increasing, and this process has received considerable attention in the literature. We test whether the knowledge which has been accumulated can yield accurate predictions of language variation and change. We extracted all occurrences of the construction from a very large collection of corpora. The dataset was automatically annotated with language-internal predictors which have previously been shown or hypothesized to affect the variation. We trained several models in order to make two kinds of predictions: whether the marker will be omitted in a specific utterance and how large the proportion of omissions will be for a given time period. For most of the approaches we tried, we were not able to achieve a better-than-baseline performance. The only exception was predicting the proportion of omissions using autoregressive integrated moving average models for one-step-ahead forecast, and in this case time was the only predictor that mattered. Our data suggest that most of the language-internal predictors do have some effect on the variation, but the effect is not strong enough to yield reliable predictions.

Using YouTube comments as teaching material: The benefits of the NottDeuYTSch corpus (2023)

Cotgrove, Louis

Developments within the field of Second Language Acquisition (SLA) have meant that scholars are increasingly engaging with corpora and corpus-based resources, providing a source of “‘authentic’ language” to learners and educators (Mitchell 2020: 254), and contributing to “state-of-the-art research methodologies” (Deshors and Gries 2023: 164). However, there are areas in which progress can still be made, particularly in the area of metadata, such as information about the speaker and contexts of the language use, as well as increased variety in the text types and genres of corpora used to develop SLA materials (Paquot 2022: 36). This post discusses one such possibility for increasing the variety of text types and providing a rich source of authentic language that can be used to create engaging SLA materials, particularly for young people learning German, namely the use of the NottDeuYTSch corpus (to download the corpus in a variety of formats, see Cotgrove 2018).

First users’ interactions with voice-controlled virtual assistants: A micro-longitudinal corpus study (2023)

Barthel, Mathias ; Helmer, Henrike ; Reineke, Silke

We present a collection of (currently) about 5,500 commands directed to voice-controlled virtual assistants (VAs) by sixteen initial users of a VA system in their homes. The collection comprises recordings captured by the VA itself and with a conditional voice recorder (CVR) selectively capturing recordings including the VA-directed commands plus some surrounding context. In addition to a description of the collection, we present initial findings on the patterns of use of the VA systems during the first weeks after installation, including usage timing, the development of usage frequency, distributions of sentence structures across commands, and (the development of) command success rates. We discuss the advantages and disadvantages of the applied collection-specific recording approach and describe potential research questions that can be investigated in the future, based on the collection, as well as the merit of combining quantitative corpus linguistic approaches with qualitative in-depth analyses of single cases.

megageil, mega geil, and voll mega: Intensification in YouTube comments (2023)

Cotgrove, Louis

This paper analyses intensification in German digitally-mediated communication (DMC) using a corpus of YouTube comments written by young people (the NottDeuYTSch corpus). Research on intensification in written language has traditionally focused on two grammatical aspects: syntactic intensification, i.e. the use of particles and other lexical items, and morphological intensification, i.e. the use of compounding. Using a wide variety of examples from the corpus, the paper identifies novel ways that have been used for intensification in DMC, and suggests a new taxonomy of classification for future analysis of intensification.

The IVK-Ler corpus of adolescent foreign-language learners of German (2023)

Pushkina, Alexandra ; Hinrichs, Erhard

This paper presents the IVK-Ler corpus, a longitudinal, annotated learner corpus of weekly writings produced by a group of 18 adolescents in a preparatory class. The corpus consists of 117 student texts collected between 2020 and 2021 and has a structure layered by student and text number. It includes metadata that enables researchers to analyze and track individual student progress in terms of syntactic competence and literacy. The annotation schema, manual and automatic annotation processes, and corpus representation are described in detail. The corpus currently includes target hypotheses and gold standard part-of-speech tags. Future work could include additional annotation layers for topological fields and dependency relations, as well as semantic and discourse annotations to make the corpus usable for tasks beyond syntactic evaluations.

A distributional comparison between FOLK and DeReKo (2023)

Kupietz, Marc ; Fankhauser, Peter ; Ruppenhofer, Josef

Introducing DeReKoGram: A novel frequency dataset with lemma and part-of-speech information for German (2023)

Wolfer, Sascha ; Koplenig, Alexander ; Kupietz, Marc ; Müller-Spitzer, Carolin

We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
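The fold-wise case study can be pictured with a small sketch. It does not use the DeReKoGram files or the scripts the authors provide: the function `growth_over_folds` and the toy unigram counts are hypothetical, and each fold is assumed to be available as a simple word-to-frequency mapping. The sketch only shows how vocabulary size and the number of hapax legomena can be tracked as folds are added one by one.

```python
from collections import Counter


def growth_over_folds(folds):
    """Return (folds_included, n_types, n_hapax) after adding each fold.

    `folds` is assumed to be a list of word -> frequency mappings.
    """
    cumulative = Counter()
    rows = []
    for k, fold in enumerate(folds, start=1):
        cumulative.update(fold)  # add this fold's frequencies to the total
        n_types = len(cumulative)
        n_hapax = sum(1 for count in cumulative.values() if count == 1)
        rows.append((k, n_types, n_hapax))
    return rows


# Toy data standing in for three folds of unigram counts (not real values).
toy_folds = [
    Counter({"der": 5, "Haus": 1, "gehen": 2}),
    Counter({"der": 4, "Haus": 1, "Korpus": 1}),
    Counter({"der": 6, "gehen": 1, "Wort": 1, "Korpus": 1}),
]

rows = growth_over_folds(toy_folds)
for k, n_types, n_hapax in rows:
    print(f"folds={k}  types={n_types}  hapax={n_hapax}")
```

Working cumulatively like this makes it easy to stop after a subset of folds, which matches the abstract's point that a subset is often sufficient.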

Data of German speech minorities in the Archive for Spoken German: an overview (2023)

Gorisch, Jan ; Schmidt, Thomas ; Stift, Ulf Michael

Speech islands are historically and developmentally unique and will inevitably disappear within the next decades. We urgently need to preserve their remains and exploit what is left in order to make research on language-in-contact and historical as well as current comparative language research possible. The Archive for Spoken German (AGD) at the Institute for German Language collects, fosters and archives data from completed research projects and makes them available to the wider research community. Besides large variation corpora and corpora of conversational speech, the archive already contains a range of collections of data on German speech minorities. The latter will be outlined in this chapter. Some speech island data is already made available through the personal service of the AGD, or the database of spoken German (DGD), e.g. data on Australian German, Unserdeutsch, or German in North America. Some corpora are still being prepared for publication, but are still important to document for potentially interested research projects. We therefore also explain the current problems and efforts related to the curation of speech island data, from the digitization of recordings and the collection of metadata, to the integration of transcriptions, annotations and other ways of accessing and sharing data.

Studying the distribution of reply relations in Wikipedia talk pages (2023)

Lüngen, Harald ; Herzberg, Laura

This paper presents an extended annotation and analysis of interpretative reply relations focusing on a comparison of reply relation types and targets between conflictual pages and neutral pages of German Wikipedia (WP) talk pages. We briefly present the different categories identified for interpretative reply relations to analyze the relationship between WP postings as well as linguistic cues for each category. We investigate referencing strategies of WP authors in discussion page postings, illustrated by means of reply relation types and targets taking into account the degree of disagreement displayed on a WP talk page. We provide richly annotated data that can be used for further analyses such as the identification of interactional relations on higher levels, or for training tasks in machine learning algorithms.