Korpuslinguistik
Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus
(2021)
Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these corpora are either available only for English or not available to the general public due to copyright issues. Some freely available multilingual corpora for training Deep Learning NLP models do exist, such as the OSCAR and Paracrawl corpora, but they have quality issues, especially for low-resource languages, and recreating or updating them is very complex. In this work, we try to reproduce and improve the goclassy pipeline used to create the OSCAR corpus. We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data. Unlike OSCAR, its metadata is provided at the document level. We release our pipeline under an open source license and publish the corpus under a research-only license.
This paper presents the QUEST project and describes concepts and tools that are being developed within its framework. The goal of the project is to establish quality and curation criteria for annotated audiovisual language data. Building on resources developed earlier by the participating institutions, QUEST develops tools that can be used to facilitate and verify adherence to these criteria. An important focus of the project is making these tools accessible to researchers without a substantial technical background and helping them produce high-quality data. The main tools we intend to provide are the depositors’ questionnaire and automatic quality assurance, both developed as web applications. They are accompanied by a knowledge base, which will contain recommendations and descriptions of best practices established in the course of the project. Conceptually, we split linguistic data into three resource classes (data deposits, collections and corpora). The class of a resource defines the strictness of the quality assurance it should undergo. This division is introduced so that overly strict quality criteria do not prevent researchers from depositing their data.
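The idea that a resource's class determines how strict its quality assurance is can be illustrated with a minimal sketch. It is not the QUEST tooling: the check names, the `REQUIRED_CHECKS` table, the `failed_checks` helper, and the assumption that strictness increases from data deposit to collection to corpus are all hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Dict, List, Set


class ResourceClass(Enum):
    # The three resource classes named in the abstract.
    DATA_DEPOSIT = "data deposit"
    COLLECTION = "collection"
    CORPUS = "corpus"


# Hypothetical check names; the ordering of strictness is an assumption.
REQUIRED_CHECKS: Dict[ResourceClass, Set[str]] = {
    ResourceClass.DATA_DEPOSIT: {"files_readable"},
    ResourceClass.COLLECTION: {"files_readable", "metadata_present"},
    ResourceClass.CORPUS: {"files_readable", "metadata_present", "annotations_valid"},
}


def failed_checks(resource_class: ResourceClass, results: Dict[str, bool]) -> List[str]:
    """Return the checks required for this class that did not pass.

    A check that is missing from `results` counts as not passed.
    """
    required = REQUIRED_CHECKS[resource_class]
    return sorted(name for name in required if not results.get(name, False))
```

With this layout, the same check results can be acceptable for a data deposit while still being insufficient for a corpus, which mirrors the motivation given in the abstract.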
This paper presents the QUEST project and describes concepts and tools that are being developed within its framework. The goal of the project is to establish quality and curation criteria for annotated audiovisual language data. Building on resources developed earlier by the participating institutions, QUEST also develops tools that can be used to facilitate and verify adherence to these criteria. An important focus of the project is making these tools accessible to researchers without a substantial technical background and helping them produce high-quality data. The main tools we intend to provide are a questionnaire and automatic quality assurance for depositors of language resources, both developed as web applications. They are accompanied by a knowledge base, which will contain recommendations and descriptions of best practices established in the course of the project. Conceptually, we consider three main data maturity levels in order to decide on a suitable level of strictness for the quality assurance. This division has been introduced so that a set of ideal quality criteria does not prevent researchers from depositing or even assessing their (legacy) data. The tools described in the paper are work in progress and are expected to be released by the end of the QUEST project in 2022.
The CMDI Explorer
(2020)
We present the CMDI Explorer, a tool that empowers users to easily explore the contents of complex CMDI records and to process selected parts of them with little effort. The tool allows users, for instance, to analyse virtual collections represented by CMDI records, and to send collection items to other CLARIN services such as the Switchboard for subsequent processing. The CMDI Explorer hence adds functionality that many users felt was lacking from the CLARIN tool space.
CMDI Explorer
(2021)
We present CMDI Explorer, a tool that empowers users to easily explore the contents of complex CMDI records and to process selected parts of them with little effort. The tool allows users, for instance, to analyse virtual collections represented by CMDI records, and to send collection items to other CLARIN services such as the Switchboard for subsequent processing. CMDI Explorer hence adds functionality that many users felt was lacking from the CLARIN tool space.
This paper addresses long-term archival for large corpora. It focuses on three aspects specific to language resources: (1) the removal of resources for legal reasons, (2) the versioning of (unchanged) objects in constantly growing resources, especially where objects can be part of multiple releases and also of different collections, and (3) the conversion of data to new formats for digital preservation. We explain why language resources may have to be changed and why formats may need to be converted. As a solution, we suggest the use of an intermediate proxy object called a signpost. The approach is exemplified with the corpora of the Leibniz Institute for the German Language in Mannheim, namely the German Reference Corpus (DeReKo) and the Archive for Spoken German (AGD).
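The notion of an intermediate proxy object can be made concrete with a small sketch. The abstract does not specify the signpost data model, so every name here (`Signpost`, `Status`, `resolve`, and all fields) is a hypothetical illustration of how a stable record could stand in for an object that is withdrawn, converted, or shared between releases.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    AVAILABLE = "available"
    WITHDRAWN = "withdrawn"  # e.g. removed for legal reasons
    MIGRATED = "migrated"    # converted to a newer format


@dataclass
class Signpost:
    """Hypothetical proxy record that a persistent identifier points to."""
    pid: str
    status: Status
    location: Optional[str] = None   # where the object lives, if it is available
    successor: Optional[str] = None  # PID of the converted or newer object
    member_of: List[str] = field(default_factory=list)  # releases or collections


def resolve(pid: str, registry: Dict[str, Signpost]) -> Optional[str]:
    """Follow successor links until an available object is found.

    Returns None for unknown, withdrawn, or dangling identifiers.
    """
    seen = set()
    while pid in registry and pid not in seen:
        seen.add(pid)  # guards against accidental cycles
        signpost = registry[pid]
        if signpost.status is Status.AVAILABLE:
            return signpost.location
        if signpost.successor is None:
            return None
        pid = signpost.successor
    return None
```

In this sketch, references to an object never change: only the signpost record is updated when the object is removed or converted, and one object can be listed in several releases or collections through `member_of`.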
Signposts for CLARIN
(2020)
This paper presents an implementation of CMDI-based signposts and its use. Arnold et al. (2020) present Signposts as a solution to challenges in the long-term preservation of corpora, especially corpora that are continuously extended and subject to modification, e.g., due to legal injunctions, that may overlap with respect to their constituents, and that may be subject to migrations to new data formats. We describe the contribution Signposts can make to the CLARIN infrastructure and document the design for the CMDI profile.
Signposts for CLARIN
(2021)
This paper presents an implementation of CMDI-based signposts and its use. Arnold, Fisseni et al. (2020) present signposts as a solution to challenges in the long-term preservation of corpora. Though signposts are applicable to digital resources in general, we focus on corpora, especially those that are continuously extended or subject to modification, e.g., due to legal injunctions, that may overlap with respect to their constituents, and that may be subject to migrations to new data formats. We describe the contribution signposts can make to the CLARIN infrastructure, notably to virtual collections, and document the design for the CMDI profile.
In many European languages, propositional arguments (PAs) can be realized as different types of structures. Cross-linguistically, complex structures with PAs show a systematic correlation between the strength of the semantic bond and the syntactic union (cf. Givón 2001; Wurmbrand/Lohninger 2023). Different languages also show similarities with respect to the (lexical) licensing of different PAs (cf. Noonan 1985; Givón 2001; Cristofaro 2003 on different predicate types). On a more fine-grained level, however, variation across languages can be observed with respect to both the syntactic-semantic properties of PAs and their licensing and usage. This presentation takes a multi-contrastive view of different types of PAs as syntactic subjects and objects by looking at five European languages: EN, DE, IT, PL and HU. Our goal is to identify the parameters of variation in the clausal domain with PAs and thereby to contribute to a better understanding of the individual language systems on the one hand and of the nature of linguistic variation in the clausal domain on the other. Phenomena and Methodology: We investigate the following types of PAs: direct object (DO) clauses (1), prepositional object (PO) clauses (2), subject clauses (3), and nominalizations (4, 5). Additionally, we discuss clause union phenomena (6, 7). The parameters analyzed include, among others, finiteness, the linear position of the PA, the presence or absence of a correlative element, the presence or absence of a complementizer, and the lexical-semantic class of the embedding verb. The phenomena are analyzed on the basis of corpus data (using mono- and multilingual corpora), experimental data (acceptability judgement surveys) or introspective data.
We present an approach to one aspect of managing complex access scenarios for large and heterogeneous corpora: handling user queries that, intentionally or because of the complexity of the queried resource, target texts or annotations outside the user’s permissions. We first outline the overall architecture of the corpus analysis platform KorAP, devoting some attention to the way in which it handles multiple query languages by implementing ISO CQLF (Corpus Query Lingua Franca), which in turn constitutes a component crucial for the functionality discussed here. Next, we look at query rewriting as it is used by KorAP and zoom in on one kind of this procedure, namely the rewriting of queries that is forced by data access restrictions.
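Rewriting forced by access restrictions can be sketched in a few lines. This is not KorAP's implementation or API: the `Query` type, the `restrict_to_permitted` function, the representation of permissions as a set of corpus names, and the decision to record each rewrite in a list are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Query:
    """Hypothetical query record; the expression itself is left untouched."""
    expression: str
    corpora: FrozenSet[str]  # corpora the user asked to search
    rewrites: List[str] = field(default_factory=list)  # notes on applied rewrites


def restrict_to_permitted(query: Query, permitted: FrozenSet[str]) -> Query:
    """Return a copy of the query limited to corpora the user may access."""
    allowed = query.corpora & permitted
    denied = query.corpora - permitted
    notes = list(query.rewrites)
    if denied:
        # Recording the change is an assumption of this sketch.
        notes.append("removed inaccessible corpora: " + ", ".join(sorted(denied)))
    return Query(query.expression, allowed, notes)
```

The design choice illustrated here is that an over-reaching query is narrowed rather than rejected, and that the original query object is not modified, so the difference between what was asked and what was executed remains visible.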