Refine
Year of publication
- 2019 (39) (remove)
Document Type
- Conference Proceeding (39) (remove)
Has Fulltext
- yes (39)
Keywords
- Korpus <Linguistik> (17)
- Deutsch (12)
- Automatische Sprachanalyse (11)
- Gesprochene Sprache (7)
- corpus linguistics (7)
- Computerlinguistik (6)
- corpus processing (5)
- Annotation (3)
- Natürliche Sprache (3)
- Text Encoding Initiative (3)
Publicationstate
- Veröffentlichungsversion (30)
- Zweitveröffentlichung (7)
- Postprint (3)
Reviewstate
- Peer-Review (36)
- (Verlags)-Lektorat (1)
- Peer review (1)
Publisher
- German Society for Computational Linguistics & Language Technology und Friedrich-Alexander-Universität Erlangen-Nürnberg (7)
- Leibniz-Institut für Deutsche Sprache (6)
- Lexical Computing CZ s.r.o. (5)
- Zenodo (3)
- LiU Electronic Press (2)
- Springer (2)
- The Association for Computational Linguistics (2)
- Association for Computational Linguistics (1)
- CLARIN (1)
- Cergy-Pontoise University, France (1)
Der Beitrag beschreibt ein mehrfach annotiertes Korpus deutschsprachiger Songtexte als Datenbasis für interdisziplinäre Untersuchungsszenarien. Die Ressource erlaubt empirisch begründete Analysen sprachlicher Phänomene, systemischstruktureller Wechselbeziehungen und Tendenzen in den Texten moderner Popmusik. Vorgestellt werden Design und Annotationen des in thematische und autorenspezifische Archive stratifizierten Korpus sowie deskriptive Statistiken am Beispiel des Udo-Lindenberg-Archivs.
Статтю присвячено дослідженню комунікативних невдач у мовленнєвому жанрі відеоінтерв’ю крізь призму української національної ідентичності. Визначено тематику, типи і жанрово-мовну специфіку українського відеоінтерв’ю як зразка діалогічного мовлення. Встановлено специфіку комунікативних невдач у цьому жанрі (зі спортсменами, політиками і культурними діячами) з огляду на позиції комунікантів, структурні рівні досліджуваного жанру та максими спілкування.
This paper reports on the latest developments of the European Reference Corpus EuReCo and the German Reference Corpus in relation to three of the most important CMLC topics: interoperability, collaboration on corpus infrastructure building, and legal issues. Concerning interoperability, we present new ways to access DeReKo via KorAP on the API and on the plugin level. In addition we report about advancements in the EuReCo- and ICC-initiatives with the provision of comparable corpora, and about recent problems with license acquisitions and our solution approaches using an indemnification clause and model licenses that include scientific exploitation.
In this paper, we present WebAnno-MM, an extension of the popular web-based annotation tool WebAnno, which is designed for the linguistic annotation of transcribed spoken data with time aligned media files. Several new features have been implemented for our current use case: a novel teaching method based on pair-wise manual annotation of transcribed video data and systematic comparison of agreement between students. To enable the annotation of transcribed spoken language data, apart from technical and data model related challenges, WebAnno-MM offers an additional view to data: a (musical) score view for the inspection of parallel utterances, which is relevant for various methodological research questions regarding the analysis of interactions of spoken content.
In this paper, we describe a data processing pipeline used for annotated spoken corpora of Uralic languages created in the INEL (Indigenous Northern Eurasian Languages) project. With this processing pipeline we convert the data into a loss-less standard format (ISO/TEI) for long-term preservation while simultaneously enabling a powerful search in this version of the data. For each corpus, the input we are working with is a set of files in EXMARaLDA XML format, which contain transcriptions, multimedia alignment, morpheme segmentation and other kinds of annotation. The first step of processing is the conversion of the data into a certain subset of TEI following the ISO standard ’Transcription of spoken language’ with the help of an XSL transformation. The primary purpose of this step is to obtain a representation of our data in a standard format, which will ensure its long-term accessibility. The second step is the conversion of the ISO/TEI files to a JSON format used by the “Tsakorpus” search platform. This step allows us to make the corpora available through a web-based search interface. As an addition, the existence of such a converter allows other spoken corpora with ISO/TEI annotation to be made accessible online in the future.
We report on a new project building a Natural Language Processing resource for Zulu by making use of resources already available. Combining tagging results with the results of morphological analysis semi-automatically, we expect to reduce the amount of manual work when generating a finely-grained gold standard corpus usable for training a tagger. From the tagged corpus, we plan to extract verb-argument pairs with the aim of compiling a verb valency lexicon for Zulu.
As the Web ought to be considered as a series of sources rather than as a source in itself, a problem facing corpus construction resides in meta-information and categorization. In addition, we need focused data to shed light on particular subfields of the digital public sphere. Blogs are relevant to that end, especially if the resulting web texts can be extracted along with metadata and made available in coherent and clearly describable collections.
Nearly all of the very large corpora of English are “static”, which allows a wide range of one-time, pre-processed data, such as collocates. The challenge comes with large “dynamic” corpora, which are updated regularly, and where preprocessing is much more difficult. This paper provides an overview of the NOW corpus (News on the Web), which is currently 8.2 billion words in size, and which grows by about 170 million words each month. We discuss the architecture of NOW, and provide many examples that show how data from NOW can (uniquely) be extracted to look at a wide range of ongoing changes in English.
Content
1 Predicting learner knowledge of individual words using machine learning
Drilon Avdiu, Vanessa Bui, Klára Ptacinová Klimci´ková
2 Automatic Generation and Semantic Grading of Esperanto Sentences in a Teaching Context
Eckhard Bick
3 Toward automatic improvement of language produced by non-native language learners
Mathias Creutz, Eetu Sjöblom
4 Linguistic features and proficiency classification in L2 Spanish and L2 Portuguese
Iria del Ri´o
5 Integrating large-scale web data and curated corpus data in a search engine supporting German literacy education
Sabrina Dittrich, Zarah Weiss, Hannes Schröter, Detmar Meurers
6 Formalism for a language agnostic language learning game and productive grid generation
Sylvain Hatier, Arnaud Bey, Mathieu Loiseau
7 Understanding Vocabulary Growth Through An Adaptive Language Learning System
Elma Kerz, Andreas Burgdorf, Daniel Wiechmann, Stefan Meeger,Yu Qiao, Christian Kohlschein, Tobias Meisen
8 Summarization Evaluation meets Short-Answer Grading
Margot Mieskes, Ulrike Padó
9 Experiments on Non-native Speech Assessment and its Consistency
Ziwei Zhou, Sowmya Vajjala, Seyed Vahid Mirnezami
10 The Impact of Spelling Correction and Task Context on Short Answer Assessment for Intelligent Tutoring Systems
Ramon Ziai, Florian Nuxoll, Kordula De Kuthy, Björn Rudzewitz, Detmar Meurers