410 Linguistics
Usenet is a large online resource containing user-generated messages (news articles) organised in discussion groups (newsgroups) which deal with a wide variety of different topics. We describe the download, conversion, and annotation of a comprehensive German news corpus for integration in DeReKo, the German Reference Corpus hosted at the Institut für Deutsche Sprache in Mannheim.
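As a rough illustration of the conversion step, the sketch below parses a single news article into a small XML tree using only the Python standard library. The element names (`article`, `header`, `field`, `body`) and the sample message are hypothetical and chosen for illustration; they are not the DeReKo target format, and this is not the pipeline described in the paper.

```python
from email import message_from_string
import xml.etree.ElementTree as ET

# Invented sample article: header lines, a blank line, then the body.
RAW_ARTICLE = """\
From: user@example.org
Newsgroups: de.test
Subject: Hallo
Message-ID: <1@example.org>

Dies ist ein Test.
"""


def article_to_xml(raw: str) -> ET.Element:
    """Convert one raw news article into a minimal XML element (hypothetical schema)."""
    msg = message_from_string(raw)
    root = ET.Element("article")
    header = ET.SubElement(root, "header")
    # Copy a few common header fields if they are present.
    for name in ("From", "Newsgroups", "Subject", "Message-ID"):
        value = msg.get(name)
        if value is not None:
            field = ET.SubElement(header, "field", name=name)
            field.text = value
    # For a non-multipart message, get_payload() returns the body as a string.
    body = ET.SubElement(root, "body")
    body.text = msg.get_payload()
    return root


if __name__ == "__main__":
    print(ET.tostring(article_to_xml(RAW_ARTICLE), encoding="unicode"))
```

Keeping the header fields as attributes-plus-text makes it easy to filter articles by newsgroup later; a real conversion would also need to handle character encodings, quoted text, and signatures.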
Discourse segmentation is the division of a text into minimal discourse segments, which form the leaves of the trees used to represent discourse structures. A definition of elementary discourse segments in German is provided by adapting widely used segmentation principles for English minimal units, while considering punctuation, morphology, syntax, and aspects of the logical document structure of a complex text type, namely scientific articles. The algorithm and implementation of a discourse segmenter based on these principles are presented, as well as an evaluation of test runs.
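To make the idea of minimal segments concrete, here is a deliberately simplified sketch that uses punctuation and a short list of subordinating conjunctions as the only cues. The conjunction list, the function name `segment`, and the splitting rules are assumptions for illustration; the segmenter described in the paper also draws on morphology, syntax, and logical document structure, none of which is modelled here.

```python
import re

# Hypothetical, incomplete list of German subordinating conjunctions.
SUBORDINATORS = ("weil", "dass", "obwohl", "wenn", "während", "nachdem")

# Split after sentence-final punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Split at a comma that is directly followed by one of the conjunctions.
_CLAUSE_START = re.compile(r",\s+(?=(?:" + "|".join(SUBORDINATORS) + r")\b)")


def segment(text: str) -> list[str]:
    """Return a flat list of candidate discourse segments (toy heuristic)."""
    segments = []
    for sentence in _SENTENCE_END.split(text.strip()):
        for part in _CLAUSE_START.split(sentence):
            part = part.strip()
            if part:
                segments.append(part)
    return segments


if __name__ == "__main__":
    example = (
        "Der Artikel ist lang, weil er viele Tabellen enthält. "
        "Er ist trotzdem lesbar."
    )
    for seg in segment(example):
        print(seg)
```

A heuristic like this over-segments and under-segments in many cases (abbreviations, enumerations, relative clauses), which is one reason a principled definition of segment boundaries is needed.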