Refine
Year of publication
Document Type
- Article (7)
- Part of a Book (5)
- Conference Proceeding (2)
- Working Paper (1)
Language
- German (15)
Has Fulltext
- yes (15)
Is part of the Bibliography
- no (15)
Keywords
- Korpus <Linguistik> (9)
- Deutsch (7)
- Computerlinguistik (4)
- Deutsches Referenzkorpus (DeReKo) (3)
- Methode (3)
- Neologismus (3)
- Institut für Deutsche Sprache <Mannheim> (2)
- Studium (2)
- Annotation (1)
- Automatische Sprachanalyse (1)
Publication state
Review state
- (Verlags)-Lektorat (7)
- Peer-Review (2)
- Verlagslektorat (1)
Publisher
- Institut für Deutsche Sprache (4)
- GSCL (2)
- de Gruyter (2)
- BBAW (1)
- Campus (1)
- Universität Bielefeld (1)
- nisaba (1)
Wikipedia is a valuable resource, useful as a linguistic corpus or a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus, DeReKo). Our approach is a two-stage conversion that combines parsing with the Sweble parser and transformation with XSLT stylesheets. The conversion successfully generates rich and valid corpora regardless of language. We also introduce a method for segmenting user contributions on talk pages into postings.
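The posting segmentation mentioned in the abstract can be sketched as a signature-based heuristic. This is a minimal illustration only: the timestamp pattern and the assumption that a signature ends a posting are simplifications, and the actual method relies on richer cues (signatures, indentation, timestamps).

```python
import re

# Illustrative timestamp pattern for German-Wikipedia signatures such as
# "12:34, 5. Jan. 2011 (CET)"; the real segmentation uses additional cues.
TIMESTAMP = re.compile(r"\d{2}:\d{2}, \d{1,2}\. \w{3}\. \d{4} \(CES?T\)")

def segment_postings(talk_text: str) -> list[str]:
    """Split a talk-page thread into postings at signature timestamps."""
    postings, current = [], []
    for line in talk_text.splitlines():
        current.append(line)
        if TIMESTAMP.search(line):  # a signature usually closes a posting
            postings.append("\n".join(current).strip())
            current = []
    # Keep any trailing, unsigned material as a final posting.
    if any(l.strip() for l in current):
        postings.append("\n".join(current).strip())
    return postings
```

A thread with two signed contributions would thus be split into two postings, each ending at its signature line.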
In the subproject CI "SemDok" of the DFG research unit "Texttechnologische Informationsmodellierung", a text parser for the discourse structures of scientific journal articles based on Rhetorical Structure Theory was developed. The main conceptual and technical features of the chart parser, and the parametrization options for parsing experiments that follow from them, are described. In addition, HPVtz., a tool for visualizing parsing results (RST trees in an XML application) and navigating within them, is presented.
Editorial
(2011)
Linguistische Annotationen für die Analyse von Gliederungsstrukturen wissenschaftlicher Texte
(2012)
This paper describes an approach to modelling a general-language wordnet, GermaNet, and a domain-specific wordnet, TermNet, in the web ontology language OWL. While the modelling process for GermaNet adopts relevant recommendations developed for the English Princeton WordNet, for TermNet an alternative modelling concept is developed that takes the special characteristics of domain-specific terminologies into account. We present a proposal for linking a general-language wordnet and a terminological wordnet within the framework of OWL, and on this basis discuss problems and alternative modelling approaches.
Most research on automated categorization of documents has concentrated on the assignment of one or many categories to a whole text. However, new applications, e.g. in the area of the Semantic Web, require a richer and more fine-grained annotation of documents, such as detailed thematic information about the parts of a document. Hence we investigate the automatic categorization of text segments of scientific articles with XML markup into 16 topic types from a text type structure schema. A corpus of 47 linguistic articles was provided with XML markup on different annotation layers representing text type structure, logical document structure, and grammatical categories. Six different feature extraction strategies were applied to this corpus and combined in various parametrizations in different classifiers. The aim was to explore the contribution of each type of information, in particular the logical structure features, to the classification accuracy. The results suggest that some of the topic types of our hierarchy are successfully learnable, while the features from the logical structure layer had no particular impact on the results.
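The segment-classification setup described above can be illustrated with a toy bag-of-words naive Bayes classifier. This is a sketch under strong simplifying assumptions: it uses only word features and two invented topic labels, whereas the study combines six feature extraction strategies (including logical document structure) across 16 topic types and several classifiers.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSegments:
    """Toy naive Bayes over text segments; labels and features illustrative."""

    def fit(self, segments, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(segments, labels):
            for w in text.lower().split():
                self.word_counts[label][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best, best_lp = None, float("-inf")
        for label, n in self.label_counts.items():
            lp = math.log(n / total)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in text.lower().split():
                # Laplace-smoothed per-word likelihood
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

Richer feature strategies (e.g. position in the logical document structure) could be folded in by appending pseudo-tokens to each segment before fitting, which is one simple way to combine heterogeneous feature layers in such a model.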