Text type structure and logical document structure

Most research on automated categorization of documents has concentrated on the assignment of one or more categories to a whole text. However, new applications, e.g. in the area of the Semantic Web, require a richer and more fine-grained annotation of documents, such as detailed thematic information about the parts of a document. Hence we investigate the automatic categorization of text segments of scientific articles with XML markup into 16 topic types from a text type structure schema. A corpus of 47 linguistic articles was provided with XML markup on different annotation layers representing text type structure, logical document structure, and grammatical categories. Six different feature extraction strategies were applied to this corpus and combined in various parametrizations in different classifiers. The aim was to explore the contribution of each type of information, in particular the logical structure features, to the classification accuracy. The results suggest that some of the topic types of our hierarchy are successfully learnable, while the features from the logical structure layer had no particular impact on the results.

Author: Hagen Langer, Harald Lüngen, Petra Saskia Bayerl
Parent Title (English): Proceedings of the ACL-workshop on discourse annotation
Document Type: Conference Proceeding
Year of first Publication: 2004
GND Keyword: Computational linguistics; Text type
DDC classes: 400 Language / 410 Linguistics
Open Access?: yes
Licence (German): Protected by copyright