Refine
Document Type
- Article (1)
- Part of a Book (1)
Has Fulltext
- yes (2)
Is part of the Bibliography
- no (2)
Keywords
- XML (2) (remove)
Publicationstate
- Postprint (2) (remove)
Reviewstate
- Peer-Review (1)
The paper investigates the evolution of document grammars from a linguistic point of view. Document grammars have been developed in the past decades in order to formalize knowledge on the structure of textual information. A well-known instance of a document grammar is the »Document Type Definition« (DTD) as part of the Extensible Markup Language (XML). DTDs allow to define so-called tree grammars that constrain the application of tag-sets in the process of annotation of a document. In an XML-based document workflow, DTDs play a crucial role for validation and transforming huge amounts of texts in standardized data formats. An interesting point in the development of XML DTDs is the fact that the restriction of the formal expressiveness paved the way to understand the formal properties of document grammars better and to develop more a powerful version like XML Schema recently. In this sense, the simplicity of the original approach, resulting from the necessary restriction of previous approaches, yielded new complexity on formally understood grounds.
This chapter addresses the requirements and linguistic foundations of automatic relational discourse analysis of complex text types such as scientific journal articles. It is argued that besides lexical and grammatical discourse markers, which have traditionally been employed in discourse parsing, cues derived from the logical and generical document structure and the thematic structure of a text must be taken into account. An approach to modelling such types of linguistic information in terms of XML-based multi-layer annotations and to a text-technological representation of additional knowledge sources is presented. By means of quantitative and qualitative corpus analyses, cues and constraints for automatic discourse analysis can be derived. Furthermore, the proposed representations are used as the input sources for discourse parsing. A short overview of the projected parsing architecture is given.