Refine
Document Type
- Part of a Book (7)
- Article (1)
Has Fulltext
- yes (8)
Keywords
- Annotation (3)
- Computerlinguistik (3)
- Diskursanalyse (2)
- Korpus <Linguistik> (2)
- XML (2)
- XML (Extensible Markup Language) (2)
- Annotations (1)
- Automatische Sprachanalyse (1)
- Computational linguistics (1)
- Concurrent markup (1)
Publicationstate
- Postprint (8) (remove)
Reviewstate
Publisher
- Springer (4)
- Benjamins (1)
- Oxford University Press (1)
- Springer-Verlag (1)
- Wilhelm Fink (1)
Dieses Kapitel gibt einen Überblick über Korpora internetbasierter Kommunikation, die als digitale Ressourcen frei zur Verfügung stehen und für eigene linguistische Forschungsarbeiten genutzt werden können. In Abschnitt 1 erläutern wir korpuslinguistische Basiskonzepte, die für die Arbeit mit Korpora internetbasierter Kommunikation benötigt werden, und präzisieren die Sprachgebrauchsdomäne Internetbasierte Kommunikation, die den Gegenstand des hier beschriebenen Ressourcentyps bildet. Abschnitt 2 gibt einen Überblick zu existierenden Korpusressourcen für das Deutsche und stellt ausgewählte Korpora zu weiteren europäischen Sprachen vor. In Abschnitt 3 geben wir abschließend einen kurzen Einblick in aktuelle Forschungsfelder, die sich im Bereich der Korpuslinguistik und Sprachtechnologie in Bezug auf den Aufbau und die Aufbereitung von Korpora internetbasierter Kommunikation stellen.
This study examines what kind of cues and constraints for discourse interpretation can be derived from the logical and generic document structure of complex texts by the example of scientific journal articles. We performed statistical analysis on a corpus of scientific articles annotated on different annotations layers within the framework of XML-based multi-layer annotation. We introduce different discourse segment types that constrain the textual domains in which to identify rhetorical relation spans, and we show how a canonical sequence of text type structure categories is derived from the corpus annotations. Finally, we demonstrate how and which text type structure categories assigned to complex discourse segments of the type “block” statistically constrain the occurrence of rhetorical relation types.
Discourse parsing of complex text types such as scientific research articles requires the analysis of an input document on linguistic and structural levels that go beyond traditionally employed lexical discourse markers. This chapter describes a text-technological approach to discourse parsing. Discourse parsing with the aim of providing a discourse structure is seen as the addition of a new annotation layer for input documents marked up on several linguistic annotation levels. The discourse parser generates discourse structures according to the Rhetorical Structure Theory. An overview of the knowledge sources and components for parsing scientific joumal articles is given. The parser’s core consists of cascaded applications of the GAP, a Generic Annotation Parser. Details of the chart parsing algorithm are provided, as well as a short evaluation in terms of comparisons with reference annotations from our corpus and with recently developed Systems with a similar task.
This chapter addresses the requirements and linguistic foundations of automatic relational discourse analysis of complex text types such as scientific journal articles. It is argued that besides lexical and grammatical discourse markers, which have traditionally been employed in discourse parsing, cues derived from the logical and generical document structure and the thematic structure of a text must be taken into account. An approach to modelling such types of linguistic information in terms of XML-based multi-layer annotations and to a text-technological representation of additional knowledge sources is presented. By means of quantitative and qualitative corpus analyses, cues and constraints for automatic discourse analysis can be derived. Furthermore, the proposed representations are used as the input sources for discourse parsing. A short overview of the projected parsing architecture is given.
Different Views on Markup
(2010)
In this chapter, two different ways of grouping information represented in document markup are examined: annotation levels, referring to conceptual levels of description, and annotation layers, referring to the technical realisation of markup using e.g. document grammars. In many current XML annotation projects, multiple levels are integrated into one layer, often leading to the problem of having to deal with overlapping hierarchies. As a solution, we propose a framework for XML-based multiple, independent XML annotation layers for one text, based on an abstract representation of XML documents with logical predicates. Two realisations of the abstract representation are presented, a Prolog fact base format together with an application architecture, and a specification for XML native databases. We conclude with a discussion of projects that have currently adopted this framework.
An approach to the unification of XML (Extensible Markup Language) documents with identical textual content and concurrent markup in the framework of XML-based multi-layer annotation is introduced. A Prolog program allows the possible relationships between element instances on two annotation layers that share PCDATA to be explored and also the computing of a target node hierarchy for a well-formed, merged XML document. Special attention is paid to identity conflicts between element instances, for which a default solution that takes into account metarelations that hold between element types on the different annotation layers is provided. In addition, rules can be specified by a user to prescribe how identity conflicts should be solved for certain element types.
Researchers in many disciplines, sometimes working in close cooperation, have been concerned with modeling textual data in order to account for texts as the prime information unit of written communication. The list of disciplines includes computer science and linguistics as well as more specialized disciplines like computational linguistics and text technology. What many of these efforts have in common is the aim to model textual data by means of abstract data types or data structures that support at least the semi-automatic processing of texts in any area of written communication.
Discourse segmentation is the division of a text into minimal discourse segments, which form the leaves in the trees that are used to represent discourse structures. A definition of elementary discourse segments in German is provided by adapting widely used segmentation principles for English minimal units, while considering punctuation, morphology, sytax, and aspects of the logical document structure of a complex text type, namely scientific articles. The algorithm and implementation of a discourse segmenter based on these principles is presented, as well an evaluation of test runs.