Refine
Year of publication
Document Type
- Conference Proceeding (11)
- Part of a Book (8)
- Article (3)
Language
- English (22) (remove)
Has Fulltext
- yes (22)
Keywords
- Diskursanalyse (5)
- Korpus <Linguistik> (5)
- Ontologie <Wissensverarbeitung> (4)
- Digital Humanities (3)
- Parser (3)
- Annotation (2)
- Computerlinguistik (2)
- Computerunterstützte Kommunikation (2)
- Deutsch (2)
- Gefangenenliteratur (2)
Publicationstate
- Zweitveröffentlichung (12)
- Postprint (5)
- Veröffentlichungsversion (4)
Reviewstate
- (Verlags)-Lektorat (15)
- Peer-Review (3)
Publisher
- Springer (3)
- ICCC Press (2)
- de Gruyter (2)
- ACM (1)
- Benjamins (1)
- EPFL/UNIL (1)
- Foi-Commerce (1)
- Foris (1)
- Gesellschaft für Linguistische Datenverarbeitung (1)
- Institut für Kognitionswissenschaft Universität Osnabrück (1)
From Open Source to Open Information. Collaborative Methods in Creating XML-based Markup Languages
(2000)
The administration of electronic publication in the Information Era congregates old and new problems, especially those related with Information Retrieval and Automatic Knowledge Extraction. This article presents an Information Retrieval System that uses Natural Language Processing and Ontology to index collection’s texts. We describe a system that constructs a domain specific ontology, starting from the syntactic and semantic analyses of the texts that compose the collection. First the texts are tokenized, then a robust syntactic analysis is made, subsequently the semantic analysis is accomplished in conformity with a metalanguage of knowledge representation, based on a basic ontology composed of 47 classes. The ontology, automatically extracted, generates richer domain specific knowledge. It propitiates, through its semantic net, the right conditions for the user to find with larger efficiency and agility the terms adapted for the consultation to the texts. A prototype of this system was built and used for the indexation of a collection of 221 electronic texts of Information Science written in Portuguese from Brazil. Instead of being based in statistical theories, we propose a robust Information Retrieval System that uses cognitive theories, allowing a larger efficiency in the answer to the users queries.
A text parsing component designed to be part of a system that assists students in academic reading an writing is presented. The parser can automatically add a relational discourse structure annotation to a scientific article that a user wants to explore. The discourse structure employed is defined in an XML format and is based the Rhetorical Structure Theory. The architecture of the parser comprises pre-processing components which provide an input text with XML annotations on different linguistic and structural layers. In the first version these are syntactic tagging, lexical discourse marker tagging, logical document structure, and segmentation into elementary discourse segments. The algorithm is based on the shift-reduce parser by Marcu (2000) and is controlled by reduce operations that are constrained by linguistic conditions derived from an XML-encoded discourse marker lexicon. The constraints are formulated over multiple annotation layers of the same text.
Discourse segmentation is the division of a text into minimal discourse segments, which form the leaves in the trees that are used to represent discourse structures. A definition of elementary discourse segments in German is provided by adapting widely used segmentation principles for English minimal units, while considering punctuation, morphology, sytax, and aspects of the logical document structure of a complex text type, namely scientific articles. The algorithm and implementation of a discourse segmenter based on these principles is presented, as well an evaluation of test runs.
In the project SemDok (Generic document structures in linearly organised texts) funded by the German Research Foundation DFG, a discourse parser for a complex type (scientific articles by example), is being developed. Discourse parsing (henceforth DP) according to the Rhetorical Structure Theory (RST) (Mann and Taboada, 2005; Marcu, 2000) deals with automatically assigning a text a tree structure in which discourse segments and rhetorical relations between them are marked, such as Concession. For identifying the combinable segments, declarative rules are employed, which describe linguistic and structural cues and constraints about possible combinations by referring to different XML annotation layers of the input text, and external knowledge bases such as a discourse marker lexicon, a lexico-semantic ontology (later to be combined with a domain ontology), and an ontology of rhetorical relations. In our text-technological environment, the obvious choice of formalism to represent such ontologies is OWL (Smith et al., 2004). In this paper, we describe two OWL ontologies and how they are consulted from the discourse parser to solve certain tasks within DP. The first ontology is a taxononomy of rhetorical relations which was developed in the project. The second one is an OWL version of GermaNet, the model of which we designed together with our project partners.
Knowledge in textual form is always presented as visually and hierarchically structured units of text, which is particularly true in the case of academic texts. One research hypothesis of the ongoing project Knowledge ordering in texts - text structure and structure visualisations as sources of natural ontologies1 is that the textual structure of academic texts effectively mirrors essential parts of the knowledge structure that is built up in the text. The structuring of a modern dissertation thesis (e.g. in the form of an automatically generated table of contents - toes), for example, represents a compromise between requirements of the text type and the methodological and conceptual structure of its subject-matter. The aim of the project is to examine how visual-hierarchical structuring systems are constructed, how knowledge structures are encoded in them, and how they can be exploited to automatically derive ontological knowledge for navigation, archiving, or search tasks. The idea to extract domain concepts and semantic relations mainly from the structural and linguistic information gathered from tables of contents represents a novel approach to ontology learning.