Methods for the semantic analysis of document markup

We present an approach to investigating what kind of semantic information is regularly associated with the structural markup of scientific articles. This approach addresses the need for an explicit formal description of the semantics of text-oriented XML documents. The domain of our investigation is a corpus of scientific articles from psychology and linguistics, drawn from English and German journals that are available online. For our analyses, we provide XML markup representing two semantic levels: the thematic level (i.e. the topics in the text world that the article is about) and the functional or rhetorical level. Our hypothesis is that these semantic levels correlate with the articles' document structure, which is also represented in XML. The articles have been annotated with the corresponding information. Each of the three informational levels is modelled in a separate XML document, since in our domain the different description levels might conflict, making it impossible to model them within a single XML document. To compare and mine the resulting multi-layered XML annotations of an article, a Prolog-based approach is used. It focusses on the comparison of XML markup that is distributed among different documents. Prolog predicates have been defined for inferring relations between levels of information that are modelled in separate XML documents. We demonstrate how the Prolog tool is applied in our corpus analyses.
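The idea of comparing annotation layers that are stored in separate XML documents can be illustrated with a small sketch. It is written in Python rather than Prolog and is not the authors' tool. It assumes that all layers annotate exactly the same primary text, so that elements can be compared by their character offsets. The element names (`sec`, `background`, `procedure`, `topic`), the function names, and the relation labels are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def primary_text(xml_text):
    """Return the plain text of a layer with all markup removed."""
    return "".join(ET.fromstring(xml_text).itertext())


def element_spans(xml_text):
    """Return (tag, start, end) character offsets for every element.

    Offsets are measured over the concatenated text content, so they are
    comparable across layers that annotate the same primary text.
    """
    spans = []

    def walk(elem, pos):
        start = pos
        pos += len(elem.text or "")
        for child in elem:
            pos = walk(child, pos)
            pos += len(child.tail or "")
        spans.append((elem.tag, start, pos))
        return pos

    walk(ET.fromstring(xml_text), 0)
    return spans


def relation(a, b):
    """Classify how span a relates to span b (labels are illustrative)."""
    (_, s1, e1), (_, s2, e2) = a, b
    if (s1, e1) == (s2, e2):
        return "identical"
    if s2 <= s1 and e1 <= e2:
        return "included_in"
    if s1 <= s2 and e2 <= e1:
        return "includes"
    if e1 <= s2 or e2 <= s1:
        return "disjoint"
    return "overlaps"


# Three hypothetical layers over the same primary text.
structure = "<article><sec>Intro text. </sec><sec>Method text.</sec></article>"
rhetorical = ("<doc><background>Intro text. </background>"
              "<procedure>Method text.</procedure></doc>")
thematic = "<doc>Intro <topic>text. Method</topic> text.</doc>"

# The comparison is only meaningful if the layers share their primary text.
assert primary_text(structure) == primary_text(rhetorical) == primary_text(thematic)

# Relate every rhetorical and thematic element to every structural element.
for layer in (rhetorical, thematic):
    for a in element_spans(layer):
        for b in element_spans(structure):
            print(a[0], relation(a, b), b[0])
```

In this sketch the `topic` element crosses the boundary between the two `sec` elements, which is the kind of conflict that prevents the levels from being merged into a single well-formed XML document and motivates keeping them in separate files.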

Metadata
Author:	Petra Saskia Bayerl, Harald Lüngen, Daniela Goecke, Andreas Witt, Daniel Naber
URN:	urn:nbn:de:bsz:mh39-48014
Parent Title (English):	Proceedings of the ACM Symposium on Document Engineering (DocEng 2003)
Publisher:	ACM
Place of publication:	New York
Editor:	Cécile Roisin, Ethan Munson, Christine Vanoirbeek
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2003
Date of Publication (online):	2016/04/25
Publication state:	Published version
Review state:	(Publisher's) editorial review
Tag:	Information Extraction; Prolog; Semantic Analysis; XML
First Page:	161
Last Page:	170
DDC classes:	400 Language / 410 Linguistics
Open Access?:	yes
Linguistics-Classification:	Computational linguistics
Licence (German):	Creative Commons - Attribution-NonCommercial-NoDerivs 3.0 Germany