Refine
Year of publication
Document Type
- Conference Proceeding (60)
- Part of a Book (27)
- Article (11)
- Other (1)
Language
- English (99) (remove)
Keywords
- Korpus <Linguistik> (40)
- Annotation (19)
- Digital Humanities (14)
- Computerlinguistik (13)
- Sprachdaten (11)
- XML (11)
- Auszeichnungssprache (10)
- Forschungsdaten (9)
- Langzeitarchivierung (8)
- Standardisierung (6)
Publicationstate
- Veröffentlichungsversion (68)
- Postprint (11)
- Zweitveröffentlichung (9)
Reviewstate
- (Verlags)-Lektorat (52)
- Peer-Review (26)
- Verlags-Lektorat (1)
Publisher
- European Language Resources Association (ELRA) (16)
- Extreme Markup Languages Conference (6)
- Springer (6)
- European language resources association (ELRA) (4)
- Oxford University Press (4)
- de Gruyter (4)
- European Language Resources Association (3)
- University of Illinois (3)
- University of Oulu (3)
- Universität Tübingen (3)
We describe a general two-stage procedure for re-using a custom corpus for spoken language system development involving a transformation from character-based markup to XML, and DSSSL stylesheet-driven XML markup enhancement with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types).
Co-reference annotation and resources: a multilingual corpus of typologically diverse languages
(2002)
This article introduces a dialogue corpus containing data from two typologically different languages, Japanese and Kilivila. The corpus is annotated in accordance with language specific annotation schemes for co-referential and similar relations. The article describes the corpus data, the properties of language specific co-reference in the two languages and a methodology for its annotation. Examples from the corpus show how this methodology is used in the workflow of the annotation process.
This paper deals with the problem of how to interrelate theory-specific treebanks and how to transform one treebank format to another. Currently, two approaches to achieve these goals can be differentiated. The first creates a mapping algorithm between treebank formats. Categories of a source format are transformed into a target format via a given set of general or language-specific mapping rules. The second relates treebanks via a transformation to a general model of linguistic categories, for example based on the EAGLES recommendations for syntactic annotations of corpora, or relying on the HPSG framework. This paper proposes a new methodology as a solution for these desiderata.
We present an approach on how to investigate what kind of semantic information is regularly associated with the structural markup of scientific articles. This approach addresses the need for an explicit formal description of the semantics of text-oriented XML-documents. The domain of our investigation is a corpus of scientific articles from psychology and linguistics from both English and German online available journals. For our analyses, we provide XML-markup representing two kinds of semantic levels: the thematic level (i.e. topics in the text world that the article is about) and the functional or rhetorical level. Our hypothesis is that these semantic levels correlate with the articles’ document structure also represented in XML. Articles have been annotated with the appropriate information. Each of the three informational levels is modelled in a separate XML document, since in our domain, the different description levels might conflict so that it is impossible to model them within a single XML document. For comparing and mining the resulting multi-layered XML annotations of one article, a Prolog-based approach is used. It focusses on the comparison of XML markup that is distributed among different documents. Prolog predicates have been defined for inferring relations between levels of information that are modelled in separate XML documents. We demonstrate how the Prolog tool is applied in our corpus analyses.
The motivation for this article is to describe a methodology for interrelating and analyzing language and theory-specific corpus data from various languages. As an example phenomeon we use information structure (IS, see [3]) in treebanks from three languages: Spanish, Korean and Japanese. Korean and Japanese are typologically close, while both are typologically different from Spanish. Therefore, the problem of annotating IS is that there are diverging language-specific formal linguistic means for the realization of IS-functions (like “topicalization / contrast”) on various levels like prosody, morphology and word-order. Hence, it is necessary to describe the relations between language-specific formal means and functional views on IS, and how to operationalize these relations for corpus analysis.
Overlap in markup occurs where some markup structures do not nest, such as where the structural division of the text into lists, sections, etc., differs from the syntactic division of the text into sentences and phrases. The Multiple Annotation solution to this problem (redundant encoding in multiple forms) has many advantages: it is based on XML, the modeling of alternative annotations is possible, each level can be viewed separately, and new levels can be added at any time. But it has the significant disadvantage of independence of the separate files. These multiply annotated files can be regarded as an interrelated unit, with the text serving as the implicit link. Two representations of the information contained in the multiple files (one in Prolog and one in XML) can be programmatically derived and used together for editing, for inference, or for unification of the multiply annotated documents.