OPUS 4 | Search

74 search hits

Unification of XML Documents with Concurrent Markup (2004)

Witt, Andreas ; Lüngen, Harald ; Sasaki, Felix ; Goecke, Daniela

Standardisierung orthographischer Transkriptionen: Ein SGML/TEI-basierter Vorschlag für VERBMOBIL (VM-Memo 117) (1997)

Witt, Andreas ; Lüngen, Harald ; Gibbon, Dafydd

Enhancing speech corpus resources with multiple lexical tag layers (2000)

Witt, Andreas ; Lüngen, Harald ; Gibbon, Dafydd

We describe a general two-stage procedure for re-using a custom corpus for spoken language system development involving a transformation from character-based markup to XML, and DSSSL stylesheet-driven XML markup enhancement with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types).

Unification of XML Documents with Concurrent Markup (2005)

Witt, Andreas ; Goecke, Daniela ; Sasaki, Felix ; Lüngen, Harald

An approach to the unification of XML (Extensible Markup Language) documents with identical textual content and concurrent markup in the framework of XML-based multi-layer annotation is introduced. A Prolog program allows the possible relationships between element instances on two annotation layers that share PCDATA to be explored, and computes a target node hierarchy for a well-formed, merged XML document. Special attention is paid to identity conflicts between element instances, for which a default solution is provided that takes into account the metarelations holding between element types on the different annotation layers. In addition, a user can specify rules that prescribe how identity conflicts should be resolved for certain element types.

Igel: Comparing document grammars using XQuery (2013)

Sperberg-McQueen, Christopher M. ; Schonefeld, Oliver ; Kupietz, Marc ; Lüngen, Harald ; Witt, Andreas

Igel is a small XQuery-based web application for examining a collection of document grammars; in particular, for comparing related document grammars to get a better overview of their differences and similarities. In its initial form, Igel reads only DTDs and provides only simple lists of constructs in them (elements, attributes, notations, parameter entities). Our continuing work is aimed at making Igel provide more sophisticated and useful information about document grammars and building the application into a useful tool for the analysis (and the maintenance!) of families of related document grammars.

Building and Annotating a Corpus of German-Language Newsgroups (2015)

Schröck, Jasmin ; Lüngen, Harald

Usenet is a large online resource containing user-generated messages (news articles) organised in discussion groups (newsgroups) which deal with a wide variety of different topics. We describe the download, conversion, and annotation of a comprehensive German news corpus for integration in DeReKo, the German Reference Corpus hosted at the Institut für Deutsche Sprache in Mannheim.

Valenz und Kookkurrenz (2015)

Perkuhn, Rainer ; Belica, Cyril ; Keibel, Holger ; Kupietz, Marc ; Lüngen, Harald

Introduction: Modeling, Learning and Processing of Text-Technological Data Structures (2011)

Mehler, Alexander ; Kühnberger, Kai-Uwe ; Lobin, Henning ; Lüngen, Harald ; Storrer, Angelika ; Witt, Andreas

Researchers in many disciplines, sometimes working in close cooperation, have been concerned with modeling textual data in order to account for texts as the prime information unit of written communication. The list of disciplines includes computer science and linguistics as well as more specialized disciplines like computational linguistics and text technology. What many of these efforts have in common is the aim to model textual data by means of abstract data types or data structures that support at least the semi-automatic processing of texts in any area of written communication.

Building linguistic corpora from Wikipedia articles and discussions (2014)

Margaretha, Eliza ; Lüngen, Harald

Wikipedia is a valuable resource, useful as a linguistic corpus or a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus - DeReKo). Our approach is a two-stage conversion combining parsing using the Sweble parser and transformation using XSLT stylesheets. The conversion approach is able to successfully generate rich and valid corpora regardless of language. We also introduce a method to segment user contributions in talk pages into postings.

Multi-Dimensional Markup: N-way relations as a generalisation over possible relations between annotation layers (2008)

Lüngen, Harald ; Witt, Andreas
