Refine
Year of publication
Document Type
- Part of a Book (23) (remove)
Language
- English (23) (remove)
Has Fulltext
- yes (23)
Keywords
- Computerlinguistik (23) (remove)
Publicationstate
- Veröffentlichungsversion (10)
- Postprint (7)
- Preprint (1)
- Zweitveröffentlichung (1)
Reviewstate
- (Verlags)-Lektorat (9)
- Peer-Review (7)
- Peer-review (1)
- Verlags-Lektorat (1)
Publisher
- Springer (6)
- The Association for Computational Linguistics (3)
- de Gruyter (2)
- European language resources association (ELRA) (1)
- Graphen & Netzwerke; AG des Verbandes Digital Humanities im deutschsprachigen Raum e.V. (1)
- L'Harmattan (1)
- MIT Press (1)
- Narr (1)
- Oxford University Press (1)
- Routledge (1)
In this paper we present an evaluation of rule-based morphological components for German for use in an interactive editing environment. The criteria for the evaluation are deduced from the intended use of these components, namely availability, performance, programming interfaces, and analysis quality. We evaluated systems developed and maintained since decades as well as new systems. However, we note serious general shortcomings when looking closer at recent implementations and come to the conclusion that the oldest system is the only one that satisfies our requirements.
We use a convolutional neural network to perform authorship identification on a very homogeneous dataset of scientific publications. In order to investigate the effect of domain biases, we obscure words below a certain frequency threshold, retaining only their POS-tags. This procedure improves test performance due to better generalization on unseen data. Using our method, we are able to predict the authors of scientific publications in the same discipline at levels well above chance.
German is a language with complex morphological processes. Its long and often ambiguous word forms present a bottleneck problem in natural language processing. As a step towards morphological analyses of high quality, this paper introduces a morphological treebank for German. It is derived from the linguistic database CELEX which is a standard resource for German morphology. We build on its refurbished, modernized and partially revised version. The derivation of the morphological trees is not trivial, especially for such cases of conversions which are morpho-semantically opaque and merely of diachronic interest. We develop solutions and present exemplary analyses. The resulting database comprises about 40,000 morphological trees of a German base vocabulary whose format and grade of detail can be chosen according to the requirements of the applications. The Perl scripts for the generation of the treebank are publicly available on github. In our discussion, we show some future directions for morphological treebanks. In particular, we aim at the combination with other reliable lexical resources such as GermaNet.
Constructing a Corpus
(2016)
In this chapter, we discuss steps toward extending CMDI’s semantic interoperability beyond the Social Sciences and Humanities: We stress the need for an initial data curation step, in part supported by a relation registry that helps impose some structure on CMDI vocabulary; we describe the use of authority file information and other controlled vocabulary to help connecting CMDI-based metadata to existing Linked Data; we show how significant parts of CMDI-based metadata can be converted to bibliographic metadata standards and hence entered into library catalogs; and finally we describe first steps to convert CMDI-based metadata to RDF. The initial grassroots approach of CMDI (meaning that anybody can define metadata descriptors and components) mirrors the AAA slogan of the Semantic Web (“Anyone can say Anything about Any topic”). Ironically, this makes it hard to fully link CMDI-based metadata to other Semantic Web datasets. This paper discusses the challenges of this enterprise.
We introduce a method for error detection in automatically annotated text, aimed at supporting the creation of high-quality language resources at affordable cost. Our method combines an unsupervised generative model with human supervision from active learning. We test our approach on in-domain and out-of-domain data in two languages, in AL simulations and in a real world setting. For all settings, the results show that our method is able to detect annotation errors with high precision and high recall.
Question Answering Systems for retrieving information from Knowledge Graphs (KG) have become a major area of interest in recent years. Current systems search for words and entities but cannot search for grammatical phenomena. The purpose of this paper is to present our research on developing a QA System that answers natural language questions about German grammar.
Our goal is to build a KG which contains facts and rules about German grammar, and is also able to answer specific questions about a concrete grammatical issue. An overview of the current research in the topic of QA systems and ontology design is given and we show how we plan to construct the KG by integrating the data in the grammatical information system Grammis, hosted by the Leibniz-Institut für Deutsche Sprache (IDS). In this paper, we describe the construction of the initial KG, sketch our resulting graph, and demonstrate the effectiveness of such an approach. A grammar correction component will be part of a later stage. The paper concludes with the potential areas for future research.
Different Views on Markup
(2010)
In this chapter, two different ways of grouping information represented in document markup are examined: annotation levels, referring to conceptual levels of description, and annotation layers, referring to the technical realisation of markup using e.g. document grammars. In many current XML annotation projects, multiple levels are integrated into one layer, often leading to the problem of having to deal with overlapping hierarchies. As a solution, we propose a framework for XML-based multiple, independent XML annotation layers for one text, based on an abstract representation of XML documents with logical predicates. Two realisations of the abstract representation are presented, a Prolog fact base format together with an application architecture, and a specification for XML native databases. We conclude with a discussion of projects that have currently adopted this framework.
Discourse segmentation is the division of a text into minimal discourse segments, which form the leaves in the trees that are used to represent discourse structures. A definition of elementary discourse segments in German is provided by adapting widely used segmentation principles for English minimal units, while considering punctuation, morphology, sytax, and aspects of the logical document structure of a complex text type, namely scientific articles. The algorithm and implementation of a discourse segmenter based on these principles is presented, as well an evaluation of test runs.