OPUS 4 | Search

Legal status and regulation of the German language in the Federal Republic of Germany (2023)

Kamocki, Paweł ; Lobin, Henning ; Witt, Andreas ; Wöllstein, Angelika

Despite being an official language of several countries in Central and Western Europe, German is not formally recognised as the official language of the Federal Republic of Germany. However, in certain situations the use of the German language, including the spelling rules, is subject to state regulation (by acts of Federal Parliament orby administrative decisions). This article presents the content of this regulation, its scope, and the historical context in which it was adopted.

Summary of the panel discussion at the 19th annual EFNIL conference in Vilnius (2023)

Kamocki, Paweł ; Witt, Andreas

A panel discussion chaired by Andreas Wilt from the Leibniz Institute for the German Language rounded off the 19th Annual EFN1L Conference in Vilnius.

Open Science and language data: Expectations vs. reality. The role of research data infrastructures (2023)

Kamocki, Paweł ; Hinrichs, Erhard ; Leinen, Peter ; Springer, Sabine ; Witt, Andreas ; Zechmann, Dorothea

Language data are essential for any scientific endeavor. However, unlike numerical data, language data are often protected by copyright, as they easily meet the threshold of originality. The role of research infrastructures (such CLARIN, DARIAH, and Text+) is to bridge the gap between uses allowed by statutory exceptions and the requirements of Open Science. This is achieved on the one hand by sharing language data produced by research organisations with the widest possible circle of persons, and on the other by mutualizing efforts towards copyright clearance and appropriate licensing of datasets.

The CLARIN infrastructure as an interoperable language technology platform for SSH and beyond (2023)

Branco, António ; Eskevich, Maria ; Frontini, Francesca ; Hajič, Jan ; Hinrichs, Erhard ; de Jong, Franciska ; Kamocki, Paweł ; König, Alexander ; Lindén, Krister ; Navarretta, Constanza ; Piasecki, Maciej ; Piperidis, Stelios ; Pitkänen, Olli ; Simov, Kiril ; Skadiņa, Inguna ; Trippel, Thorsten ; Witt, Andreas ; Zinn, Claus

CLARIN is a European Research Infrastructure Consortium developing and providing a federated and interoperable platform to support scientists in the field of the Social Sciences and Humanities in carrying-out language-related research. This contribution provides an overview of the entire infrastructure with a particular focus on tool interoperability, ease of access to research data, tools and services, the importance of sharing knowledge within and across (national) communities, and community building. By taking into account FAIR principles from the very beginning, CLARIN succeeded in becoming a successful example of a research infrastructure that is actively used by its members. The benefits CLARIN members reap from their infrastructure secure a future for their common good that is both sustainable and attractive to partners beyond the original target groups.

From <tiger2/> to ISOTiger – community driven developments for syntax annotation in SynAF (2014)

Bosch, Sonja ; Eckart, Kerstin ; Faaß, Gertrud ; Heid, Ulrich ; Lee, Kiyong ; Pareja-Lora, Antonio ; Pretorius, Laurette ; Romary, Laurent ; Witt, Andreas ; Zeldes, Amir ; Zipser, Florian

In 2010, ISO published a standard for syntactic annotation, ISO 24615:2010 (SynAF). Back then, the document specified a comprehensive reference model for the representation of syntactic annotations, but no accompanying XML serialisation. ISO’s subcommittee on language resource management (ISO TC 37/SC 4) is working on making the SynAF serialisation ISOTiger an additional part of the standard. This contribution addresses the current state of development of ISOTiger, along with a number of open issues on which we are seeking community feedback in order to ensure that ISOTiger becomes a useful extension to the SynAF reference model.

Processing of information: Linguistics, HKI and Digital Humanities. Developments in Cologne (2023)

Eide, Øyvind ; Hermes, Jürgen ; Witt, Andreas

L’article intitulé «Traitement de l’information: Spinfo, HKI et humanités numériques - l’expérience de Cologne» présente l’histoire du développement des humanités numériques au sein de l’Université de Cologne. L'institutionnalisation des humanités numériques a commencé encore à l’époque où dans le monde germanophone le périmètre de la discipline était en train d’être défini par les travaux de quelques pionniers. Parmi eux, il convient de souligner le rôle d’Elisabeth Burr, active notamment à Tubingue, Duisbourg, Brême et Leipzig.L’article retrace le développement des humanités numériques à Cologne à partir de leurs débuts dans les années soixante du 20ème siècle, en passant par leur consolidation dans les années quatre-vingt-dix, jusqu’aux deux dernières décennies, quand Cologne est devenu un centre important de cette discipline. Le processus illustre comment une nouvelle discipline scientifique peut s’institutionnaliser au sein d’une université allemande. L’article décrit la perspective de deux domaines fondateurs: le traitement linguistique de l’information (en allemand: Sprachliche Informationsverarbeitung, Spinfo) et le traitement historico-culturel de l’information (en allemand: Historisch Kulturwissenschaftliche Informationsverarbeitung, HKI) et leur synthèse, qui a abouti en 2017 à la création de l’Institut des Humanités Numériques (Digital Humanities), qui aujourd’hui est - du point de vue interne - une composante de la Faculté de Philosophie de l’Université de Cologne et - du point de vue externe - une partie intégrante de la communauté internationale des humanités numériques.

The future is now. The digital transformation in the German linguistics community and the key role of the IDS (2022)

Witt, Andreas ; Kamocki, Paweł

The Leibniz-Institute for the German Language (IDS) was established in Mannheim in 1964. Since then, it has been at the forefront of innovation in German linguistics as a hub for digital language data. This chapter presents various lessons learnt from over five decades of work by the IDS, ranging from the importance of sustainability, through its strong technical base and FAIR principles, to the IDS’ role in national and international cooperation projects and its expertise on legal and ethical issues related to language resources and language technology.

Preface (2022)

Fišer, Darja ; Witt, Andreas

Language matters. The European research infrastructure CLARIN, today and tomorrow (2022)

de Jong, Franciska ; Van Uytvanck, Dieter ; Frontini, Francesca ; van den Bosch, Antal ; Fišer, Darja ; Witt, Andreas

CLARIN stands for “Common Language Resources and Technology Infrastructure”. In 2012 CLARIN ERIC was established as a legal entity with the mission to create and maintain a digital infrastructure to support the sharing, use, and sustainability of language data (in written, spoken, or multimodal form) available through repositories from all over Europe, in support of research in the humanities and social sciences and beyond. Since 2016 CLARIN has had the status of Landmark research infrastructure and currently it provides easy and sustainable access to digital language data and also offers advanced tools to discover, explore, exploit, annotate, analyse, or combine such datasets, wherever they are located. This is enabled through a networked federation of centres: language data repositories, service centres, and knowledge centres with single sign-on access for all members of the academic community in all participating countries. In addition, CLARIN offers open access facilities for other interested communities of use, both inside and outside of academia. Tools and data from different centres are interoperable, so that data collections can be combined and tools from different sources can be chained to perform operations at different levels of complexity. The strategic agenda adopted by CLARIN and the activities undertaken are rooted in a strong commitment to the Open Science paradigm and the FAIR data principles. This also enables CLARIN to express its added value for the European Research Area and to act as a key driver of innovation and contributor to the increasing number of industry programmes running on data-driven processes and the digitalization of society at large.

Ethical issues in language resources and language technology – Tentative taxonomy (2022)

Kamocki, Paweł ; Witt, Andreas

Ethical issues in Language Resources and Language Technology are often invoked, but rarely discussed. This is at least partly because little work has been done to systematize ethical issues and principles applicable in the fields of Language Resources and Language Technology. This paper provides an overview of ethical issues that arise at different stages of Language Resources and Language Technology development, from the conception phase through the construction phase to the use phase. Based on this overview, the authors propose a tentative taxonomy of ethical issues in Language Resources and Language Technology, built around five principles: Privacy, Property, Equality, Transparency and Freedom. The authors hope that this tentative taxonomy will facilitate ethical assessment of projects in the field of Language Resources and Language Technology, and structure the discussion on ethical issues in this domain, which may eventually lead to the adoption of a universally accepted Code of Ethics of the Language Resources and Language Technology community.

Word sense alignment and disambiguation for historical encyclopedias (2021)

Hagen, Thora ; Jannidis, Fotis ; Witt, Andreas

This paper will address the challenge of creating a knowledge graph from a corpus of historical encyclopedias with a special focus on word sense alignment (WSA) and disambiguation (WSD). More precisely, we examine WSA and WSD approaches based on article similarity to link messy historical data, utilizing Wikipedia as aground-truth component – as the lack of a critical overlap in content paired with the amount of variation between and within the encyclopedias does not allow for choosing a ”baseline” encyclopedia to align the others to. Additionally, we are comparing the disambiguation performance of conservative methods like the Lesk algorithm to more recent approaches, i.e. using language models to disambiguate senses.

A pragmatic approach to XML interoperability – the Component Metadata Infrastructure (CMDI) (2011)

Broeder, Daan ; Schonefeld, Oliver ; Trippel, Thorsten ; Van Uytvanck, Dieter ; Witt, Andreas

XML has been designed for creating structured documents, but the information that is encoded in these structures are, by definition, out of scope for XML. Additional sources, normally not easily interpretable by computers, such as documentation are needed to determine the intention of specific tags in a tag-set. The Component Metadata Infrastructure (CMDI) takes a rather pragmatic approach to foster interoperability between XML instances in the domain of metadata descriptions for language resources. This paper gives an overview of this approach.

Digital research infrastructure (2021)

Stührenberg, Maik ; Schonefeld, Oliver ; Witt, Andreas

Digital research infrastructures can be divided into four categories: large equipment, IT infrastructure, social infrastructure, and information infrastructure. Modern research institutions often employ both IT infrastructure and information infrastructure, such as databases or large-scale research data. In addition, information infrastructure depends to some extent on IT infrastructure. In this paper, we discuss the IT, information, and legal infrastructure issues that research institutions face.

Twenty-two historical encyclopedias encoded in TEI: a new resource for the Digital Humanities (2020)

Hagen, Thora ; Ketzan, Erik ; Jannidis, Fotis ; Witt, Andreas

This paper accompanies the corpus publication of EncycNet, a novel XML/TEI annotated corpus of 22 historical German encyclopedias from the early 18th to early 20th century. We describe the creation and annotation of the corpus, including the rationale for its development, suggested methodology for TEI annotation, possible use cases and future work. While many well-developed annotation standards for lexical resources exist, none can adequately model the encyclopedias at hand, and we therefore suggest how the TEI Lex-0 standard may be modified with additional guidelines for the annotation of historical encyclopedias. As the digitization and annotation of historical encyclopedias are settling on TEI as the de facto standard, our methodology may inform similar projects.

Liability of CLARIN Centres as Service Providers: What Changes with the New Directive on Copyright in the Digital Single Market? (2020)

Kamocki, Paweł ; Ketzan, Erik ; Wildgans, Julia ; Witt, Andreas

Providing online repositories for language resources is one of the main activities of CLARIN centres. The legal framework regarding liability of Service Providers for content uploaded by their users has recently been modified by the new Directive on Copyright in the Digital Single Market. A new category of Service Providers, Online Content-Sharing Service Providers (OCSSPs), was added. It is subject to a complex and strict framework, including the requirement to obtain licenses from rightholders for the hosted content. This paper provides the background and effect of these changes to law and aims to initiate a debate on how CLARIN repositories should navigate this new legal landscape.

Interoperability in an Infrastructure Enabling Multidisciplinary Research: The case of CLARIN (2020)

de Jong, Franciska ; Maegaard, Bente ; Fišer, Darja ; van Uytvanck, Dieter ; Witt, Andreas

CLARIN is a European Research Infrastructure providing access to language resources and technologies for researchers in the humanities and social sciences. It supports the use and study of language data in general and aims to increase the potential for comparative research of cultural and societal phenomena across the boundaries of languages and disciplines, all in line with the European agenda for Open Science. Data infrastructures such as CLARIN have recently embarked on the emerging frameworks for the federation of infrastructural services, such as the European Open Science Cloud and the integration of services resulting from multidisciplinary collaboration in federated services for the wider domain of the social sciences and humanities (SSH). In this paper we describe the interoperability requirements that arise through the existing ambitions and the emerging frameworks. The interoperability theme will be addressed at several levels, including organisation and ecosystem, design of workflow services, data curation, performance measurement and collaboration. For each level, some concrete outcomes are described.

Privacy by Design and Language Resources (2020)

Kamocki, Paweł ; Witt, Andreas

Privacy by Design (also referred to as Data Protection by Design) is an approach in which solutions and mechanisms addressing privacy and data protection are embedded through the entire project lifecycle, from the early design stage, rather than just added as an additional layer to the final product. Formulated in the 1990 by the Privacy Commissionner of Ontario, the principle of Privacy by Design has been discussed by institutions and policymakers on both sides of the Atlantic, and mentioned already in the 1995 EU Data Protection Directive (95/46/EC). More recently, Privacy by Design was introduced as one of the requirements of the General Data Protection Regulation (GDPR), obliging data controllers to define and adopt, already at the conception phase, appropriate measures and safeguards to implement data protection principles and protect the rights of the data subject. Failing to meet this obligation may result in a hefty fine, as it was the case in the Uniontrad decision by the French Data Protection Authority (CNIL). The ambition of the proposed paper is to analyse the practical meaning of Privacy by Design in the context of Language Resources, and propose measures and safeguards that can be implemented by the community to ensure respect of this principle.

Beyond Citations: Corpus-based Methods for Detecting the Impact of Research Outcomes on Society (2020)

Rezapour, Rezvaneh ; Bopp, Jutta ; Fiedler, Norman ; Steffen, Diana ; Witt, Andreas ; Diesner, Jana

This paper proposes, implements and evaluates a novel, corpus-based approach for identifying categories indicative of the impact of research via a deductive (top-down, from theory to data) and an inductive (bottom-up, from data to theory) approach. The resulting categorization schemes differ in substance. Research outcomes are typically assessed by using bibliometric methods, such as citation counts and patterns, or alternative metrics, such as references to research in the media. Shortcomings with these methods are their inability to identify impact of research beyond academia (bibliometrics) and considering text-based impact indicators beyond those that capture attention (altmetrics). We address these limitations by leveraging a mixed-methods approach for eliciting impact categories from experts, project personnel (deductive) and texts (inductive). Using these categories, we label a corpus of project reports per category schema, and apply supervised machine learning to infer these categories from project reports. The classification results show that we can predict deductively and inductively derived impact categories with 76.39% and 78.81% accuracy (F1-score), respectively. Our approach can complement solutions from bibliometrics and scientometrics for assessing the impact of research and studying the scope and types of advancements transferred from academia to society.

The DRuKoLA project (2019)

Kupietz, Marc ; Cosma, Ruxandra ; Witt, Andreas

DRuKoLA, the accompanying project in the making of the Corpus of Romanian Language, is a cooperation between German and Romanian computer scientists, corpus linguists and linguists, aiming at linking reference corpora of European languages under one corpus analysis tool able to manage big data. KorAP, the analysis tool developed at the Leibniz Institute for the German Language (Mannheim), is being tailored for the Romanian language in a first attempt to reunite reference corpora under the EuReCo initiative, detailed in this paper. The paper describes the necessary steps of harmonization within KorAP and the corpus of Romanian language and discusses, as one important goal of this project, criteria and ways to build virtual comparable corpora to be used for contrastive linguistic analyses.

Dealing with overlapping structures (2019)

Witt, Andreas ; Stegmann, Beeke ; Paulsen, Robert ; Haugen, Odd Einar

Modeling and annotating complex data structures (2019)

Bański, Piotr ; Witt, Andreas

Toward a CLARIN Data Protection Code of Conduct (2018)

Kamocki, Pawel ; Ketzan, Erik ; Wildgans, Julia ; Witt, Andreas

This abstract discusses the possibility to adopt a CLARIN Data Protection Code of Conduct pursuant art. 40 of the General Data Protection Regulation. Such a code of conduct would have important benefits for the entire language research community. The final section of this abstract proposes a roadmap to the CLARIN Data Protection Code of Conduct, listing various stages of its drafting and approval procedures.

New exceptions for Text and Data Mining and their possible impact on the CLARIN infrastructure (2018)

Kamocki, Pawel ; Ketzan, Erik ; Wildgans, Julia ; Witt, Andreas

The proposed paper discusses new exceptions for Text and Data Mining that have recently been adopted in some EU Member States, and probably will soon be adopted also at the EU level. These exceptions are of great significance for language scientists, as they exempt those who compile corpora from the obligation to obtain authorisation from rightholders. However, corpora compiled on the basis of such exceptions cannot be freely shared, which in a long run may have serious consequences for Open Science and the functioning of research infrastructure such as CLARIN ERIC.

Language Resources, Taxonomies and Metadata (2009)

Lemnitzer, Lothar ; Hinrichs, Erhard ; Witt, Andreas

In this paper we present an approach to faceted search in large language resource repositories. This kind of search which enables users to browse through the repository by choosing their personal sequence of facets heavily relies on the availability of descriptive metadata for the objects in the repository. This approach therefore informs the collection of a minimal set of metatdata for language resources. The work described in this paper has been funded by the EC within the ESFRI infrastructure project CLARIN.

Towards declarative descriptions of transformations: An approach based on topic maps (2002)

Lenz, Eva ; Witt, Andreas ; Storrer, Angelika

[tiger2] As a standardized serialisation for ISO 24615 - SynAF (2012)

Bosch, Sonja ; Choi, Key-Sun ; de la Clergerie, Éric Villemonte ; Fang, Alex Chengyu ; Faaß, Gertrud ; Lee, Kiyong ; Pareja-Lora, Antonio ; Romary, Laurent ; Witt, Andreas ; Zeldes, Amir ; Zipser, Forian

This paper presents the application of the <tiger2/> format to various linguistic scenarios with the aim of making it the standard serialisation for the ISO 24615 [1] (SynAF) standard. After outlining the main characteristics of both the SynAF metamodel and the <tiger2/> format, as extended from the initial Tiger XML format [2], we show through a range of different language families how <tiger2/> covers a variety of constituency and dependency based analyses.

Data Formats for Phonological Corpora (2013)

Romary, Laurent ; Witt, Andreas

The goal of the present chapter is to explore the possibility of providing the research (but also the industrial) community that commonly uses spoken corpora with a stable portfolio of well-documented standardized formats that allow a high reuse rate of annotated spoken resources and, as a consequence, better interoperability across tools used to produce or exploit such resources.

TEI-based XML-Applications: Transcriptions (1998)

Witt, Andreas

Introducing the CLARIN knowledge centre for linguistic diversity and language documentation (2018)

Hedeland, Hanna ; Lehmberg, Timm ; Rau, Felix ; Salffner, Sophie ; Seyfeddinipur, Mandana ; Witt, Andreas

The European digital research infrastructure CLARIN (Common Language Resources and Technology Infrastructure) is building a Knowledge Sharing Infrastructure (KSI) to ensure that existing knowledge and expertise is easily available both for the CLARIN community and for the humanities research communities for which CLARIN is being developed. Within the Knowledge Sharing Infrastructure, so called Knowledge Centres comprise one or more physical institutions with particular expertise in certain areas and are committed to providing their expertise in the form of reliable knowledge-sharing services. In this paper, we present the ninth K Centre – the CLARIN Knowledge Centre for Linguistic Diversity and Language Documentation (CKLD) – and the expertise and services provided by the member institutions at the Universities of London (ELAR/SWLI), Cologne (DCH/IfDH/IfL) and Hamburg (HZSK/INEL). The centre offers information on current best practices, available resources and tools, and gives advice on technological and methodological matters for researchers working within relevant fields.

The German reference corpus DeReKo: new developments – new opportunities (2018)

Kupietz, Marc ; Lüngen, Harald ; Kamocki, Paweł ; Witt, Andreas

This paper discusses current trends in DeReKo, the German Reference Corpus, concerning legal issues around the recent German copyright reform with positive implications for corpus building and corpus linguistics in general, recent corpus extensions in the genres of popular magazines, journals, historical texts, and web-based football reports. Besides, DeReKo is finally accessible via the new corpus research platform KorAP, offering registered users several news features in comparison with its predecessor COSMAS II.

Interrelating Treebanks with Language-Specific Descriptions of Information Structure (2004)

Storbeck, Daniel ; Kwon, Sanghee ; Sasaki, Felix ; Witt, Andreas

The motivation for this article is to describe a methodology for interrelating and analyzing language and theory-specific corpus data from various languages. As an example phenomeon we use information structure (IS, see [3]) in treebanks from three languages: Spanish, Korean and Japanese. Korean and Japanese are typologically close, while both are typologically different from Spanish. Therefore, the problem of annotating IS is that there are diverging language-specific formal linguistic means for the realization of IS-functions (like “topicalization / contrast”) on various levels like prosody, morphology and word-order. Hence, it is necessary to describe the relations between language-specific formal means and functional views on IS, and how to operationalize these relations for corpus analysis.

Representing human and machine dictionaries in markup languages (SGML, XML) (2013)

Lemnitzer, Lothar ; Romary, Laurent ; Witt, Andreas

Bridging standards development and infrastructure usage by means of concept graphs: the liaison of CLARIN and ISO TC37SC4 in practice (2018)

Suchowolec, Karolina ; Bański, Piotr ; Witt, Andreas

The present submission reports on a pilot project conducted at the Institute for the German Language (IDS), aiming at strengthening the connection between ISO TC37SC4 “Language Resource Management” and the CLARIN infrastructure. In terminology management, attempts have recently been made to use graph-theoretical analyses to get a better understanding of the structure of terminology resources. The project described here aims at applying some of these methods to potentially incomplete concept fields produced over years by numerous researchers serving as experts and editors of ISO standards. The main results of the project are twofold. On the one hand, they comprise concept networks dynamically generated from a relational database and browsable by the user. On the other, the project has yielded significant qualitative feedback that will be offered to ISO. We provide the institutional context of this endeavour, its theoretical background, and an overview of data preparation and tools used. Finally, we discuss the results and illustrate some of them.

Impact of scientific research beyond academia: an alternative classification schema (2018)

Witt, Andreas ; Diesner, Jana ; Steffen, Diana ; Rezapour, Rezvaneh ; Bopp, Jutta ; Fiedler, Norman ; Köller, Christoph ; Raster, Manu ; Wockenfuß, Jennifer

The actual or anticipated impact of research projects can be documented in scientific publications and project reports. While project reports are available at varying level of accessibility, they might be rarely used or shared outside of academia. Moreover, a connection between outcomes of actual research project and potential secondary use might not be explicated in a project report. This paper outlines two methods for classifying and extracting the impact of publicly funded research projects. The first method is concerned with identifying impact categories and assigning these categories to research projects and their reports by extension by using subject matter experts; not considering the content of research reports. This process resulted in a classification schema that we describe in this paper. With the second method which is still work in progress, impact categories are extracted from the actual text data.

German history-digital: A platform for transnational historical knowledge co-creation (2017)

Hiebert, Matthew ; Lässig, Simone ; Witt, Andreas

The German Historical Institute Washington (GHI) is in the development phase of German History Digital (GH-D), a transatlantic digital initiative to meet the scholarly needs of historians and their students facing new historiographical and technological challenges. In the proposed paper we will discuss the research goals, methodology, prototyping, and development strategy of GH-D as infrastructure to facilitate transnational historical knowledge co-creation for the large community of researchers and students already relying on digital resources of the GHI and for the growing constituency of citizen scholars.

WorldViews: Access to international textbooks for digital humanities researchers (2017)

Hennicke, Steffen ; Stahn, Lena-Luise ; De Luca, Ernesto William ; Schwedes, Kerstin ; Witt, Andreas

This paper introduces the field of international textbook research and discusses how the WorldViews project is working towards enhanced access to textbook resources for digital humanities research.

EuReCo - Joining Forces for a European Reference Corpus as a sustainable base for cross-linguistic research (2017)

Kupietz, Marc ; Witt, Andreas ; Bański, Piotr ; Tufiş, Dan ; Cristea, Dan ; Váradi, Tamás

In this paper we discuss the opportunities, prerequisites, possible applications and implications of a virtually joint corpus based on existing national, reference or other large corpora and their host institutions.

Sustainability of Linguistic Resources Revisited (2010)

Rehm, Georg ; Schonefeld, Oliver ; Trippel, Thorsten ; Witt, Andreas

This paper discusses work on the sustainability of linguistic resources as it was conducted in various projects, including the work of a three year project Sustainability of Linguistic Resources which finished in December 2008, a follow-up project, Sustainable linguistic data, and initiatives related to the work of the International Organization of Standardization (ISO) on developing standards for linguistic resources. The individual projects have been conducted at German collaborative research centres at the Universities of Potsdam, Hamburg and Tübingen, where the sustainability work was coordinated.

Freestyle Markup Language. Specification of an intuitive, powerful, polyhierarchical new extensible markup language (2010)

Pondorf, Denis ; Witt, Andreas

This paper provides a new generation of a markup language by introducing the Freestyle Markup Language (FML). Demands placed on the language are elaborated, considering current standards and discussions. Conception, a grammatical definition, a corresponding object graph and the bi-directional unambiguous transformation between these two congruent representation forms are set up. The result of this paper is a fundamental definition of a completely new markup language, consolidating many deficiency-discourses and experiences into one particular implementation concept, encouraging the evolution of markup.

TEI Feature Structures as a Representation Format for Multiple Annotation and Generic XML Documents (2009)

Stegmann, Jens ; Witt, Andreas

Feature structures are mathematical entities (rooted labeled directed acyclic graphs) that can be represented as graph displays, attribute value matrices or as XML adhering to the constraints of a specialized TEI tag set. We demonstrate that this latter ISO-standardized format can be used as an integrative storage and exchange format for sets of multiple annotation XML documents. This specific domain of application is rooted in the approach of multiple annotations, which marks a possible solution for XML-compliant markup in scenarios with conflicting annotation hierarchies. A more extreme proposal consists in the possible use as a meta-representation format for generic XML documents. For both scenarios our strategy concerning pertinent feature structure representations is grounded on the XDM (XQuery 1.0 and XPath 2.0 Data Model). The ubiquitous hierarchical and sequential relationships within XML documents are represented by specific features that take ordered list values. The mapping to the TEI feature structure format has been implemented in the form of an XSLT 2.0 stylesheet. It can be characterized as exploiting aspects of both the push and pull processing paradigm as appropriate. An indexing mechanism is provided with regard to the multiple annotation documents scenario. Hence, implicit links concerning identical primary data are made explicit in the result format. In comparison to alternative representations, the TEI-based format does well in many respects, since it is both integrative and well-formed XML. However, the result documents tend to grow very large depending on the size of the input documents and their respective markup structure. This may also be considered as a downside regarding the proposed use for generic XML documents. On the positive side, it may be possible to achieve a hookup to methods and applications that have been developed for feature structure representations in the fields of (computational) linguistics and knowledge representation.

A standards-related web-based information system (2012)

Stührenberg, Maik ; Schonefeld, Oliver ; Witt, Andreas

This late breaking proposal introduces the prototype of a web-based information system dealing with standards in the field of annotation.

DRuKoLA – towards contrastive German-Romanian research based on comparable corpora (2016)

Cosma, Ruxandra ; Cristea, Dan ; Kupietz, Marc ; Tufiş, Dan ; Witt, Andreas

This paper introduces the recently started DRuKoLA-project that aims at providing mechanisms to flexibly draw virtual comparable corpora from the German Reference Corpus DeReKo and the Reference Corpus of Contemporary Romanian Language CoRoLa in order to use these virtual corpora as empirical basis for contrastive linguistic research.

Corpus Query Lingua Franca (CQLF) (2016)

Bański, Piotr ; Frick, Elena ; Witt, Andreas

The present paper describes Corpus Query Lingua Franca (ISO CQLF), a specification designed at ISO Technical Committee 37 Subcommittee 4 “Language resource management” for the purpose of facilitating the comparison of properties of corpus query languages. We overview the motivation for this endeavour and present its aims and its general architecture. CQLF is intended as a multi-part specification; here, we concentrate on the basic metamodel that provides a frame that the other parts fit in.

KorAP architecture – diving in the deep sea of corpus data (2016)

Diewald, Nils ; Hanl, Michael ; Margaretha, Eliza ; Bingel, Joachim ; Kupietz, Marc ; Bański, Piotr ; Witt, Andreas

KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DEREKO for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: An architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub.

A Web-Platform for Preserving, Exploring, Visualising, and Querying Linguistic Corpora and other Resources (2008)

Rehm, Georg ; Schonefeld, Oliver ; Witt, Andreas ; Chiarcos, Christian ; Lehmberg, Timm

We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at people who work in Linguistics or Computational Linguistics: a comprehensive database of metadata records can be explored in order to find language resources that could be appropriate for one’s spe cific research needs. SPLICR also provides a graphical interface that enables users to query and to visualise corpora. The project in which the system is developed aims at sustainably archiving the ca. 60 language resources that have been constructed in three collaborative research centres. Our project has two primary goals: (a) To process and to archive sustainably the resources so that they are still available to the research community in five, ten, or even 20 years time. (b) To enable researchers to query the resources both on the level of their metadata as well as on the level of linguistic annotations. In more general terms, our goal is to enable solutions that leverage the interoperability, reusability, and sustainability of heterogeneous collec- tions of language resources.

Methods for the semantic analysis of document markup (2003)

Bayerl, Petra Saskia ; Lüngen, Harald ; Goecke, Daniela ; Witt, Andreas ; Naber, Daniel

We present an approach on how to investigate what kind of semantic information is regularly associated with the structural markup of scientific articles. This approach addresses the need for an explicit formal description of the semantics of text-oriented XML-documents. The domain of our investigation is a corpus of scientific articles from psychology and linguistics from both English and German online available journals. For our analyses, we provide XML-markup representing two kinds of semantic levels: the thematic level (i.e. topics in the text world that the article is about) and the functional or rhetorical level. Our hypothesis is that these semantic levels correlate with the articles’ document structure also represented in XML. Articles have been annotated with the appropriate information. Each of the three informational levels is modelled in a separate XML document, since in our domain, the different description levels might conflict so that it is impossible to model them within a single XML document. For comparing and mining the resulting multi-layered XML annotations of one article, a Prolog-based approach is used. It focusses on the comparison of XML markup that is distributed among different documents. Prolog predicates have been defined for inferring relations between levels of information that are modelled in separate XML documents. We demonstrate how the Prolog tool is applied in our corpus analyses.

Igel: Comparing document grammars using XQuery (2013)

Sperberg-McQueen, Christopher M. ; Schonefeld, Oliver ; Kupietz, Marc ; Lüngen, Harald ; Witt, Andreas

Igel is a small XQuery-based web application for examining a collection of document grammars; in particular, for comparing related document grammars to get a better overview of their differences and similarities. In its initial form, Igel reads only DTDs and provides only simple lists of constructs in them (elements, attributes, notations, parameter entities). Our continuing work is aimed at making Igel provide more sophisticated and useful information about document grammars and building the application into a useful tool for the analysis (and the maintenance!) of families of related document grammars

GOLD and Discourse: Domain- and Community-Specific Extensions (2005)

Goecke, Daniela ; Lüngen, Harald ; Sasaki, Felix ; Witt, Andreas ; Farrar, Scott

Integrated Linguistic Annotation Models and Their Application in the Domain of Antecedent Detection (2011)

Witt, Andreas ; Stührenberg, Maik ; Goecke, Daniela ; Metzing, Dieter

Seamless integration of various, often heterogeneous linguistic resources in terms of their output formats and a combined analysis of the respective annotation layers are crucial tasks for linguistic research. After a decade of concentration on the development of formats to structure single annotations for specific linguistic issues, in the last years a variety of specifications to store multiple annotations over the same primary data has been developed. The paper focuses on the integration of the knowledge resource logical document structure information into a text document to enhance the task of automatic anaphora resolution both for the task of candidate detection and antecedent selection. The paper investigates data structures necessary for knowledge integration and retrieval.

Different Views on Markup (2010)

Goecke, Daniela ; Lüngen, Harald ; Metzing, Dieter ; Stührenberg, Maik ; Witt, Andreas

In this chapter, two different ways of grouping information represented in document markup are examined: annotation levels, referring to conceptual levels of description, and annotation layers, referring to the technical realisation of markup using e.g. document grammars. In many current XML annotation projects, multiple levels are integrated into one layer, often leading to the problem of having to deal with overlapping hierarchies. As a solution, we propose a framework for XML-based multiple, independent XML annotation layers for one text, based on an abstract representation of XML documents with logical predicates. Two realisations of the abstract representation are presented, a Prolog fact base format together with an application architecture, and a specification for XML native databases. We conclude with a discussion of projects that have currently adopted this framework.

Open Access

Refine

Author

Year of publication

Document Type

Language

Has Fulltext

Is part of the Bibliography

Keywords

Publicationstate

Reviewstate

Publisher

99 search hits