Refine
Year of publication
- 2012 (10) (remove)
Document Type
- Article (5)
- Conference Proceeding (5)
Language
- English (10)
Has Fulltext
- yes (10)
Is part of the Bibliography
- no (10)
Keywords
- Standardisierung (5)
- Computerlinguistik (3)
- Korpus <Linguistik> (3)
- Annotation (2)
- Deutsch (2)
- Korpuslinguistik (2)
- Auszeichnungssprache (1)
- Best-Practice (1)
- Computerunterstützte Lexikographie (1)
- Datenverarbeitung (1)
Publicationstate
Reviewstate
Publisher
This paper presents the application of the <tiger2/> format to various linguistic scenarios with the aim of making it the standard serialisation for the ISO 24615 [1] (SynAF) standard. After outlining the main characteristics of both the SynAF metamodel and the <tiger2/> format, as extended from the initial Tiger XML format [2], we show through a range of different language families how <tiger2/> covers a variety of constituency and dependency based analyses.
The paper presents an XML schema for the representation of genres of computer-mediated communication (CMC) that is compliant with the encoding framework defined by the TEI. It was designed for the annotation of CMC documents in the project Deutsches Referenzkorpus zur internetbasierten Kommunikation (DeRiK), which aims at building a corpus on language use in the most popular CMC genres on the German-speaking Internet. The focus of the schema is on those CMC genres which are written and dialogic―such as forums, bulletin boards, chats, instant messaging, wiki and weblog discussions, microblogging on Twitter, and conversation on “social network” sites.
The schema provides a representation format for the main structural features of CMC discourse as well as elements for the annotation of those units regarded as “typical” for language use on the Internet. The schema introduces an element <posting>, which describes stretches of text that are sent to the server by a user at a certain point in time. Postings are the main constituting elements of threads and logfiles, which, in our schema, are the two main types of CMC macrostructures. For the microlevel of CMC documents (that is, the structure of the <posting> content), the schema introduces elements for selected features of Internet jargon such as emoticons, interaction words and addressing terms. It allows for easy anonymization of CMC data for purposes in which the annotated data are made publicly available and includes metadata which are necessary for referencing random excerpts from the data as references in dictionary entries or as results of corpus queries.
Documentation of the schema as well as encoding examples can be retrieved from the web at http://www.empirikom.net/bin/view/Themen/CmcTEI. The schema is meant to be a core model for representing CMC that can be modified and extended by others according to their own specific perspectives on CMC data. It could be a first step towards an integration of features for the representation of CMC genres into a future new version of the TEI Guidelines.
Our paper outlines a proposal for the consistent modeling of heterogeneous lexical structures in semasiological dictionaries, based on the element structures described in detail in chapter 9 (Dictionaries) of the TEI Guidelines. The core of our proposal describes a system of relatively autonomous lexical “crystals” that can, within the constraints of the relevant element’s definition, be combined to form complex structures for the description of morphological form, grammatical information, etymology, word-formation, and meaning for a lexical structure.
The encoding structures we suggest guarantee sustainability and support re-usability and interoperability of data. This paper presents case studies of encoding dictionary entries in order to illustrate our concepts and test their usability.
We comment on encoding issues involving <entry>, <form>, <etym>, and on refinements to the internal content of <sense>.
Although most of the relevant dictionary productions of the recent past have relied on digital data and methods, there is little consensus on formats and standards. The Institute for Corpus Linguistics and Text Technology (ICLTT) of the Austrian Academy of Sciences has been conducting a number of varied lexicographic projects, both digitising print dictionaries and working on the creation of genuinely digital lexicographic data. This data was designed to serve varying purposes: machine-readability was only one. A second goal was interoperability with digital NLP tools. To achieve this end, a uniform encoding system applicable across all the projects was developed. The paper describes the constraints imposed on the content models of the various elements of the TEI dictionary module and provides arguments in favour of TEI P5 as an encoding system not only being used to represent digitised print dictionaries but also for NLP purposes.
The present contribution addresses an infrastructural issue of universal relevance, addressed in the specific context of the TEI. We describe a combination of open-source tools and an open-access approach to creating knowledge repositories that have been employed in building a bibliographic reference library for the “TEI for Linguists” special interest group (LingSIG). The authors argue that, for an initiative such as the TEI, it is important to choose open, freely available solutions. If these solutions have the advantage of attracting new users and promoting the initiative itself, so much the better, especially if it is done in a non-committal way: no one using the LingSIG bibliographic repository has to be a member of the LingSIG or a “TEI-er” in general.
The present article describes the first stage of the KorAP project, launched recently at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany. The aim of this project is to develop an innovative corpus analysis platform to tackle the increasing demands of modern linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse primary data and annotations in the petabyte range, while at the same time allowing an undistorted view of the primary linguistic data, and thus fully satisfying the demands of a scientific tool. An additional important aim of the project is to make corpus data as openly accessible as possible in light of unavoidable legal restrictions, for instance through support for distributed virtual corpora, user-defined annotations and adaptable user interfaces, as well as interfaces and sandboxes for user-supplied analysis applications. We discuss our motivation for undertaking this endeavour and the challenges that face it. Next, we outline our software implementation plan and describe development to-date.
Research today is often performed in collaborated projects composed of project partners with different backgrounds and from different institutions and countries. Standards can be a crucial tool to help harmonizing these differences and to create sustainable resources. However, choosing a standard depends on having enough information to evaluate and compare different annotation and metadata formats. In this paper we present ongoing work on an interactive, collaborative website that collects information on standards in the field of linguistics as a means to guide interested researchers.
This document presents ongoing work related to spoken language data within a project that aims to establish a common and unified infrastructure for the sustainable provision of linguistic primary research data at the Institut für Deutsche Sprache (IDS). In furtherance of its mission to “document the German language as it is currently used”, the project expects to enable the research community to access a broad empirical base of working material via a single platform. While the goal is to eventually cover all linguistically relevant digital resources of the IDS, including lexicographic information systems such as the IDS German Vocabulary Portal, OWID, written language corpora such as the IDS German Reference Corpus, DeReKo, and spoken language corpora such as the IDS German Speech Corpus for Research and Teaching, FOLK, the work presented here predominantly focuses on the latter type of data, i.e. speech corpora. Within this context, the present document pictures the project’s contributions to the development of standards and best practice guidelines concerning data storage, process documentation and legal issues for the sustainable preservation and long-term accessibility of primary linguistic research data.