Refine
Year of publication
Document Type
- Conference Proceeding (31)
- Article (8)
- Part of a Book (5)
- Other (5)
- Working Paper (4)
Has Fulltext
- yes (53)
Keywords
- Forschungsdaten (29)
- Datenmanagement (23)
- Infrastruktur (20)
- Metadaten (20)
- Forschung (13)
- Computerlinguistik (12)
- Korpus <Linguistik> (9)
- CLARIN (8)
- Digital Humanities (6)
- Standardisierung (6)
Publicationstate
- Veröffentlichungsversion (43)
- Zweitveröffentlichung (7)
- Postprint (6)
Reviewstate
- Peer-Review (38)
- (Verlags)-Lektorat (8)
Publisher
This paper proposes a methodology for querying linguistic data represented in different corpus formats. Examples of the need for queries over such heterogeneous resources are the corpus-based analysis of multimodal phenomena like the interaction of gestures and prosodic features, or syntax-related phenomena like information structure which exceed the expressive power of a tree-centered corpus format. Query languages (QLs) currently under development are strongly connected to corpus formats, like the NITE Object Model (NOM, Carletta et al., 2003) or the Meta-Annotation Infrastructure for ATLAS (MAIA, Laprun and Fiscus, 2002). The parallel development of linguistic query languages and corpus formats is due to the fact that general purpose query languages like XQuery (Boag et al., 2003) do not fulfill the changing needs of linguistically motivated queries, e.g. to give access to (non-)hierarchically organized, theory and language dependent annotations of multi modal signals and/or text. This leads to the problem that existing corpus formats and query languages are hard to reuse. They have to be re developed and re-implemented time-consumingly and expensively for unforeseen tasks. This paper describes an approach for overcoming these problems and a sample application.
In this contribution we present some work of the R&D European project “LIRICS” and of the ISO/TC 37/SC 4 committee related to the topic of interoperability and re-use of language resources. We introduce some basic mechanisms of the standardization work in ISO and describe in more details the general approach on how to cope with the annotation of language data within ISO.
Lexical resources are often represented in table form, e. g., in relational databases, or represented in specially marked up texts, for example, in document based XML models. This paper describes how it is possible to model lexical structures as graphs and how this model can be used to exploit existing lexical resources and even how different types of lexical resources can be combined.
Lexicon schemas and their use are discussed in this paper from the perspective of lexicographers and field linguists. A variety of lexicon schemas have been developed, with goals ranging from computational lexicography (DATR) through archiving (LIFT, TEI) to standardization (LMF, FSR). A number of requirements for lexicon schemas are given. The lexicon schemas are introduced and compared to each other in terms of conversion and usability for this particular user group, using a common lexicon entry and providing examples for each schema under consideration. The formats are assessed and the final recommendation is given for the potential users, namely to request standard compliance from the developers of the tools used. This paper should foster a discussion between authors of standards, lexicographers and field linguists.
Lexicography
(2008)
This paper discusses work on the sustainability of linguistic resources as it was conducted in various projects, including the work of a three year project Sustainability of Linguistic Resources which finished in December 2008, a follow-up project, Sustainable linguistic data, and initiatives related to the work of the International Organization of Standardization (ISO) on developing standards for linguistic resources. The individual projects have been conducted at German collaborative research centres at the Universities of Potsdam, Hamburg and Tübingen, where the sustainability work was coordinated.
The chapter on formats and models for lexicons deals with different available data formats of lexical resources. It elaborates on their structure and possible uses. Motivated by the restrictions in merging different lexical resources based on widely spread formalisms and international standards, a formal lexicon model for lexical resources is developed which is related to graph structures in annotations. For lexicons this model is termed the Lexicon Graph. Within this model the concepts of lexicon entries and lexical structures frequently described in the literature are formally defined and examples are given. The article addresses the problem of ambiguity in those formal terms. An implementation based on XML and XML technology such as XQuery for the defined structures is given. The relation to international standards is included as well.
Wenn man verschiedenartige Forschungsdaten über Metadaten inhaltlich beschreiben möchte, sind bibliografische Angaben allein nicht ausreichend. Vielmehr benötigt man zusätzliche Beschreibungsmittel, die der Natur und Komplexität gegebener Forschungsressourcen Rechnung tragen. Verschiedene Arten von Forschungsdaten bedürfen verschiedener Metadatenprofile, die über gemeinsame Komponenten definiert werden. Solche Forschungsdaten können gesammelt (z.B. über OAI-PMH-Harvesting) und mittels Facetten-basierter Suche über eine einheitliche Schnittstelle exploriert werden. Der beschriebene Anwendungskontext kann über sprachwissenschaftliche Daten hinaus verallgemeinert werden.
This paper uses a devil’s advocate position to highlight the benefits of metadata creation for linguistic resources. It provides an overview of the required metadata infrastructure and shows that this infrastructure is in the meantime developed by various projects and hence can be deployed by those working with linguistic resources and archiving. Possible caveats of metadata creation are mentioned starting with user requirements and backgrounds, contribution to academic merits of researchers and standardisation. These are answered with existing technologies and procedures, referring to the Component Metadata Infrastructure (CMDI). CMDI provides an infrastructure and methods for adapting metadata to the requirements of specific classes of resources, using central registries for data categories, and metadata schemas. These registries allow for the definition of metadata schemas per resource type while reusing groups of data categories also used by other schemas. In summary, rules of best practice for the creation of metadata are given.
XML has been designed for creating structured documents, but the information that is encoded in these structures are, by definition, out of scope for XML. Additional sources, normally not easily interpretable by computers, such as documentation are needed to determine the intention of specific tags in a tag-set. The Component Metadata Infrastructure (CMDI) takes a rather pragmatic approach to foster interoperability between XML instances in the domain of metadata descriptions for language resources. This paper gives an overview of this approach.