Refine
Year of publication
- 2012 (44) (remove)
Document Type
- Conference Proceeding (21)
- Article (20)
- Part of a Book (2)
- Part of Periodical (1)
Has Fulltext
- yes (44)
Is part of the Bibliography
- no (44)
Keywords
- Computerlinguistik (9)
- Metadaten (7)
- Datenmanagement (5)
- Deutsch (5)
- Infrastruktur (5)
- Korpus <Linguistik> (5)
- Forschung (4)
- Information Extraction (4)
- Multimodalität (4)
- Natürliche Sprache (4)
Publicationstate
- Veröffentlichungsversion (31)
- Zweitveröffentlichung (9)
- Postprint (4)
Reviewstate
- Peer-Review (44) (remove)
Publisher
- European Language Resources Association (8)
- European Centre for Minority Issues (3)
- Association for Computational Linguistics (2)
- International Speech Communications Association (2)
- Springer (2)
- de Gruyter (2)
- Buske (1)
- CLARIN-D (1)
- Eigenverlag ÖGAI (1)
- European Language Resources Association (ELRA) (1)
This special issue of the Journal on Ethnopolitics and Minority Issues in Europe (JEMIE) brings together some of the participants of the symposium Political and Economic Resources and Obstacles of Minority Language Maintenance organized by the Language Survival Network ‘POGA’ at Tallinn University, Estonia, in December 2010. More than 20 scholars representing linguistics, anthropology, social sciences and law participated in the symposium, to present papers and discuss questions related to minority language loss, maintenance and revitalization. The six case studies contained in this special issue look at different minorities and regions in the European Union, Russia and the US. The linguistic communities discussed are the Russian-, Võru/Seto- and Latgalian-speaking minorities of Estonia and Latvia; the Welsh- and Breton-speaking communities of the Celtic language; the Russian Finno-Ugrian people with regional autonomies; and the native American groups of the Delaware/Cherokee and the Oneida. The reader will find articles relating to interdisciplinary research approaches in and on minority languages and minority language communities.
Towards a part-of-speech ontology: encoding morphemic units of two South African Bantu languages
(2012)
This article describes the design of an electronic knowledge base, namely a morpho-syntactic database structured as an ontology of linguistic categories, containing linguistic units of two related languages of the South African Bantu group: Northern Sotho and Zulu. These languages differ significantly in their surface orthographies, but are very similar on the lexical and sub-lexical levels. It is therefore our goal to describe the morphemes of these languages in a single common database in order to outline and interpret commonalities and differences in more detail. Moreover, the relational database which is developed defines the underlying morphemic units (morphs) for both languages. It will be shown that the electronic part-of-speech ontology goes hand in hand with part-of-speech tagsets that label morphemic units. This database is designed as part of a forthcoming system providing lexicographic and linguistic knowledge on the official South African Bantu languages.
Electronic dictionaries should support dictionary users by giving them guidance in text production and text reception, alongside a user-definable offer of lexicographic data for cognitive purposes. In this article, we sketch the principles of an interactive and dynamic electronic dictionary aimed at text production and text reception guiding users in innovative ways, especially with respect to difficult, complicated or confusing issues. The lexicographer has to do a very careful analysis of the nature of the possible problems to suggest an optimal solution for a specific problem. We are of the opinion that there are numerous complex situations where users need more detailed support than currently available in e-dictionaries, enabling them to make valid and correct choices. For highly complex situations, we suggest guidance through a decision tree-like device. We assume that the solutions proposed here are not specific to one language only but can, after careful analysis, be applied to e-dictionaries in different languages across the world.
Over the past decades, problems related to linguistic minorities and their well-being, as well as to minority languages and their maintenance, have developed as an independent branch of minority studies. Studies of language in society and sociolinguistics, strategies of minority language survival and the empowerment of their speakers have produced a considerable output of case studies and theoretical writings.In this multifaceted field of investigation, language use, language practices, language policies and language politics represent interrelated aspects of social and linguistic relations that cannot be meaningfully addressed from a point of view of one scientific discipline only. This is specially the case when one wants to understand processes of language loss and maintenance, or the revitalization and empowerment of a language community. Such processes are linguistic expressions of complex social settings, and reflect group and individual identities that in turn express changing systems of collective values, human networks, fashions and social practices.
This article discusses the situation of the Latgalian language in Latvia today. It first provides an overview of languages in Latvia, followed by a historical and contemporary sketch of the societal position of Latgalian and by an account of current Latgalian language activism. On this basis, the article then applies schemes of language functions and of evaluations of the societal position of minority languages to Latgalian. Given the range of functions that Latgalian fulfils today and the wishes and attempts by activists to expand these functions, the article argues that it is surprising that so little attention is given to Latgalian in mainstream Latvian and international sociolinguistic publications. In this light, the fate of the language is difficult to prognose, but a lot depends on whether the Latvian state will clarify its own unclear perception of policies towards Latgalian and on how much attention it will receive in the future.
In this chapter, I will focus on the phenomenon of drop out, i.e., withdrawal from the turn due to overlapping talk, in order to reflect on the link between “unfinished” turns and participation framework. With the help of a sequential and multimodal analysis inspired by the conversation analytical approach, I will show that dropping out from a turn is strongly linked to the availability displayed by potential recipients of a turn-at-talk. Although conversation analysis has described in detail the systematics of overlapping talk, especially of its onset (Jefferson 1973, 1983, 1986) and its resolution (Scheg-loff 2000; Jefferson 2004), the phenomenon of withdrawal from a turn due to simultaneous talk has not been investigated in detail. While it seems to bedifficult to describe this interactional practice by referring exclusively to syntactic features (incompleteness of the turn), I suggest looking at turn withdrawal from a multimodal perspective (e.g. Goodwin 1980, 1981; Mondada2007a; Schmitt 2005), taking into account visible resources like gaze or gesture. The problem of continuing or stopping a turn-in-progress in overlapping talk can be closely linked to the participation framework (Goodwin and Goodwin 2004), as speakers do visibly take into account their recipient’s availability and coordinate their turn construction with the dynamic changes of the participation framework and the interactional space.
The current state of the art for metadata provision allows for a very flexible approach, catering for the needs of different archives and communities, referring to common data category registries that describe the meaning of a data category at least to authors of metadata. Component models for metadata provisions are for example used by CLARIN and META-SHARE, but there is also an increased flexibility in other metadata schemas such as Dublin Core, which is usually not seen as appropriate for meaningful description of language resources.
Making resources available for others and putting this to a second use in other projects has never been more widely accepted as a sensible efficient way to avoid a waste of efforts and resources. However, when it comes to the details, there is still a vast number of problems. This workshop has aimed at being a forum to address issues and challenges in the concrete work with metadata for LRs, not restricted to a single initiative for archiving LRs. It has allowed for exchange and discussion and we hope that the reader finds the articles here compiled interesting and useful.
This paper describes the ongoing work to integrate WebLicht into the CLARIN infrastructure. It introduces the CLARIN infrastructure for scholars in the humanities and social sciences as well as WebLicht - an orchestration and execution environment that is built upon Service Oriented Architecture principles. The integration of WebLicht into the CLARIN infrastructure involves adapting it to the standards and practices used within CLARIN, including distributed repositories, CMDI metadata, and persistent identifiers.
The ISOcat registry reloaded
(2012)
The linguistics community is building a metadata-based infrastructure for the description of its research data and tools. At its core is the ISOcat registry, a collaborative platform to hold a (to be standardized) set of data categories (i.e., field descriptors). Descriptors have definitions in natural language and little explicit interrelations. With the registry growing to many hundred entries, authored by many, it is becoming increasingly apparent that the rather informal definitions and their glossary-like design make it hard for users to grasp, exploit and manage the registry’s content. In this paper, we take a large subset of the ISOcat term set and reconstruct from it a tree structure following the footsteps of schema.org. Our ontological re-engineering yields a representation that gives users a hierarchical view of linguistic, metadata-related terminology. The new representation adds to the precision of all definitions by making explicit information which is only implicitly given in the ISOcat registry. It also helps uncovering and addressing potential inconsistencies in term definitions as well as gaps and redundancies in the overall ISOcat term set. The new representation can serve as a complement to the existing ISOcat model, providing additional support for authors and users in browsing, (re-)using, maintaining, and further extending the community’s terminological metadata repertoire.
Creating and maintaining metadata for various kinds of resources requires appropriate tools to assist the user. The paper presents the metadata editor ProFormA for the creation and editing of CMDI (Component Metadata Infrastructure) metadata in web forms. This editor supports a number of CMDI profiles currently being provided for different types of resources. Since the editor is based on XForms and server-side processing, users can create and modify CMDI files in their standard browser without the need for further processing. Large parts of ProFormA are implemented as web services in order to reuse them in other contexts and programs.
This paper presents the system architecture as well as the underlying workflow of the Extensible Repository System of Digital Objects (ERDO) which has been developed for the sustainable archiving of language resources within the Tübingen CLARIN-D project. In contrast to other approaches focusing on archiving experts, the described workflow can be used by researchers without required knowledge in the field of long-term storage for transferring data from their local file systems into a persistent repository.
The paper’s purpose is to give an overview of the work on the Component Metadata Infrastructure (CMDI) that was implemented in the CLARIN research infrastructure. It explains, the underlying schema, the accompanying tools and services. It also describes the status and impact of the CMDI developments done within the CLARIN project and past and future collaborations with other projects.
The Component Metadata Infrastructure (CMDI) in a project on sustainable linguistic resources
(2012)
The sustainable archiving of research data for predefined time spans has become increasingly important to researchers and is stipulated by funding organizations with the obligatory task of being observed by researchers. An important aspect in view of such a sustainable archiving of language resources is the creation of metadata, which can be used for describing, finding and citing resources. In the present paper, these aspects are dealt with from the perspectives of two projects: the German project for Sustainability of Linguistic Data at the University of Tubingen (NaLiDa, cf. http://www.sfs.uni-tuebingen.de/nalida) and the Dutch-Flemish HLT Agency hosted at the Institute for Dutch Lexicology (TST-Centrale, cf.http://www.inl.nl/tst-centrale). Both projects unfold their approaches to the creation of components and profiles using the Component Metadata Infrastructure (CMDI) as underlying metadata schema for resource descriptions, highlighting their experiences as well as advantages and disadvantages in using CMDI.
This paper describes the status of the standardization efforts of a Component Metadata approach for describing Language Resources with metadata. Different linguistic and Language & Technology communities as CLARIN, META-SHARE and NaLiDa use this component approach and see its standardization of as a matter for cooperation that has the possibility to create a large interoperable domain of joint metadata. Starting with an overview of the component metadata approach together with the related semantic interoperability tools and services as the ISOcat data category registry and the relation registry we explain the standardization plan and efforts for component metadata within ISO TC37/SC4. Finally, we present information about uptake and plans of the use of component metadata within the three mentioned linguistic and L&T communities.
In two eye-tracking experiments, we investigated the relationship between the subject preference in the resolution of subject-object ambiguities in German embedded clauses and semantic word order constraints (i.e., prominence hierarchies relating to the specificity/referentiality of noun phrases, case assignment and thematic role assignment). Our central research question concerned the timecourse with which prominence information is used and particularly whether it modulates the subject preference. In both experiments, we replicated previous findings of reanalysis effects for object-initial structures. Our findings further suggest that noun phrase prominence does not alter initial parsing strategies (viz., the subject preference), but rather modulates the ease of later reanalysis processes. In Experiment 1, the object case assigned by the verb did not affect the ease of reanalysis. However, the syntactic reanalysis was rendered more difficult when the order of the two arguments violated the specificity/referentiality hierarchy. Experiment 2 revealed that the initial subject preference also holds for verbs favoring an object-initial base order (i.e., dative object-experiencer verbs). However, the advantage for subject-initial sentences is neutralized in relatively late processing stages when the thematic role hierarchy and the specificity hierarchy converge to promote scrambling.
Knowledge Acquisition with Natural Language Processing in the Food Domain: Potential and Challenges
(2012)
In this paper, we present an outlook on the effectiveness of natural language processing (NLP) in extracting knowledge for the food domain. We identify potential scenarios that we think are particularly suitable for NLP techniques. As a source for extracting knowledge we will highlight the benefits of textual content from social media. Typical methods that we think would be suitable will be discussed. We will also address potential problems and limits that the application of NLP methods may yield.
In this paper, we describe MLSA, a publicly available multi-layered reference corpus for German-language sentiment analysis. The construction of the corpus is based on the manual annotation of 270 German-language sentences considering three different layers of granularity. The sentence-layer annotation, as the most coarse-grained annotation, focuses on aspects of objectivity, subjectivity and the overall polarity of the respective sentences. Layer 2 is concerned with polarity on the word- and phrase-level, annotating both subjective and factual language. The annotations on Layer 3 focus on the expression-level, denoting frames of private states such as objective and direct speech events. These three layers and their respective annotations are intended to be fully independent of each other. At the same time, exploring for and discovering interactions that may exist between different layers should also be possible. The reliability of the respective annotations was assessed using the average pairwise agreement and Fleiss’ multi-rater measures. We believe that MLSA is a beneficial resource for sentiment analysis research, algorithms and applications that focus on the German language.
We present a gold standard for semantic relation extraction in the food domain for German. The relation types that we address are motivated by scenarios for which IT applications present a commercial potential, such as virtual customer advice in which a virtual agent assists a customer in a supermarket in finding those products that satisfy their needs best. Moreover, we focus on those relation types that can be extracted from natural language text corpora, ideally content from the internet, such as web forums, that are easy to retrieve. A typical relation type that meets these requirements are pairs of food items that are usually consumed together. Such a relation type could be used by a virtual agent to suggest additional products available in a shop that would potentially complement the items a customer has already in their shopping cart. Our gold standard comprises structural data, i.e. relation tables, which encode relation instances. These tables are vital in order to evaluate natural language processing systems that extract those relations.
In this paper, we examine methods to automatically extract domain-specific knowledge from the food domain from unlabeled natural language text. We employ different extraction methods ranging from surface patterns to co-occurrence measures applied on different parts of a document. We show that the effectiveness of a particular method depends very much on the relation type considered and that there is no single method that works equally well for every relation type. We also examine a combination of extraction methods and also consider relationships between different relation types. The extraction methods are applied both on a domain-specific corpus and the domain-independent factual knowledge base Wikipedia. Moreover, we examine an open-domain lexical ontology for suitability.
In this paper, we compare three different generalization methods for in-domain and cross-domain opinion holder extraction being simple unsupervised word clustering, an induction method inspired by distant supervision and the usage of lexical resources. The generalization methods are incorporated into diverse classifiers. We show that generalization causes significant improvements and that the impact of improvement depends on the type of classifier and on how much training and test data differ from each other. We also address the less common case of opinion holders being realized in patient position and suggest approaches including a novel (linguistically-informed) extraction method how to detect those opinion holders without labeled training data as standard datasets contain too few instances of this type.