This article examines Latgalian from the perspective of language classification. It starts by discussing relevant terms relating to sociolinguistic language types. It argues that Latgalian and its speakers show considerable similarities with many languages in Europe that are considered regional languages; hence, Latgalian should also be classified as such. In its second part, the article uses sociolinguistic data to indicate that speakers’ perceptions confirm this classification. Therefore, Latgalian should also be officially treated with the respect that other regional languages in Europe enjoy.
An interactive, dynamic electronic dictionary aimed at text production should guide the user in innovative ways, especially in respect of difficult, complicated or confusing issues. This paper proposes a design for bilingual dictionaries intended to guide users in text production; we focus on complex phenomena of the interaction between lexis and grammar. It will be argued that a dictionary aimed at guiding the user in lexical selection should implement a type of “decision algorithm”. In addition, it should flag incorrect solutions and should warn against possible wrong generalisations of (foreign) language learners. Our proposals will be illustrated with examples from several languages, as the design principles are generally applicable. The copulative construction which is regarded as the most complicated grammatical structure in Northern Sotho will be analyzed in more detail and presented as a case in point.
Between classical symbolic word sense disambiguation (WSD) using explicit deep semantic representations of sentences and texts and statistical WSD using word co-occurrence information, there is a recent tendency towards mediating methods. Similar to so-called lightweight semantics (Marek, 2009), we suggest making only sparse use of semantic information. We describe an approximation model based upon flat underspecified discourse representation structures (FUDRSs, cf. Eberle, 2004) that weighs knowledge about context structure, lexical semantic restrictions and interpretation preferences. We give a catalogue of guidelines for human annotation of texts by corresponding indicators. Using these guidelines, the reliability of an analysis tool that implements the model can be tested with respect to annotation precision and disambiguation prediction, as can the extent to which both improve when the system’s knowledge is bootstrapped from corpus information. For the balanced test corpus considered, the recognition rate of the preferred reading is 80-90% (depending on the smoothing of parse errors).
This paper aims at contributing to the analysis of overlaps in turns-at-talk from both a sequential and a multimodal perspective. Overlaps have been studied within Conversation Analysis by focusing mainly on verbal and vocal resources; taking into account multimodal resources such as gesture, bodily posture, and gaze contributes to a better understanding of participants’ orientations to the sequential organization of overlapping talk and their management of speakership. First, we introduce the way in which overlaps have been studied in Conversation Analysis, mainly by Jefferson (1973, 1983, 2004) and Schegloff (2000); then we propose possible implications of their multimodal analysis. In order to demonstrate that speakers systematically orient to the overlap onset and resolution we analyze the multimodal conduct of overlapped speakers. Findings show methodical variations in trajectories of overlap resolution: speakers’ gestures in overlap display themselves as maintaining or withdrawing their turn, thereby exhibiting the speakership achieved and negotiated during overlap.
This paper offers a detailed analysis of the opening of an international meeting. English Lingua Franca as the official language of the meeting is actively discussed and negotiated by the participants. The analysis highlights the issues identified by the participants themselves in choosing a linguistic regime for their professional exchanges. The English Lingua Franca regime is aimed at facilitating the participation of some of the participants, but it also creates problems for others. The chairman deals with this situation in an embodied way (through his gaze, gesture, bodily postures, and by the way in which he walks through the room), displaying that he orients to different member categories (such as 'anglophone', 'anglophone who can understand French', 'francophile', etc.) as benefitting from or resisting the definitive language choice.
Linguistics is facing the challenge of many other sciences as it continues to grow into increasingly complex subfields, each with its own separate or overarching branches. While linguists are certainly aware of the overall structure of the research field, they cannot follow all developments other than those of their subfields. It is thus important to help specialists and newcomers alike to bushwhack through evolved or unknown territory of linguistic data. A considerable amount of research data in linguistics is described with metadata. While studies described and published in archived journals and conference proceedings receive a quite homogeneous set of metadata tags (e.g., author, title, publisher), this does not hold for the empirical data and analyses that underlie such studies. Moreover, lexicons, grammars, experimental data, and other types of resources come in different forms; and to make things worse, their description in terms of metadata is also not uniform, if it exists at all. These problems are well known, and there are now a number of international initiatives (e.g., CLARIN, FlareNet, MetaNet, DARIAH) to build infrastructures for managing linguistic resources. The NaLiDa project, funded by the German Research Foundation, aims at facilitating the management of and access to linguistic resources originating from German research institutions. In cooperation with the German SFB 833 research center, we are developing a combination of faceted and full-text search to give integrated access across heterogeneous metadata sets. Our approach is supported by a central registry for metadata field descriptors, and a component repository for structured groups of data categories as larger building blocks.
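The combination of faceted and full-text search over heterogeneous metadata described in this abstract can be illustrated with a minimal sketch. It is not the NaLiDa implementation: the records, the field mapping `FIELD_MAP`, and the functions `normalise`, `search`, and `facet_counts` are all hypothetical names invented for illustration, and a real system would presumably rely on a search engine rather than in-memory lists.

```python
from collections import Counter

# Hypothetical records whose metadata fields are not uniform (illustrative only).
RECORDS = [
    {"title": "Spoken dialogue corpus", "type": "corpus", "language": "German"},
    {"title": "Valency lexicon", "type": "lexicon", "language": "German"},
    {"title": "Learner corpus", "type": "corpus", "lang": "English"},
]

# Hypothetical mapping from local field names to shared facet names.
FIELD_MAP = {"lang": "language"}


def normalise(record):
    """Rename local field names to the shared facet names."""
    return {FIELD_MAP.get(key, key): value for key, value in record.items()}


def search(records, query="", **facets):
    """Return normalised records that contain the query string and match every facet."""
    hits = []
    for record in map(normalise, records):
        text = " ".join(str(value) for value in record.values()).lower()
        matches_text = query.lower() in text
        matches_facets = all(record.get(k) == v for k, v in facets.items())
        if matches_text and matches_facets:
            hits.append(record)
    return hits


def facet_counts(records, facet):
    """Count how many records carry each value of the given facet."""
    return Counter(r[facet] for r in records if facet in r)


if __name__ == "__main__":
    corpora = search(RECORDS, "corpus")
    print(facet_counts(corpora, "language"))
    print(search(RECORDS, "corpus", language="German"))
```

Normalising field names before matching is what lets a single facet such as `language` span records that were described with different local schemas.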
This paper uses a devil’s advocate position to highlight the benefits of metadata creation for linguistic resources. It provides an overview of the required metadata infrastructure and shows that this infrastructure has meanwhile been developed by various projects and hence can be deployed by those working with linguistic resources and archiving. Possible caveats of metadata creation are discussed, starting with user requirements and backgrounds, the contribution to researchers’ academic merit, and standardisation. These are answered with existing technologies and procedures, referring to the Component Metadata Infrastructure (CMDI). CMDI provides an infrastructure and methods for adapting metadata to the requirements of specific classes of resources, using central registries for data categories and metadata schemas. These registries allow for the definition of metadata schemas per resource type while reusing groups of data categories also used by other schemas. In summary, rules of best practice for the creation of metadata are given.
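The idea of defining one schema per resource type while reusing shared groups of data categories can be sketched as follows. This is only an illustration of the reuse principle, not the CMDI format or its registries: the component names, the profile names, and the helpers `profile_fields` and `missing_fields` are hypothetical.

```python
# Hypothetical reusable components: named groups of data categories.
COMPONENTS = {
    "GeneralInfo": ["title", "creator", "date"],
    "Recording": ["duration", "audio_format"],
    "LexiconInfo": ["entry_count", "languages"],
}

# Hypothetical profiles, one per resource type, assembled from shared components.
PROFILES = {
    "speech_corpus": ["GeneralInfo", "Recording"],
    "lexicon": ["GeneralInfo", "LexiconInfo"],
}


def profile_fields(profile):
    """Expand a profile into the flat list of fields its components define."""
    return [field for comp in PROFILES[profile] for field in COMPONENTS[comp]]


def missing_fields(profile, record):
    """List the fields the profile expects that the record does not provide."""
    return [field for field in profile_fields(profile) if field not in record]


if __name__ == "__main__":
    record = {"title": "Map task recordings", "creator": "A. N. Other", "duration": "1h"}
    print(profile_fields("lexicon"))
    print(missing_fields("speech_corpus", record))
```

Because both profiles share `GeneralInfo`, a change to that component would propagate to every resource type that uses it, which is the practical benefit of component reuse.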
If one wants to describe the content of heterogeneous research data by means of metadata, bibliographic information alone is not sufficient. Rather, additional descriptive means are needed that do justice to the nature and complexity of the research resources in question. Different kinds of research data require different metadata profiles, which are defined by means of shared components. Such research data can be collected (e.g. via OAI-PMH harvesting) and explored through a uniform interface by means of faceted search. The application context described can be generalised beyond linguistic data.
XML has been designed for creating structured documents, but the information that is encoded in these structures is, by definition, out of scope for XML. Additional sources that are normally not easily interpretable by computers, such as documentation, are needed to determine the intention of specific tags in a tag set. The Component Metadata Infrastructure (CMDI) takes a rather pragmatic approach to foster interoperability between XML instances in the domain of metadata descriptions for language resources. This paper gives an overview of this approach.
Mechanism-based thinking on policy diffusion. A review of current approaches in political science
(2011)
Despite theoretical and methodological progress in what has been termed the third generation of diffusion studies, explicitly dealing with the causal mechanisms underlying diffusion processes and comparatively analyzing them is only of recent date. As a matter of fact, diffusion research has ended up in a diverse and often unconnected array of theoretical assumptions relying on both rational and constructivist reasoning, a circumstance calling for more theoretical coherence and consistency. Against this backdrop, this paper reviews and streamlines diffusion literature in political science. Diffusion mechanisms largely cluster around two causal arguments determining the desires and preferences of actors for choosing alternative policies. First, existing accounts of diffusion mechanisms can be grouped according to the rationality for policy adoption; that is, government behavior is based either on the instrumental considerations of actors or on constructivist arguments such as norms and rule-driven actors. Second, diffusion mechanisms can either directly impact on the beliefs of actors or influence the structural conditions for decision-making. Following this logic, four basic diffusion mechanisms can be identified in mechanism-based thinking on policy diffusion: emulation, socialization, learning, and externalities.
This paper demonstrates systematic cross-linguistic differences in the electrophysiological correlates of conflicts between form and meaning (“semantic reversal anomalies”). These engender P600 effects in English and Dutch (e.g. Kolk et al., 2003, Kuperberg et al., 2003), but a biphasic N400 – late positivity pattern in German (Schlesewsky and Bornkessel-Schlesewsky, 2009), and monophasic N400 effects in Turkish (Experiment 1) and Mandarin Chinese (Experiment 2). Experiment 3 revealed that, in Icelandic, semantic reversal anomalies show the English pattern with verbs requiring a position-based identification of argument roles, but the German pattern with verbs requiring a case-based identification of argument roles. The overall pattern of results reveals two separate dimensions of cross-linguistic variation: (i) the presence vs. absence of an N400, which we attribute to cross-linguistic differences with regard to the sequence-dependence of the form-to-meaning mapping and (ii) the presence vs. absence of a late positivity, which we interpret as an instance of a categorisation-related late P300, and which is observable when the language under consideration allows for a binary well-formedness categorisation of reversal anomalies. We conclude that, rather than reflecting linguistic domains such as syntax and semantics, the late positivity vs. N400 distinction is better understood in terms of the strategies that serve to optimise the form-to-meaning mapping in a given language.
This paper discusses the technological and methodological challenges in creating and sharing HAMATAC, the Hamburg Map Task Corpus. The first version of the corpus, consisting of 24 recordings with orthographic transcriptions and metadata, is publicly available. A second version featuring different types of linguistic annotation is in progress. I will describe how the various software tools and data formats of the EXMARaLDA system were used for transcription and multi-level annotation, to compile recordings and transcriptions into a corpus and manage metadata, to publish the corpus, and how they can be used for carrying out corpus queries (KWIC) and analyses. Some recurrent issues in corpus building and sharing and the interaction of technological and methodological aspects will be illustrated using HAMATAC.
Only recently have tourism text types, text genres that play a very important role in specialised communication, become an object of interest for linguistic studies. The article presents the results of a contrastive (German-Italian) analysis of tourist catalogues from a microstructural point of view, with particular attention to syntax, lexis, and the most frequently used stylistic devices. The investigation shows that catalogues are a text type that lends itself to multiple applications in both teaching and lexicography.
The planning of a dictionary should consider both theoretical and empirical aspects, for its macrostructure as well as its microstructure; this is also true of Online Specialized Dictionaries of Linguistics. In particular, the microstructure should be standardized and structured so as to fit the primary and secondary functions of a dictionary. Unfortunately, empirical studies that investigate Online Specialized Dictionaries of Linguistics are rare, making it unclear which microstructural elements are obligatory and which are optional. This article presents and comments upon the results of an investigation into a corpus of Online Specialized Dictionaries of Linguistics, focusing on these aspects and on the most important theoretical issues. The article ends with an example taken from DIL, a German-Italian Online Dictionary of Linguistics.
DIL is a German-Italian online specialised dictionary of linguistics. It is an open dictionary, and this contribution makes the case for possible collaboration. DIL is still under construction; at present only the section DaF (German as a foreign language) has been published in full, although other sections are in preparation. The section LEX (lexicography), which is due to be published, is presented together with the most important features of the dictionary.
In this paper, we explore different linguistic structures encoded as convolution kernels for the detection of subjective expressions. The advantage of convolution kernels is that complex structures can be directly provided to a classifier without deriving explicit features. The feature design for the detection of subjective expressions is fairly difficult and there currently exists no commonly accepted feature set. We consider various structures, such as constituency parse structures, dependency parse structures, and predicate-argument structures. In order to generalize from lexical information, we additionally augment these structures with clustering information and the task-specific knowledge of subjective words. The convolution kernels will be compared with a standard vector kernel.
In order to automatically extract opinion holders, we propose to harness the contexts of prototypical opinion holders, i.e. common nouns, such as experts or analysts, that describe particular groups of people whose profession or occupation is to form and express opinions towards specific items. We assess their effectiveness in supervised learning, where these contexts are regarded as labelled training data, and in rule-based classification, which uses predicates that frequently co-occur with mentions of the prototypical opinion holders. Finally, we also examine to what extent knowledge gained from these contexts can compensate for the lack of large amounts of labelled training data in supervised learning by considering various amounts of actually labelled training sets.
In this paper, we investigate the role of predicates in opinion holder extraction. We will examine the shape of these predicates, investigate what relationship they bear towards opinion holders, determine what resources are potentially useful for acquiring them, and point out limitations of an opinion holder extraction system based on these predicates. For this study, we will carry out an evaluation on a corpus annotated with opinion holders. Our insights are, in particular, important for situations in which no labelled training data are available and only rule-based methods can be applied.
We introduce a system that learns the participants of arbitrary given scripts. This system processes data from web experiments, in which each participant can be realized with different expressions. It computes participants by encoding semantic similarity and global structural information into an Integer Linear Program. An evaluation against a gold standard shows that we significantly outperform two informed baselines.
Semantic argument structures are often incomplete in that core arguments are not locally instantiated. However, many of these implicit arguments can be linked to referents in the wider context. In this paper we explore a number of linguistically motivated strategies for identifying and resolving such null instantiations (NIs). We show that a more sophisticated model for identifying definite NIs can lead to noticeable performance gains over the state of the art for NI resolution.
Active Learning (AL) has been proposed as a technique to reduce the amount of annotated data needed in the context of supervised classification. While various simulation studies for a number of NLP tasks have shown that AL works well on gold-standard data, there is some doubt whether the approach can be successful when applied to noisy, real-world data sets. This paper presents a thorough evaluation of the impact of annotation noise on AL and shows that systematic noise resulting from biased coder decisions can seriously harm the AL process. We present a method to filter out inconsistent annotations during AL and show that this makes AL far more robust when applied to noisy data.
Prominence has been widely studied on the word level and the syllable level. An extensive study comparing the two approaches is missing in the literature. This study investigates how word and syllable prominence relate to each other in German. We find that perceptual ratings based on the word level are more extreme than those based on the syllable level. The correlations between word prominence and acoustic features are greater than the correlations between syllable prominence and acoustic features.
Streefkerk defines prominence as the perceptually outstanding parts in spoken language. An optimal rating scale for syllable prominence has not yet been found. This paper evaluates a 4-point, an 11-point, a 31-point, and a continuous scale for the rating of syllable prominence and gives support for scales using a higher number of levels. Priming effects found by Arnold et al. could only be replicated using the 31-point scale.
In this contribution we deal with aspects of the textual use of possessives in German, Polish, and Hungarian, which we seek to explain on the basis of their respective systems of forms and the embedding of these in the corresponding language system. The focus is on possessives with anaphoric reference, which in German, Polish, and Hungarian texts concern third-person possessives. We address three questions in particular: (i) Which formal distinctions do German, Polish, and Hungarian draw on in the use of possessives in order to enable identification of the correct antecedent in the text? (ii) How can the respective choices of form in the contexts concerned be explained? (iii) What textual effect is achieved by the choice of the respective forms? These questions are answered on the basis of empirical parallel-text analyses.
Editorial
(2011)
Relational adjectives, i.e. adjectives that are derived from nouns and that, in attributive construction with a head noun, express an unspecific relation between the concept of the head and the concept of the base, play an important role in the classical languages. Starting from the silvestris musa, Virgil’s woodland muse, this contribution traces the after-effects of this pattern in European languages: French, English, and above all German. The semantic function of such adjectives is assigned to the functional domain of ‘classificatory modification’. Cross-linguistic commonalities and differences are worked out. Relational adjectives in Polish and Hungarian, the further comparison languages of the project “Grammatik des Deutschen im europäischen Vergleich” (Grammar of German in European Comparison), are also briefly included. Against this background, the question of the relationship between universal, language-family-specific, areal, and language-specific properties of the construction pattern, as well as of the degree of Latin influence, can be formulated more precisely.
How to propose an action as an objective necessity. The case of Polish trzeba x (‘one needs to x’)
(2011)
The present study demonstrates that language-specific grammatical resources can afford speakers language-specific ways of organizing cooperative practical action. On the basis of video recordings of Polish families in their homes, we describe action affordances of the Polish impersonal modal declarative construction trzeba x (“one needs to x”) in the accomplishment of everyday domestic activities, such as cutting bread, bringing recalcitrant children back to the dinner table, or making phone calls. Trzeba-x turns in first position are regularly chosen by speakers to point to a possible action as an evident necessity for the furthering of some broader ongoing activity. Such turns in first position provide an environment in which recipients can enact shared responsibility by actively involving themselves in the relevant action. Treating the necessity as not restricted to any particular subject, aligning responsive actions are oriented to when the relevant action will be done, not whether it will be done. We show that such sequences are absent from English interactions by analyzing (a) grammatically similar turn formats in English interaction (“we need to x,” “the x needs to y”), and (b) similar interactive environments in English interactions. We discuss the potential of this research to point to a new avenue for researchers interested in the relationship between language diversity and diversity in human action and cognition.