This paper presents three electronic collections of polarity items: (i) negative polarity items in Romanian, (ii) negative polarity items in German, and (iii) positive polarity items in German. The presented collections are a part of a linguistic resource on lexical units with highly idiosyncratic occurrence patterns. The motivation for collecting and documenting polarity items was to provide a solid empirical basis for linguistic investigations of these expressions. Our database provides general information about the collected items, specifies their syntactic properties, and describes the environment that licenses a given item. For each licensing context, examples from various corpora and the Internet are introduced. Finally, the type of polarity (negative or positive) and the class (superstrong, strong, weak or open) associated with a given item are specified. Our database is encoded in XML and is available via the Internet, offering dynamic and flexible access.
The authors present a multilingual electronic database of lexical items with idiosyncratic occurrence patterns. Currently, our database consists of: (1) a collection of 444 bound words in German; (2) a collection of 77 bound words in English; (3) a collection of 58 negative polarity items in Romanian; (4) a collection of 84 negative polarity items in German; and (5) a collection of 52 positive polarity items in German. The database is encoded in XML and is available via the Internet, offering dynamic and flexible access.
One of the most popular techniques used in HPSG-based studies to describe linguistic phenomena is the raising mechanism. Besides ordinary raising verbs or adjectives, this tool has been applied for handling verbal complexes and discontinuous constituents, among other phenomena. In this paper, a new application for raising within the HPSG paradigm will be discussed, thereby investigating data from the prepositional domain. We will analyze linguistic properties of word combinations in German consisting of a preposition, a noun, and another preposition (such as auf Grund von (‘by virtue of’)), thus arguing that raising is the most appropriate method for satisfactorily describing the crucial syntactic features which are typical for those expressions. The objective of this paper is thus to demonstrate the efficiency of the raising mechanism as used in HPSG, and therefore, to emphasize the importance of designing a satisfactory uniform theory of raising within this grammar framework.
We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at people who work in Linguistics or Computational Linguistics: a comprehensive database of metadata records can be explored in order to find language resources that could be appropriate for one’s specific research needs. SPLICR also provides a graphical interface that enables users to query and to visualise corpora. The project in which the system is developed aims at sustainably archiving the ca. 60 language resources that have been constructed in three collaborative research centres. Our project has two primary goals: (a) To process and to archive sustainably the resources so that they are still available to the research community in five, ten, or even 20 years’ time. (b) To enable researchers to query the resources both on the level of their metadata as well as on the level of linguistic annotations. In more general terms, our goal is to enable solutions that leverage the interoperability, reusability, and sustainability of heterogeneous collections of language resources.
This paper presents the results of a joint effort of a group of multimodality researchers and tool developers to improve the interoperability between several tools used for the annotation and analysis of multimodality. Each of the tools has specific strengths so that a variety of different tools, working on the same data, can be desirable for project work. However, this usually requires tedious conversion between formats. We propose a common exchange format for multimodal annotation, based on the annotation graph (AG) formalism, which is supported by import and export routines in the respective tools. In the current version of this format the common denominator information can be reliably exchanged between the tools, and additional information can be stored in a standardized way.
Lexical chaining has become an important part of many NLP tasks. However, the goodness of a chaining process and hence its annotation output depends on the quality of the chaining resource. Therefore, a framework for chaining is needed which integrates divergent resources in order to balance their deficits and to compare their strengths and weaknesses. In this paper we present an application that incorporates the framework of a meta model of lexical chaining exemplified on three resources and its generalized exchange format.
Although there is a growing interest of policy makers in higher education issues (especially on an international scale), there is still a lack of theoretically well-grounded comparative analyses of higher education policy. Even broadly discussed topics in higher education research like the potential convergence of European higher education systems in the course of the Bologna Process suffer from a thin empirical and comparative basis. This paper aims to deal with these problems by addressing theoretical questions concerning the domestic impact of the Bologna Process and the role national factors play in determining its effects on cross-national policy convergence. It develops a distinct theoretical approach for the systematic and comparative analysis of cross-national policy convergence. In doing so, it relies upon insights from related research areas — namely literature on Europeanization as well as studies dealing with cross-national policy convergence.
In this paper the authors briefly outline editing functions which use methods from computational linguistics and take the structures of natural languages into consideration. Such functions could reduce errors and better support writers in realizing their communicative goals. However, linguistic methods have limits, and there are various aspects software developers have to take into account to avoid creating a solution looking for a problem: Language-aware functions could be powerful tools for writers, but writers must not be forced to adapt to their tools.
One problem of data-driven answer extraction in open-domain factoid question answering is that the class distribution of labeled training data is fairly imbalanced. In an ordinary training set, there are far more incorrect answers than correct answers. The class-imbalance is, thus, inherent to the classification task. It has a deteriorating effect on the performance of classifiers trained by standard machine learning algorithms. They usually have a heavy bias towards the majority class, i.e. the class which occurs most often in the training set. In this paper, we propose a method to tackle class imbalance by applying some form of cost-sensitive learning which is preferable to sampling. We present a simple but effective way of estimating the misclassification costs on the basis of class distribution. This approach offers three benefits. Firstly, it maintains the distribution of the classes of the labeled training data. Secondly, this form of meta-learning can be applied to a wide range of common learning algorithms. Thirdly, this approach can be easily implemented with the help of state-of-the-art machine learning software.
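The idea of deriving misclassification costs from the class distribution can be illustrated with a minimal sketch. The abstract does not state the exact estimation formula, so the inverse-frequency weighting below is an assumption chosen for illustration, and the function name `estimate_costs` and the toy labels are hypothetical.

```python
from collections import Counter


def estimate_costs(labels):
    """Estimate a per-class misclassification cost from class frequencies.

    Assumption (not taken from the paper): the cost of misclassifying an
    instance of class c is n / (k * count_c), where n is the number of
    training instances and k the number of classes. Rare classes thus
    receive proportionally higher costs.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Toy training set: far more incorrect than correct answer candidates.
labels = ["incorrect"] * 90 + ["correct"] * 10
costs = estimate_costs(labels)
print(costs)
```

The training data itself is left untouched, in contrast to over- or undersampling; only the cost assigned to each class changes. Such a cost dictionary could then be handed to any learner that accepts per-class weights.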
The authors describe two data sets submitted to the database of MWE evaluation resources: (1) cranberry expressions in English and (2) cranberry expressions in German. The first package contains a collection of 444 cranberry words in German (CWde.txt) and a collection of the corresponding cranberry expressions (CCde.txt). The second package consists of a collection of 77 cranberry words in English (CWen.txt) and a collection of the corresponding cranberry expressions (CCen.txt). The data included in these packages was extracted from the Collection of Distributionally Idiosyncratic Items (CoDII), an electronic linguistic resource of lexical items with idiosyncratic occurrence patterns. Each package contains a readme file, and can be downloaded from multiword.wiki.sourceforge.net/Resources.
Digital Text Collections, Linguistic Research Data, and Mashups: Notes on the Legal Situation
(2008)
Comprehensive data repositories are an essential part of practically all research carried out in the digital humanities nowadays. For example, library science, literary studies, and computational and corpus linguistics strongly depend on online archives that are highly sustainable and that contain not only digitized texts but also audio and video data as well as additional information such as metadata and arbitrary annotations. Current Web technologies, especially those that are related to what is commonly referred to as the Web 2.0, provide a number of novel functions such as multiuser editing or the inclusion of third-party content and applications that are also highly attractive for research applications in the areas mentioned above. Hand in hand with this development goes a high degree of legal uncertainty. The special nature of the data entails that, in quite a few cases, there are multiple holders of personal rights (mostly copyright) to different layers of data that often have different origins. This article discusses the legal problems of multiple authorships in private, commercial, and research environments. We also introduce significant differences between European and U.S. law with regard to the handling of this kind of data for scientific purposes.
This work proposes opinion frames as a representation of discourse-level associations which arise from related opinion topics. We illustrate how opinion frames help gather more information and also assist disambiguation. Finally we present the results of our experiments to detect these associations.
This work proposes opinion frames as a representation of discourse-level associations that arise from related opinion targets and which are common in task-oriented meeting dialogs. We define the opinion frames and explain their interpretation. Additionally we present an annotation scheme that realizes the opinion frames and via human annotation studies, we show that these can be reliably identified.
Discourse metaphors
(2008)
The article introduces the notion of discourse metaphor, relatively stable metaphorical mappings that function as a key framing device within a particular discourse over a certain period of time. Discourse metaphors are illustrated by case studies from three lines of research: on the cultural imprint of metaphors, on the negotiation of metaphors and on cross-linguistic occurrence. The source concepts of discourse metaphors refer to phenomenologically salient real or fictitious objects that are part of interactional space (i.e., can be pointed at, like MACHINES or HOUSES) and/or occupy an important place in cultural imagination. Discourse metaphors change both over time and across the discourses where they are used. The implications of focussing on different types of source domains for our thinking about the embodiment and sociocultural situatedness of metaphor are discussed, with particular reference to recent developments in Conceptual Metaphor Theory. Research on discourse suggests that situatedness is a crucial factor in the functioning and dynamics of metaphor.
E-VALBU: Advanced SQL/XML processing of dictionary data using an object-relational XML database
(2008)
Contemporary practical lexicography uses a wide range of advanced technological aids, most prominently database systems for the administration of dictionary content. Since XML has become a de facto standard for the coding of lexicographic articles, integrated markup functionality – such as query, update, or transformation of instances – is of particular importance. Even the multi-channel distribution of dictionary data benefits from powerful XML database services. Exemplified by E-VALBU, the most comprehensive electronic dictionary on German verb valency, we outline an integrated approach for advanced XML storing and processing within an object-relational database, and for a public retrieval frontend using Web Services and AJAX technology.
As many popular text genres such as blogs or news contain opinions by multiple sources and about multiple targets, finding the sources and targets of subjective expressions becomes an important sub-task for automatic opinion analysis systems. We argue that while automatic semantic role labeling systems (ASRL) have an important contribution to make, they cannot solve the problem for all cases. Based on the experience of manually annotating opinions, sources, and targets in various genres, we present linguistic phenomena that require knowledge beyond that of ASRL systems. In particular, we address issues relating to the attribution of opinions to sources; sources and targets that are realized as zero-forms; and inferred opinions. We also discuss in some depth that for arguing attitudes we need to be able to recover propositions and not only argued-about entities. A recurrent theme of the discussion is that close attention to specific discourse contexts is needed to identify sources and targets correctly.
This paper is a project report on the lexicographic Internet portal OWID, an Online Vocabulary Information System of German which is being built at the Institute of German Language in Mannheim (IDS). The contents of the portal and its technical approaches will be presented. The lexical database is structured in a granular way, which makes it possible to extend the search options available to lexicographers. Against the background of current research on the use of electronic dictionaries, the project OWID is also working on first ideas of user-adapted access and user-adapted views of the lexicographic data. Because the portal OWID comprises dictionaries which are available online, the design and functions of the website can be changed easily (in comparison to printed dictionaries). Ideas of implementing user-adapted views of the lexicographic data will be demonstrated by using an example taken from one of the dictionaries of the portal, namely elexiko.
In this paper, we present a suite of flexible UIMA-based components for information retrieval research which have been successfully used (and re-used) in several projects in different application domains. Implementing the whole system as UIMA components is beneficial for configuration management, component reuse, implementation costs, analysis and visualization.
The multiple gradations of German strong verbs are but manifestations of a rather uncomplicated system. There is a small number of ways to make up ablaut forms; these types of formation are identifiable in formal terms and, what is more, they have definite functions as morphological markers. Using classifications of stem forms according to quality, complexity and quantity of vowels, three types of operations involved in ablaut formation are identified. Ablaut always includes a change of quality type or a change of complexity type, and in addition it may include a change of quantity type. Ablaut forms are clearly distinguished as against bases (and against each other): their vocalism meets a defined standard of dissimilarity. On this basis, gradations are collected into inflectional classes that are defined in strictly synchronic terms. These classes continue the historical seven classes known from reference grammars. For the majority of strong verbs, membership in these classes (and thus ablaut) is predictable.
The thesis describes a fully automatic system for the resolution of the pronouns 'it', 'this', and 'that' in English unrestricted multi-party dialog. Referential relations considered include both normal NP-antecedence as well as discourse-deictic pronouns. The thesis contains a theoretical part with a comprehensive empirical study, and a practical part describing machine learning experiments.
The research project “German Today” aims to determine the amount of regional variation in (near-)standard German spoken by young and older educated adults and to identify and locate regional features. To this end, we compile an areally extensive corpus of read and spontaneous German speech. Secondary school students and 50-to-60-year-old locals are recorded in 160 cities throughout the German speaking area of Europe. All participants read a number of short texts and a word list, name pictures, translate words and sentences from English, answer questions in a sociobiographic interview, and take part in a map task experiment. The resulting corpus comprises over 1000 hours of speech, which is transcribed orthographically. Automatically derived broad phonetic transcriptions, selective manual narrow phonetic transcriptions, and variationalist annotations are added. Focussing on phonetic variation we aim to show to what extent national or regional standards exist in spoken German. Furthermore, the linguistic variation due to different contextual styles (read vs. spontaneous speech) shall be analysed. Finally, the corpus enables us to investigate whether linguistic change has occurred in spoken (near-)standard German.
How to Compare Treebanks
(2008)
Recent years have seen an increasing interest in developing standards for linguistic annotation, with a focus on the interoperability of the resources. This effort, however, requires a profound knowledge of the advantages and disadvantages of linguistic annotation schemes in order to avoid importing the flaws and weaknesses of existing encoding schemes into the new standards. This paper addresses the question how to compare syntactically annotated corpora and gain insights into the usefulness of specific design decisions. We present an exhaustive evaluation of two German treebanks with crucially different encoding schemes. We evaluate three different parsers trained on the two treebanks and compare results using EVALB, the Leaf-Ancestor metric, and a dependency-based evaluation. Furthermore, we present TePaCoC, a new testsuite for the evaluation of parsers on complex German grammatical constructions. The testsuite provides a well thought-out error classification, which enables us to compare parser output for parsers trained on treebanks with different encoding schemes and provides interesting insights into the impact of treebank annotation schemes on specific constructions like PP attachment or non-constituent coordination.
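The bracket-based comparison mentioned above can be sketched as follows. This is a simplified illustration of labeled bracketing precision, recall and F1 (the PARSEVAL-style scores that EVALB reports), not the EVALB tool itself: it ignores EVALB's parameter-file options such as punctuation handling, and the function name `bracket_scores` and the toy trees are hypothetical.

```python
from collections import Counter


def bracket_scores(gold, predicted):
    """Labeled bracketing precision, recall and F1 for one sentence.

    Each constituent is a (label, start, end) tuple. Multisets are used
    so that duplicate brackets are only matched as often as they occur.
    """
    gold_c, pred_c = Counter(gold), Counter(predicted)
    matched = sum((gold_c & pred_c).values())
    precision = matched / sum(pred_c.values()) if pred_c else 0.0
    recall = matched / sum(gold_c.values()) if gold_c else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the parser mislabels the final constituent (NP instead of PP).
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
precision, recall, f1 = bracket_scores(gold, pred)
print(precision, recall, f1)
```

Because such scores depend on how many brackets an annotation scheme produces per sentence, they are not directly comparable across treebanks with different encoding schemes, which is one reason for complementing them with other metrics.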
Current Natural Language Processing (NLP) systems feature high-complexity processing pipelines that require the use of components at different levels of linguistic and application-specific processing. These components often have to interface with external libraries, e.g. for machine learning and information retrieval, as well as with tools for human annotation and visualization. At the UKP Lab, we are working on the Darmstadt Knowledge Processing Software Repository (DKPro) (Gurevych et al., 2007a; Müller et al., 2008) to create a highly flexible, scalable and easy-to-use toolkit that allows rapid creation of complex NLP pipelines for semantic information processing on demand. The DKPro repository consists of several main parts created to serve the purposes of different NLP application areas.
Introduction
(2008)
In this paper we investigate the coverage of the two knowledge sources WordNet and Wikipedia for the task of bridging resolution. We report on an annotation experiment which yielded pairs of bridging anaphors and their antecedents in spoken multi-party dialog. Manual inspection of the two knowledge sources showed that, with some interesting exceptions, Wikipedia is superior to WordNet when it comes to the coverage of information necessary to resolve the bridging anaphors in our data set. We further describe a simple procedure for the automatic extraction of the required knowledge from Wikipedia by means of an API, and discuss some of the implications of the procedure’s performance.
Europe is a continent of many languages. We all know that, but normally when we think about this fact, we focus on national languages, the type of language that shapes our political and our linguistic geography. But as natural as it may seem today, the idea of a language closely being interrelated with one's identity does not have a very long tradition. In fact it is only since the late 18th century that we think there is some type of intimate connection between the language spoken and the identity of a person as belonging to a nation. And even if the stabilization of European nation states was closely connected with this type of reasoning, European language communities differ considerably in their way of dealing with natural variation within their national language. For some of them, it is only the standardized national language that is relevant in this respect; for others, a certain amount of variation is a central part of their linguistic identity.
Language-aware text editing
(2008)
While software developers have various power tools at their disposal that make the writing of computer programs more efficient, authors of texts do not have the support of such power tools. Text processors still operate on the level of characters and strings rather than on the level of word forms and grammatical constructions. This forces authors to constantly switch between low-level, character oriented, editing operations and high-level, conceptual, verbalisation processes. We suggest the development of language-aware text editing tools that simplify certain frequent, yet complex editing operations by defining them on the level of linguistic units. Pluralizing an entire noun phrase plus the verb forms governed by it would be an ambitious example, swapping the elements of a conjunctive construction a more modest one. We describe a pilot implementation for German where these operations are seamlessly integrated with the standard functions of an existing open-source editor. The operations can be invoked on demand and do not intrude on the authoring process. Changes can be performed locally or globally, thus simplifying the writing process considerably, and making the resulting texts more consistent.
Research on syntactic ambiguity resolution in language comprehension has shown that subjects' processing decisions are influenced by a variety of heterogeneous factors such as syntactic complexity, semantic fit and the discourse frequency of the competing structures. The present paper investigates a further potentially relevant factor in such processes: effects of syntagmatic lexical chunking (or matching to a complex memorized prefab) whose occurrence would be predicted from usage-based assumptions about linguistic categorisation. Focusing on the widely studied so-called DO/SC-ambiguity in which a post-verbal NP is syntactically ambiguous between a direct object and the subject of an embedded clause, potentially biasing collocational chunks of the relevant type are identified in a number of corpus-linguistic pretests and then investigated in a self-paced reading experiment. The results show a significant increase in processing difficulty from a collocationally neutral over a lexically biasing to a strongly biasing condition. This suggests that syntagmatically complex and partially schematic templates of the kind envisioned in usage-based Construction Grammar may impinge on speakers' online processing decisions during sentence comprehension.
Lexicography
(2008)
Lexicon schemas and their use are discussed in this paper from the perspective of lexicographers and field linguists. A variety of lexicon schemas have been developed, with goals ranging from computational lexicography (DATR) through archiving (LIFT, TEI) to standardization (LMF, FSR). A number of requirements for lexicon schemas are given. The lexicon schemas are introduced and compared to each other in terms of conversion and usability for this particular user group, using a common lexicon entry and providing examples for each schema under consideration. The formats are assessed and the final recommendation is given for the potential users, namely to request standard compliance from the developers of the tools used. This paper should foster a discussion between authors of standards, lexicographers and field linguists.
Ethnolinguistic research has gained considerable popularity over the last two decades. The most important metaphorical formula defining the main object of this research is the "linguistic picture of the world". Since large-scale comparative study projects are currently being set up, it may be worth considering what this conception of ethnolinguistics leaves out. The visual metaphor of pictures implies that speakers are able to step outside the world and look at it (and name it) from the outside. The article discusses two consequences of this metaphor that may cause problems. First, isolating language from the world of human action, of which language is after all a part, leads to the adoption of a cognitivist model of meaning as a separate stream of communication. Such a model does not fit the everyday experience of the transparency of language. Second, isolating language from life encourages the use of "timeless" methods and the study of words abstracted from the situation in which they were used (if not taken out of context). By adopting such metaphors and methods, we may lose sight of a considerable part of what is essential to colloquial language, the object of study of ethnoscience.
The metadata management system for speech corpora “memasysco” has been developed at the Institut für Deutsche Sprache (IDS) and is applied for the first time to document the speech corpus “German Today”. memasysco is based on a data model for the documentation of speech corpora and contains two generic XML schemas that drive data capture, XML native database storage, dynamic publishing, and information retrieval. The development of memasysco’s information architecture was mainly based on the ISLE MetaData Initiative (IMDI) guidelines for publishing metadata of linguistic resources. However, since we also have to support the corpus management process in research projects at the IDS, we need a finer atomic granularity for some documentation components as well as more restrictive categories to ensure data integrity. The XML metadata of different speech corpus projects are centrally validated and natively stored in an Oracle XML database. The extension of the system to the management of annotations of audio and video signals (e.g. orthographic and phonetic transcriptions) is planned for the near future.
The present study examines the dynamics of the kanji combinations that form common (or general) and proper nouns in Japanese. The following three results were obtained. First, the degree of distribution results from two similar processes which are based on a steady-state of birth-and-death processes with different birth and death rates, resulting in a positive negative binomial distribution with the proper nouns and in a positive Waring distribution with common nouns. Second, all rank-frequency distributions follow the negative hypergeometric distribution used very frequently in ranking problems. Third, the building of kanji compounds follows a dissortative strategy. The higher the outdegree of a kanji, the more it prefers kanji with lower indegrees. A linear dependence can be observed with common nouns, whereas the relationship between compounded kanji is rather curvilinear with proper nouns. The actual analytical expression is not yet known.
In the project SemDok (Generic document structures in linearly organised texts), funded by the German Research Foundation DFG, a discourse parser for a complex text type (scientific articles, for example) is being developed. Discourse parsing (henceforth DP) according to the Rhetorical Structure Theory (RST) (Mann and Taboada, 2005; Marcu, 2000) deals with automatically assigning a text a tree structure in which discourse segments and rhetorical relations between them, such as Concession, are marked. For identifying the combinable segments, declarative rules are employed, which describe linguistic and structural cues and constraints about possible combinations by referring to different XML annotation layers of the input text, and external knowledge bases such as a discourse marker lexicon, a lexico-semantic ontology (later to be combined with a domain ontology), and an ontology of rhetorical relations. In our text-technological environment, the obvious choice of formalism to represent such ontologies is OWL (Smith et al., 2004). In this paper, we describe two OWL ontologies and how they are consulted by the discourse parser to solve certain tasks within DP. The first ontology is a taxonomy of rhetorical relations which was developed in the project. The second one is an OWL version of GermaNet, the model of which we designed together with our project partners.
In the context of a Nordic Conference on Bilingualism, it can be a rewarding task to look at issues such as language planning, policy and legislation from a perspective of the southern neighbours of the Nordic world. This paper therefore intends to point attention towards a case of societal multilingualism at the periphery of the Nordic world by dealing with recent developments in language policy and legislation with regard to the North Frisian speech community in the German Land of Schleswig-Holstein. As I will show, it is striking to what degree there are considerable differences in the discourse on minority protection and language legislation between the Nordic countries and a cultural area which may arguably be considered to be part of the Nordic fringe - and which itself occasionally takes Scandinavia as a reference point, e.g. in the recent adoption of a pan-Frisian flag modelled on the Nordic cross (Falkena 2006).
The main focus of the paper will be on the Frisian Act which was passed in the Parliament of Schleswig-Holstein in late 2004. It provides a certain legal basis for some political activities with regard to Frisian, but falls short of creating a true spirit of minority language protection and/or revitalisation. In contrast to the traditions of the German and Danish minorities along the German-Danish border and to minority protection in Northern Scandinavia (in particular to Sámi language rights), the approach chosen in the Frisian Act is extremely weak and has no connotation of long-term oriented language-planning, let alone a rights-based perspective.
The paper will then look at policy developments in the time since the Act was passed, e.g. in the Schleswig-Holstein election campaign in 2005, and at the latest perceptions of the Frisian language situation in the discourse on North Frisian policy in Schleswig-Holstein majority society. In the final part of the paper, I will discuss reasons for the differences in minority language policy discourse between Germany and the Nordic countries, and try to provide an outlook on how Frisian could benefit from its geographic proximity to the Nordic world.
In this paper, the authors describe a semi-automated approach to refine the dictionary-entry structure of the digital version of the Wörterbuch der deutschen Gegenwartssprache (WDG, en.: Dictionary of Present-day German), a dictionary compiled and published between 1952 and 1977 by the Deutsche Akademie der Wissenschaften that comprises six volumes with over 4,500 pages containing more than 120,000 headwords. We discuss the benefits of such a refinement in the context of the dictionary project Digitales Wörterbuch der deutschen Sprache (DWDS, en.: Digital Dictionary of the German Language). In the current phase of the DWDS project, we aim to integrate multiple dictionary and corpus resources for the German language into a digital lexical system (DLS). In this context, we plan to expand the current DWDS interface with several special-purpose components, which are adaptive in the sense that they offer specialized data views and search mechanisms for different dictionary functions (e.g. text comprehension, text production) and different user groups (e.g. journalists, translators, linguistic researchers, computational linguists). One prerequisite for generating such data views is selective access to the lexical items in the article structure of the dictionaries which are the object of study. For this purpose, the representation of the eWDG has to be refined. The focus of this paper is on the semi-automated approach used to transform the eWDG into a refined version in which the main structural units can be explicitly accessed. We will show how this refinement opens new and flexible ways of visualizing and querying the lexicographic content of the refined version in the context of the DLS project.
Rescuing Legacy Data
(2008)
This paper discusses issues that arise in the transformation of electronic language data from outdated to modern, sustainable formats. We first describe the problem and then present four different cases in which corpora of spoken language were converted from legacy formats to an XML-based representation. For each of the four cases, we describe the conversion workflow and discuss the difficulties that we had to overcome. Based on this experience, we formulate some more general observations about transforming legacy data and conclude with a set of best practice recommendations for a more sustainable handling of language corpora.
The development of user-adapted views of lexicographic data is frequently called for in dictionary research on electronic reference works and hypertext information systems. For the printed dictionary, it has been indispensable to develop a complete dictionary tailored to a particular user group and particular usage situations. In contrast, any electronic presentation of lexicographic data offers possibilities to define user-specific views of an initially user-unspecific resource. However, research on the use of dictionaries in general still has to answer several open questions as far as this subject is concerned. This paper will first provide an overview of the present state of research on dictionary use with respect to electronic lexicography. Subsequently, further prerequisites for possible user-adapted access to data are explained, using OWID, the Online Vocabulary Information System of the Institut für Deutsche Sprache, as an example. Finally, the results achieved on the subject so far will be outlined, and the prospects of potential user-adapted presentations of lexicographic data will be highlighted.
This is a study of how aspects of information structure can be captured within a formal grammar of Spanish, couched in the framework of Head-Driven Phrase Structure Grammar (HPSG, Pollard and Sag 1994). While a large number of morphological, syntactic and semantic aspects in a variety of languages have been successfully analysed in this theory, information structure has not been paid the same attention in the HPSG literature. However, as a theory of signs, HPSG should include all levels of description without which the structural descriptions offered by the grammar would ultimately remain incomplete. Languages often explicitly mark the information-structural partitioning of utterances. Depending on the particular language, linguistic resources used for this purpose include prosody (stress/intonation), syntax (e.g. constituent order, special syntactic constructions) and morphology (e.g. special affixes). In HPSG, phonological, syntactic, semantic and pragmatic information is represented in parallel, which would seem to be a well-suited architecture for modelling the sort of interfaces called for.
Our research task consists in the study of the way in which multilingual resources are mobilized in team work within collaborative activities; how they are exploited in a specific way in order both to enhance collaboration and to respect the specificities of the members’ linguistic competences and practices within the team. Central to our analytical work, which is inspired by ethnomethodological conversation analysis, is the relationship between multilingual resources and the situated organization of linguistic uses and of social practices. These two aspects are reflexively articulated, multilingual resources being shaped by the very contexts of their use and activities being constrained and thus structured by the available resources.
This article discusses the question whether the distinction between subordination and coordination is parallel in syntax and discourse. Its main thesis is that subordination and coordination, as they are commonly understood in the linguistic literature, are genuinely syntactic concepts. The distinction between hierarchical and non-hierarchical connection in discourse structure, as far as it is defined clearly in the literature, is of a quite different nature. The syntax and semantics of connectives (as the most prominent morphosyntactic means by which subordination and coordination are encoded) offers little evidence to support the assumption of a structural parallelism between syntax and discourse. As a methodological consequence, sentence and discourse structure should not be mixed up in linguistic analysis.
Between 1884 and 1900, Germany established protectorates in large areas of the South Pacific. The authorities assumed that the linguistically extremely diverse areas would pose communication problems. Thus the question arose whether German should become the lingua franca in the South Pacific. After a controversial discussion, the German government implemented language policies to promote the German language in the colonies. This chapter shows why, on the one hand, German language policies were doomed to failure and why, on the other, they unintentionally supported other linguistic developments such as the introduction of borrowing from German into indigenous languages, the development of German settler varieties, and the spread of pidgin languages.
In our study we use the experimental framework of priming to manipulate our subjects’ expectations of syllable prominence in sentences with a well-defined syntactic and phonological structure. The results show that it is possible to prime prominence patterns and that priming leads to significant differences in the judgment of syllable prominence.
This paper presents the Kicktionary, a multilingual (English - German - French) electronic lexical resource of the language of football. In the Kicktionary, methods from corpus linguistics and two approaches to lexical semantics - the theory of frame semantics and the concept of semantic relations - are combined to construct a lexical resource in which the user can explore relationships between lexical units in various ways. This paper explains the theoretical background of the Kicktionary, sketches the data and methods which were used in its construction, and describes how the resulting resource is presented to users via a set of hyperlinked webpages.
Laz, a sister language of Georgian spoken on the southeastern coast of the Black Sea, is the only member of the South Caucasian family which is spoken primarily in Turkey. Due to the socio-political circumstances all speakers of Laz living in Turkey are bilingual and use Laz primarily in private communication. Using these observations as a starting point, the paper looks at the question of whether Laz is an endangered language. In order to clarify the sociolinguistic situation of Laz in Turkey, the different levels involved in the process of gradual language loss (language-external factors, speech behaviour and structural consequences within the language system) are dealt with in detail. To determine which data should be taken as basis for the documentation of the language, the paper also discusses linguistic criteria for differentiating between fully competent speakers of Laz and speakers who show signs of language attrition.
The Online-Wortschatz-Informationssystem Deutsch (OWID; Online Vocabulary Information System German) of the Institut für Deutsche Sprache (IDS; German Language Institute) in Mannheim is a lexicographic Internet portal for various electronic dictionary resources that are being compiled at the IDS. It is an explicit goal of OWID not to present a random collection of unrelated reference works but to build a network of actually related lexicographic products. Hence, the core of the project is the design of an innovative concept of data modelling and structuring. The goal of this granular data modelling is to allow flexible access to each individual lexicographic resource as well as access across diverse dictionary resources. At the same time, fine-grained interconnectedness of all resources should be made possible. Every lexicographic resource within OWID (elexiko, Neologismenwörterbuch, Wortverbindungen online, Schulddiskurs im ersten Nachkriegsjahrzehnt) meets this requirement with regard to data modelling and structuring. The paper explains the underlying consistent concept of the data modelling for the overall heterogeneous lexicographical resources. It also shows how the modelling potential has been converted into the Internet presence of OWID.
The Meta-data-Database of a Next Generation Sustainability Web-Platform for Language Resources
(2008)
Our goal is to provide a web-based platform for the long-term preservation and distribution of a heterogeneous collection of linguistic resources. We discuss the corpus preprocessing and normalisation phase that results in sets of multi-rooted trees. At the same time we transform the original metadata records, just like the corpora annotated using different annotation approaches and exhibiting different levels of granularity, into the all-encompassing and highly flexible format eTEI for which we present editing and parsing tools. We also discuss the architecture of the sustainability platform. Its primary components are an XML database that contains corpus and metadata files and an SQL database that contains user accounts and access control lists. A staging area, whose structure, contents, and consistency can be checked using tools, is used to make sure that new resources about to be imported into the platform have the correct structure.
‘Linguistic relativity’ has become a major keyword in debates on the psychological significance of language diversity. In this context, the term ‘relativity’ was originally taken on loan from Einstein’s then-recent theories by Edward Sapir (1924) and Benjamin L. Whorf (1940). The present paper assesses how far the idea of linguistic relativity does analogically build on relevant insights in modern physics, and fails to find any substantial analogies. The term was used rhetorically by Sapir and Whorf, and has since been incorporated into a cognitivist research programme that seeks to answer whether ‘language influences thought’. Contemporary research on ‘linguistic relativity’ has developed into a distinct way of studying language diversity, which shares a lot with the universalistic cognitivist framework it opposes, but little with relational approaches in science.
In spring 2002, we celebrated the inauguration of the first German-Russian-Jewish kindergarten in Berlin. Nowadays, there are seven bilingual German-Russian kindergartens with 460 places and 78 bilingual kindergartens with other combinations of languages [SENBWF]. Maybe it is not enough, taking into account the large proportion of immigrants in the population of Berlin. And yet, much progress has been achieved, endorsing the fact that German society has begun to change its attitude towards other languages on its territory. The initial request for German monolingualism first changed into societal tolerance of multilingualism and eventually to the recognition of the value of multilingualism. This process is a very slow one, and it is not yet complete. In my article, I would like to look at the development in the last few years of the political framework that has made possible, on the one hand, the opening of bilingual kindergartens in Berlin, and on the other hand, to consider what has hampered this process until now. I would like to emphasise the three most important political spheres: linguistic, educational and integrational.
This contribution deals with the representation of verbs with multiple meanings or senses in general monolingual dictionaries. Criteria for differentiating senses in dictionary entries have traditionally been formulated with respect to the vocabulary in general. This paper argues that, while some criteria do indeed apply to the entire lexicon, many of them are relevant only to specific semantic classes. This will be demonstrated considering two selected verb classes: speech-act verbs and perception verbs. Like verbs of other classes, speech-act verbs and perception verbs may be ambiguous in different but recurrent ways. Since recurrent patterns of ambiguity are always typical of particular semantic classes, class-specific semantic criteria are formulated to decide whether a particular ambiguous speech-act or perception verb should be treated as being polysemous or homonymous in dictionary entries. In addition to these class-specific semantic criteria, the semantic-syntactic criterion of identity or difference of argument structure is suggested for the lexicographical representation of verbs which may not be considered to be polysemous or homonymous on the basis of semantic criteria alone. According to the suggested argument-structure criterion, these verbs should be treated as polysemous when their senses correlate with identical argument structures and as homonymous when their senses correlate with different argument-structure properties. As opposed to the semantic criteria suggested, the semantic-syntactic criterion of identity vs. difference of argument structure applies to verbs of different semantic classes. However, as will be illustrated by the discussion of the different senses of smell, it may sometimes force us to treat different but related senses as corresponding to two distinct lexical items.
In order to solve this problem, the criteria suggested are supplemented by a preference rule stating that semantic criteria apply prior to the semantic-syntactic criterion of identity vs. difference of argument structure...
The paper reports on experiments with acoustic recordings of a self-built replica of the historic speaking machine of Wolfgang von Kempelen. Several possibilities of the reed as the glottal excitation mechanism were tested. Perception tests with naïve listeners revealed that the machine-generated words 'mama' and 'papa' were partially recognised as an authentic child voice, as was also the case in von Kempelen's demonstrations in the late 18th century.
Open peer commentary on the target article “Who Conceives of Society?” by Ernst von Glasersfeld. Excerpt: I will focus on one crucial step in von Glasersfeld’s argumentation, viz. his view that every individual constructs his own private meanings (understood as conceptual structures or elements thereof) for linguistic expressions, so that linguistic interaction and even communication in general is based on a notion of compatibility between different speakers’ private conceptual schemes. The central question here is: “Just what does it mean that different private conceptual schemes (private meanings) are compatible, or what constitutes a viable criterion to this end?” As von Glasersfeld himself stresses twice (§28, §37), the criteria to be looked for can only be “public,” residing in properties of verbal and non-verbal actions of the interacting individuals, properties that can be sensed and processed by the participating system.