This article examines a radio episode in which two hosts perform a heterosexual coming out on International Coming Out Day (11 October). From the perspective of ethnomethodologically inspired conversation analysis, it studies a collection of coming-out occurrences, which makes it possible not only to identify a recurrent sequential format and the way it contributes to the effectiveness of the practice, but also to reflect on how it can be used in different social contexts, notably mediated and media contexts. In particular, the article shows how the practice serves a radio programme about coming out and prepares the transition to the treatment of homosexuality on the radio. Drawing on a video recording of the hosts' work in the radio studio, the article describes how the theme of International Coming Out Day is produced and orchestrated both behind the scenes and on air. In doing so, it shows the contribution of conversation analysis to the approach to coming out in gender studies, where the practice is widely discussed but not analysed on the basis of documented occurrences. The article thus revisits the epistemology of the closet dear to Eve Sedgwick, proposing an anatomy of coming out in a mediated context that sheds light on what is at stake not only epistemically but also in terms of normativization, publicization and spectacularization.
This paper analyses positioning practices that the makers of a talk show use to present their guests to the television audience as relevant discussion partners on the topic of "tax evasion by celebrities". It examines how the makers of the talk show manage, already at the first introduction, to present the guests as "prototypical representatives" and to position them relative to one another through the interplay of a voice-over and the camera work. Of the five participants in the talk show, two of these first introductions are analysed in detail. They concern the presentation of two guests who stand in a clearly antagonistic relationship to each other. These guests are introduced one immediately after the other. On the basis of all five guest presentations, which we have reconstructed in detail but unfortunately cannot also present here for reasons of space, a structured web of positionings becomes apparent. At its centre, this web shows the thematic and personal "opposition" that we have reconstructed. On the periphery, a total of four representatives of relevant societal positions on the topic of the talk show are then arranged. These are representatives of the judiciary, of politics, of everyday morality, and of psychology and theology. In theoretical terms, the analyses are based on multimodal conceptions of positioning and recipient design. In methodological terms, the analysis is oriented towards multimodal interaction analysis.
The IMS Open Corpus Workbench (CWB) software currently uses a simple tabular data model with proven limitations. We outline and justify the need for a new data model to underlie the next major version of CWB. This data model, dubbed Ziggurat, defines a series of types of data layer to represent different structures and relations within an annotated corpus; each such layer may contain variables of different types. Ziggurat will allow us to gradually extend and enhance CWB’s existing CQP-syntax for corpus queries, and also make possible more radical departures relative not only to the current version of CWB but also to other contemporary corpus-analysis software.
This contribution investigates to what extent the frequency of a word is related to its being spelled correctly. Are frequent words spelled correctly more often and earlier? And what role does the orthographic regularity of word structures play in this? Using automated analysis methods from the large-scale study "Automatisierte Rechtschreibdiagnostik" (Fay/Berkling/Stüker 2012), more than 1000 pupils' texts from grades 2 to 8 are examined in this respect. As a result, some assumptions that have so far been based mainly on practical experience from language-teaching work are empirically confirmed; in addition, they are differentiated and extended with regard to specific spelling phenomena.
Recipient design is a key constituent of intersubjectivity in interaction. Recipient design of turns is informed by prior knowledge about and shared experience with recipients. Designing turns in order to be maximally effective for the particular recipient(s) is crucial for accomplishing intersubjectively coordinated action. This paper reports on a specific pragmatic structure of recipient design, i.e. counterfactual recipient design, and how it impinges on intersubjectivity in interaction. Based on an analysis of video-recorded data from driving school lessons in German, two kinds of counterfactual recipient design of instructors' requests are distinguished: pedagogic and egocentric turn-design. Counterfactual, pedagogic turn-design is used strategically to diagnose student skills and to create opportunities for corrective instructions. Egocentric turn-design rests on private, non-shared knowledge of the instructor. Egocentrically designed turns imply expectations of how to comply with requests which cannot be recovered by the student and which lead to a breakdown of intersubjective cooperation. This paper identifies practices, sources and interactional consequences of these two kinds of counterfactual recipient design. In addition, the study enhances our understanding of recipient design in at least three ways. It shows that recipient design concerns not only referential and descriptive practices, but also the indexing of intelligible projections of next actions; it highlights the productive, other-positioning effects of recipient design; it argues that recipient design should be analyzed in terms of temporally extended interactional trajectories, linking turn-constructional practices to interactional histories and consecutive trajectories of joint action.
The public acceptance and impact of research in the natural sciences and engineering depend fundamentally on whether its aims and results can be communicated to the public. Yet the contents of current research projects are often difficult for a lay audience to access and understand. With the aim of improving the societal discussion of research in the natural sciences and engineering, we empirically investigate and evaluate an important sector of popular-science discourse in Germany in the project PopSci – Understanding Science. To this end, we identify the linguistic features of German popular-science texts using corpus-based methods and investigate their effect on the cognitive processing of the texts by lay readers. For this purpose, we use knowledge tests before and after reading. In addition, we measure readers' eye movements while they read popular-science texts. From this combination of different methods, we attempt to derive initial recommendations for improving the linguistic style and knowledge representation of popular-science texts.
Some 25 years ago, a large-scale repatriation of Russian Germans began. As a result, more than 2.5 million people who grew up in the USSR, Russia, or other post-Soviet states became German citizens with native or near-native command of the Russian language. The uncomfortable differences they exhibited in comparison to those who were supposed to accept them as equals, yet failed to do so, compelled them to search for self-designations that would accommodate their new identity and to bond together to form a new minority. The authors examine the attempts of Soviet/Russian Germans to redefine their ethnic identity in terms of not just blood but also language and culture, focusing on two particular cases: the use of the name Rusak in the internet forums of the repatriated immigrants; and the linguistic-cultural practices of the older generation of immigrants.
In a project called "A Library of a Billion Words" we needed an implementation of the CTS protocol that is capable of handling a text collection containing at least 1 billion words. Because the existing solutions did not work at this scale or were still in development, I started an implementation of the CTS protocol using methods that MySQL provides. Last year we published a paper that introduced a prototype with the core functionalities without being compliant with the specifications of CTS (Tiepmar et al., 2013). The purpose of this paper is to describe and evaluate the MySQL-based implementation now that it fulfils the specifications version 5.0 rc.1, and to mark it as finished and ready to use. Further information, online instances of CTS for all described datasets, and binaries can be accessed via the project's website.
The Czech National Corpus (CNC) is a long-term project striving for extensive and continuous mapping of the Czech language. This effort results mostly in the compilation and maintenance of, and free public access to, a range of corpora, with the aim of offering diverse, representative, and high-quality data for empirical research, mainly in linguistics. Since 2012, the CNC has been officially recognized as a research infrastructure funded by the Czech Ministry of Education, Youth and Sports, which has caused a recent shift towards user service-oriented operation of the project. All project-related resources are now integrated into the CNC research portal at http://www.korpus.cz/. Currently, the CNC has an established and growing user community of more than 4,500 active users in the Czech Republic and abroad who submit almost 1,900 queries per day using one of the user interfaces. The paper discusses the main CNC objectives for each particular domain, aiming at an overview of the current situation supplemented by an outline of future plans.
In this paper, I present the COW14 tool chain, which comprises a web corpus creation tool called texrex, wrappers for existing linguistic annotation tools, as well as an online query software called Colibri2. By detailed descriptions of the implementation and systematic evaluations of the performance of the software on different types of systems, I show that the COW14 architecture is capable of handling the creation of corpora of up to at least 100 billion tokens. I also introduce our running demo system, which currently serves corpora of up to roughly 20 billion tokens in Dutch, English, French, German, Spanish, and Swedish.
Contents:
1. Michal Křen: Recent Developments in the Czech National Corpus, p. 1
2. Dan Tufiş, Verginica Barbu Mititelu, Elena Irimia, Stefan Dumitrescu, Tiberiu Boros, Horia Nicolai Teodorescu: CoRoLa Starts Blooming – An update on the Reference Corpus of Contemporary Romanian Language, p. 5
3. Sebastian Buschjäger, Lukas Pfahler, Katharina Morik: Discovering Subtle Word Relations in Large German Corpora, p. 11
4. Johannes Graën, Simon Clematide: Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora, p. 15
5. Stefan Evert, Andrew Hardie: Ziggurat: A new data model and indexing format for large annotated text corpora, p. 21
6. Roland Schäfer: Processing and querying large web corpora with the COW14 architecture, p. 28
7. Jochen Tiepmar: Release of the MySQL-based implementation of the CTS protocol, p. 35
We present studies using the 2013 log files from the German version of Wiktionary. We investigate several lexicographically relevant variables and their effect on look-up frequency: Corpus frequency of the headword seems to have a strong effect on the number of visits to a Wiktionary entry. We then consider the question of whether polysemic words are looked up more often than monosemic ones. Here, we also have to take into account that polysemic words are more frequent in most languages. Finally, we present a technique to investigate the time-course of look-up behaviour for specific entries. We exemplify the method by investigating influences of (temporary) social relevance of specific headwords.
This article reports on the ongoing work on a new version of the metadata framework Component Metadata Infrastructure (CMDI), central to the CLARIN infrastructure. Version 1.2 introduces a number of important changes based on the experience gathered in the last five years of intensive use of CMDI by the digital humanities community, addressing problems encountered, but also introducing new functionality. Besides the consolidation of the structure of the model and schema sanity, new means for lifecycle management have been introduced, aimed at combating the observed proliferation of components; a new mechanism for the use of external vocabularies will contribute to more consistent use of controlled values; and cues for tools will allow improved presentation of the metadata records to human users. The feature set has been frozen and approved, and the infrastructure is now entering a transition phase, in which all the tools and data need to be migrated to the new version.
In a previous article (Faaß et al., 2012), a first attempt was made at documenting and encoding morphemic units of two South African Bantu languages, i.e. Northern Sotho and Zulu, with the aim of describing and storing the morphemic units of these two languages in a single relational database, structured as a hierarchical ontology. As a follow-up, the current article describes the implementation of our part-of-speech ontology. We give a detailed description of the morphemes and categories contained in the database, highlighting the need and reasons for a flexible ontology which will provide for both language specific and general linguistic information. By giving a detailed account of the methodology for the population of the database, we provide linguists from other Bantu languages with a road map for extending the database to also include their languages of specialization.
This volume collects the talks given at the 9th Hildesheim Evaluation and Retrieval Workshop (Hildesheimer Evaluierungs- und Retrieval-Workshop, HIER), which took place on 9 and 10 July 2015 at the University of Hildesheim. The HIER workshop series began in 2001 with the aim of presenting and discussing the research results of information science at Hildesheim. By now, cooperation partners from other institutions regularly take part, which we very much welcome. HIER also provides a forum for system presentations and practice-oriented contributions.
We present a quantitative approach to disambiguating flat morphological analyses and producing more deeply structured analyses. Based on existing morphological segmentations, possible combinations of resulting word trees for the next level are filtered first by criteria of linguistic plausibility and then by weighting procedures based on the geometric mean. The frequencies for weighting are derived from three different sources (counts of morphs in a lexicon, counts of largest constituents in a lexicon, counts of token frequencies in a corpus) and can be used either to find the best analysis on the level of morphs or on the next higher constituent level. The evaluation shows that for this task corpus-based frequency counts are slightly superior to counts of lexical data.
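The weighting step described above can be illustrated with a minimal sketch. It assumes that each candidate analysis is a tuple of constituents and that a single frequency table is available; the function names, the toy counts, and the choice to score an unseen constituent as zero are all assumptions made for illustration and are not taken from the paper.

```python
import math


def geometric_mean(freqs):
    """Geometric mean of positive frequencies, computed in log space.

    Assumption: an empty list or any non-positive count yields 0.0,
    so an analysis with an unseen constituent is ranked last.
    """
    if not freqs or any(f <= 0 for f in freqs):
        return 0.0
    return math.exp(sum(math.log(f) for f in freqs) / len(freqs))


def best_analysis(candidates, freq):
    """Pick the candidate split whose constituents score highest."""
    return max(
        candidates,
        key=lambda parts: geometric_mean([freq.get(p, 0) for p in parts]),
    )


# Hypothetical counts, invented purely for this example.
toy_freq = {"un": 900, "lock": 400, "able": 600, "unlock": 50, "lockable": 5}

# Two competing bracketings of the flat segmentation un + lock + able.
candidates = [("unlock", "able"), ("un", "lockable")]

chosen = best_analysis(candidates, toy_freq)
```

The same scoring function could be fed counts from a lexicon or from a corpus, which is the comparison the abstract reports; the sketch does not model the preceding linguistic plausibility filter.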
In this article I argue for the need for a grammar of spoken language. It would have the same functions as the grammar of written language: describing and explaining the fundamental units of spoken language and their features, and describing the composition of those units and their conjunction. The basic units in the grammar of spoken language are the sound, the word, the functional unit, the conversational turn and the conversation itself. Furthermore, the central characteristics of spoken language and their impact on grammar have to be taken into account. They are interactivity, multimodality, processability and great variability. After presenting my concepts, I discuss three alternative concepts of a grammar of spoken language: online syntax, construction grammar and multimodal grammar. The article concludes by discussing the role of spoken language grammar in language and foreign language teaching.
This contribution explores the relationship between the complexity of political argumentation processes and the diversification of the semantics of keywords whose meaning is contested in the argumentation process and unfolded in numerous facets. The object of the study is the use of "ecology" ("Ökologie") in the arbitration talks on the Stuttgart 21 railway project. In contrast to previous analyses of semantic struggles, the focus is less on how an expression is given meaning by one party in opposition to others. Rather, it is shown how semantic diversification and ambiguity of "ecology" arise in the expert argumentation process and what communicative effects this has for the possibility of citizen participation. Three practices are identified by which the interaction participants themselves respond to semantic diversification and ambiguity and attempt to make the expression unambiguously interpretable and the quaestio decidable: imputations of strategy, popularizations and populism. The interaction analyses show that these practices themselves reproduce the problem they are meant to solve.
We investigate whether non-configurational languages, which display more word order variation than configurational ones, require more training data for a phenomenon to be parsed successfully. We perform a tightly controlled study comparing the dative alternation for English (a configurational language), German, and Russian (both non-configurational). More specifically, we compare the performance of a dependency parser when only canonical word order is present with its performance on data sets in which all word orders are present. Our results show that, for all languages, not only is canonical data easier to parse, but there is also no direct correspondence between the size of training sets containing free(er) word order variation and performance.
To optimize the sharing and reuse of existing data, many funding organizations now require researchers to specify a management plan for research data. In such a plan, researchers are supposed to describe the entire life cycle of the research data they are going to produce, from data creation to formatting, interpretation, documentation, short-term storage, long-term archiving and data re-use. To support researchers with this task, we built DMPTY, a wizard that guides researchers through the essential aspects of managing data, elicits information from them, and finally, generates a document that can be further edited and linked to the original research proposal.