This thesis consists of the following three papers, all of which have been published in international peer-reviewed journals:
Chapter 3: Koplenig, Alexander (2015c). The Impact of Lacking Metadata for the Measurement of Cultural and Linguistic Change Using the Google Ngram Data Sets—Reconstructing the Composition of the German Corpus in Times of WWII. Published in: Digital Scholarship in the Humanities. Oxford: Oxford University Press. [doi:10.1093/llc/fqv037]
Chapter 4: Koplenig, Alexander (2015b). Why the quantitative analysis of diachronic corpora that does not consider the temporal aspect of time-series can lead to wrong conclusions. Published in: Digital Scholarship in the Humanities. Oxford: Oxford University Press. [doi:10.1093/llc/fqv030]
Chapter 5: Koplenig, Alexander (2015a). Using the parameters of the Zipf–Mandelbrot law to measure diachronic lexical, syntactical and stylistic changes – a large-scale corpus analysis. Published in: Corpus Linguistics and Linguistic Theory. Berlin/Boston: de Gruyter. [doi:10.1515/cllt-2014-0049]
Chapter 1 introduces the topic by describing and discussing several basic concepts relevant to the statistical analysis of corpus linguistic data. Chapter 2 presents a method to analyze diachronic corpus data and a summary of the three publications. Chapters 3 to 5 each represent one of the three publications. All papers are printed in this thesis with the permission of the publishers.
In this contribution, the author describes the influence that personal emotional identity has on the communication of emotions in interaction (Section 5). To this end, he first explains what he means by emotional identity (Section 2), then sets out his understanding of experience and emotions (Section 3), and presents models of the communication of emotions (Section 4) that are needed to describe the influence of emotional identity.
Lexicographic processes leading to the publication of printed dictionaries have been a focus of dictionary research for several decades, and the phases distinguished for them (preparation, data acquisition, data preparation, data analysis, and preparation for typesetting and printing) can by now be regarded as established. The discussion and description of lexicographic processes for internet dictionaries, by contrast, are still in their infancy. There is no doubt that lexicographic processes in the publication of internet dictionaries differ from those for print dictionaries. However, several questions have not yet been answered in detail: in what ways they differ, what influence the new possibilities of obtaining data from electronic text corpora have on the processes, how subsets of the vocabulary are to be selected for editing, how different versions are to be versioned and archived, and finally how the changes in lexicographic processes affect users, and whether and how users can be involved in these processes.
These and other questions were therefore the subject of the fourth working meeting of the scientific network “Internetlexikografie”, which took place on 22 and 23 November 2012 at the University of Trier and was organized by the Kompetenzzentrum für elektronische Erschließungs- und Publikationsverfahren in den Geisteswissenschaften/Trier Center for Digital Humanities. The examination of the lexicographic process was continued in three working groups, which dealt with the selection criteria, implementation and problems of vocabulary subsets chosen for editing, with archiving and versioning, and with corpus-based procedures for extending existing lexicographic resources. The present volume deals with the results reached in the discussion rounds and working groups and with the further questions identified there.
Diskurs
(2016)
Linguistic discourse analysis investigates the linguistic constitution, the social effects, and the relations of knowledge and power of serial public communication. As a broad research field open to neighbouring disciplines, it has theoretical and methodological interfaces with, for example, text linguistics, pragmatics and cognitive semantics, as well as with the sociology of knowledge, literary studies and history. The present bibliography gives a concise introduction to discourse and discourse analysis and provides an overview of the interdisciplinary connections of the research literature.
Profit cumulat summierter Gewinn: wirtschaftssprachliche Kollokationen im Rumänischen und Deutschen
(2016)
This volume is a comparative linguistic contribution to collocation research. By way of example, 3022 combinations of 116 Romanian terms from the language of business were excerpted, aligned with German combinations, and evaluated statistically. A selection of the Romanian collocations obtained on this basis and their German equivalents were analysed linguistically in detail. The focus was on general-language verbs in specialized-language collocations.
In the course of the events in the Arab world since 2011, the term Arabischer Frühling (Arab Spring) gained in importance and advanced to become the key expression of the discourse. The contribution investigates how the term Arabischer Frühling was and is realized linguistically in the German-speaking public sphere, by what linguistic means it was and is constructed, and with which events (at times also catastrophes) it was and is identified. It addresses the symbolic function of spring both from the historical perspective of the Vormärz period and from today's point of view. The study also considers the season terms Winter, Herbst and Sommer (winter, autumn and summer) and their symbolic relationship to the Arab revolutions.
Nachruf auf Siegfried Grosse
(2016)
Objective: Discrimination against nonnative speakers is widespread and largely socially acceptable. Nonnative speakers are evaluated negatively because accent is a sign that they belong to an outgroup and because understanding their speech requires unusual effort from listeners. The present research investigated intergroup bias, based on stronger support for hierarchical relations between groups (social dominance orientation [SDO]), as a predictor of hiring recommendations of nonnative speakers.
Method: In an online experiment using an adaptation of the thin-slices methodology, 65 U.S. adults (54% women; 80% White; M[age] = 35.91, range = 18–67) heard a recording of a job applicant speaking with an Asian (Mandarin Chinese) or a Latino (Spanish) accent. Participants indicated how likely they would be to recommend hiring the speaker, answered questions about the text, and indicated how difficult it was to understand the applicant.
Results: Independent of objective comprehension, participants high in SDO reported that it was more difficult to understand a Latino speaker than an Asian speaker. SDO predicted hiring recommendations of the speakers, but this relationship was mediated by the perception that nonnative speakers were difficult to understand. This effect was stronger for speakers from lower status groups (Latinos relative to Asians) and was not related to objective comprehension.
Conclusions: These findings suggest a cycle of prejudice toward nonnative speakers: Not only do perceptions of difficulty in understanding cause prejudice toward them, but also prejudice toward low-status groups can lead to perceived difficulty in understanding members of these groups.
Smiling individuals are usually perceived more favorably than non-smiling ones: they are judged as happier, more attractive, competent, and friendly. These seemingly clear and obvious consequences of smiling are assumed to be culturally universal; however, most of the psychological research is carried out in WEIRD societies (Western, Educated, Industrialized, Rich, and Democratic), and the influence of culture on social perception of nonverbal behavior is still understudied. Here we show that a smiling individual may be judged as less intelligent than the same non-smiling individual in cultures low on the GLOBE’s uncertainty avoidance dimension. Furthermore, we show that corruption at the societal level may undermine the prosocial perception of smiling: in societies with high corruption indicators, trust toward smiling individuals is reduced. This research fosters understanding of the cultural framework surrounding nonverbal communication processes and reveals that in some cultures smiling may lead to negative attributions.
Status und Gebrauch des Niederdeutschen 2016. Erste Ergebnisse einer repräsentativen Erhebung
(2016)
Who understands Plattdeutsch (Low German) today, and who speaks it? Who uses the Low German media and cultural offerings? What ideas do people in northern Germany associate with Low German, and what is their attitude towards their regional language?
The present brochure addresses these and further questions using representative data obtained through a telephone survey of a total of 1,632 people from eight federal states (Bremen, Hamburg, Mecklenburg-Vorpommern, Niedersachsen, Schleswig-Holstein as well as Brandenburg, Nordrhein-Westfalen and Sachsen-Anhalt).
Stress that spills over into one's intimate relationship (Repetti, 1989) can increase negative behavior between partners (Repetti, 1989; Schulz et al., 2004), which in turn can negatively affect relationship outcomes, such as satisfaction (Karney and Bradbury, 1995; Randall and Bodenmann, 2016). This negative stress spillover process may, however, be mitigated if couples help each other cope with the experienced stress (i.e., dyadic coping). Although theoretical assumptions, such as the systemic-transactional model of stress and dyadic coping (Bodenmann, 2005), suggest that the association between coping behavior and relationship satisfaction is determined by cultural influences (e.g., gender roles), findings from a recent meta-analysis show that this association is stable across nations and gender (Falconier et al., 2015). Despite the significant findings, the samples used in the meta-analysis relied almost exclusively on couples living in Western cultures (Falconier et al., 2015), which leaves unanswered the question of how culture may affect the association between dyadic coping and relationship satisfaction. The goal of the current paper was to examine the cultural influence on dyadic coping processes based on 7973 married individuals across 35 nations.
Politolinguistik
(2016)
Language is the most important instrument of political action. With the help of language, political actions are prepared, legitimized and negotiated through argument. With Politolinguistik (political linguistics), a linguistic subdiscipline has established itself that provides a set of methods for analysing the subject area of language and politics in all its facets.
The present volume in the series ‘Literaturhinweise zur Linguistik’ offers, alongside an introduction to the subject area of language and politics, a structured select bibliography that takes into account the latest state of research in political linguistics. In addition to the classic topics such as word semantics, political text types and discourse analysis, a particular focus is on the didactic preparation of political-linguistic content and its methodological implementation in the context of media education.
In what follows, we evaluate data on ten lexemes each from several hundred speakers in order to describe empirically, from a regional and social perspective, how syllable boundaries that allow hiatus are realized in the standard of usage (Gebrauchsstandard). We relate the results to the emergence of New High German under the substrate influence of Low German and show what this implies for the hypothesis of a typological change of German from a syllable language to a word language.
A polarity-sensitive item (PSI), as traditionally defined, is an expression that is restricted to either an affirmative or negative context. PSIs like ‘lift a finger’ and ‘all the time in the world’ sub-serve discourse routines like understatement and emphasis. Lexical–semantic classes are increasingly invoked in descriptions of the properties of PSIs. Here, we use English corpus data and the tools of Frame Semantics (Fillmore, 1982, 1985) to explore Israel’s (2011) observation that the semantic role of a PSI determines how the expression fits into a contextually constructed scalar model. We focus on a class of exceptions implied by Israel’s model: cases in which a given PSI displays two countervailing patterns of polarity sensitivity, with attendant differences in scalar entailments. We offer a set of case studies of polarity-sensitive expressions – including verbs of attraction and aversion like ‘can live without’, monetary units like ‘a red cent’, comparative adjectives and time-span adverbials – that demonstrate that the interpretation of a given PSI in a given polar context is based on multiple factors. These factors include the speaker’s perspective on and affective stance towards the described event, available inferences about causality and, perhaps most critically, particulars of the predication, including the verb or adjective’s frame membership, the presence or absence of an ability modal like can, the grammatical construction used and the range of contingencies evoked by the utterance.
In this paper, we describe preliminary results from an ongoing experiment wherein we classify two large unstructured text corpora (a web corpus and a newspaper corpus) by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with metadata required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross-validation. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
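The second step described above (supervised learning on previously induced topic distributions, evaluated by k-fold cross-validation) can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier, the function names, and the toy topic vectors are all assumptions chosen to keep the example dependency-free, and the topic-modeling step itself is assumed to have been done already.

```python
import random
from collections import defaultdict

def centroid(vectors):
    """Component-wise mean of equally long feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid has the smallest squared distance to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def cross_val_accuracy(features, labels, k=10, seed=0):
    """Mean accuracy of a nearest-centroid classifier over k folds.

    Assumes len(features) >= k so that no fold is empty.
    """
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        # Train only on documents outside the held-out fold.
        by_label = defaultdict(list)
        for i in indices:
            if i not in held_out_set:
                by_label[labels[i]].append(features[i])
        centroids = {lab: centroid(vecs) for lab, vecs in by_label.items()}
        correct = sum(
            nearest_centroid_predict(centroids, features[i]) == labels[i]
            for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / k

# Hypothetical per-document topic distributions (3 topics) with gold domains.
features, labels = [], []
for j in range(20):
    d = (j % 5) * 0.01
    features.append([0.7 - d, 0.2 + d, 0.1]); labels.append("politics")
    features.append([0.1, 0.2 + d, 0.7 - d]); labels.append("sports")

print(cross_val_accuracy(features, labels, k=10))
```

The toy data are deliberately well separated, so the printed accuracy says nothing about the roughly 70% reported in the abstract; the sketch only shows how topic proportions can serve as features and how fold-wise accuracy is averaged.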
Research on the historical development of German in Luxembourg, especially in the 19th century, is a desideratum and should always take the multilingual context into account. The present contribution stems from the project "Standardization in Diversity. The case of German in Luxembourg (1795-1920)" and considers the subject from two perspectives. On the one hand, an extensive corpus of bilingual public notices of the city of Luxembourg is analysed with regard to the language system, using two phenomena as examples. It can be observed that, in general, variation decreases, a convergence towards the German of the German Empire takes place, and indications of multilingualism disappear. On the other hand, the dimensions relevant to the development of status, namely language use, language policy and language ideologies, are examined on the basis of minutes of parliamentary debates. Here, a preference for Francophonie is more likely to be observed. In addition, switches between German, French and Luxembourgish are not uncommon, so that an increase in multilingual practices can be noted.
Image macros, also known as memes, are popular internet phenomena that, in the course of the comprehensive multimodalization of media communication, are distributed and commented on as entertainment on Facebook. This contribution examines these multimodal communicative artefacts, which consist of a combination of image and text, from the perspective of genre analysis and conversation analysis, since image macros appear to exhibit consolidated patterns both in their formal and semantic design and in their interactive reception in the form of comments and replies. In this media-based interaction, various communicative patterns have emerged both at the structural level of the interaction sequences and within individual interaction units analysed at the sequence-external and sequence-internal levels. Social processes such as face-work and identity construction influence the interactive negotiation of the communicative artefact.
Interaktionsarchitektur und Sozialtopografie. Basiskonzepte einer interaktionistischen Raumanalyse
(2016)
Interaction is hardly conceivable without a concrete reference to space and place. In institutional communication this spatial attachment is expressed particularly clearly, because characteristic spaces have become differentiated here in which communication has found its social home: church services in the church, teaching in the classroom, exhibitions in the museum, or the production of radio programmes in the recording studio. This connection between interaction and built space is at the centre of the edited volume: How is interaction made possible and expectable through and with architecture (Interaktionsarchitektur, interaction architecture)? How do participants, in their use of architecture, express their everyday knowledge about social spaces (Sozialtopographie, social topography)? How do these resources feed into the establishment of a particular, concrete interaction space? These questions open up a promising interdisciplinary field of research, which is explored in empirical, theoretical and methodological terms: with case analyses of the spaces mentioned, with contributions on theory and methodology, and with interdisciplinary expert commentaries.
Konstanz, Wandel, Variation. Sprachkontakt in der Welt europäischer National- und Standardsprachen
(2016)
Verbformen
(2016)
It is widely assumed that there is a natural, prelinguistic conceptual domain of time whose linguistic organization is universally structured via metaphoric mapping from the lexicon and grammar of space and motion. We challenge this assumption on the basis of our research on the Amondawa (Tupi Kawahib) language and culture of Amazonia. Using both observational data and structured field linguistic tasks, we show that linguistic space-time mapping at the constructional level is not a feature of the Amondawa language, and is not employed by Amondawa speakers (when speaking Amondawa). Amondawa does not recruit its extensive inventory of terms and constructions for spatial motion and location to express temporal relations. Amondawa also lacks a numerically based calendric system. To account for these data, and in opposition to a Universal Space-Time Mapping Hypothesis, we propose a Mediated Mapping Hypothesis, which accords causal importance to the numerical and artefact-based construction of time-based (as opposed to event-based) time interval systems.
When the substandard German varieties brought by Aussiedler (ethnic German resettlers) of the immigration generation from German language islands of the former Soviet Union come into contact with the standard language and the regional varieties spoken within Germany, changes of a specific kind arise that do not occur among native dialect speakers in the German-speaking area during convergence resulting from standard/dialect variation. Speakers who come from a language island activate their variation patterns in the course of their stay in Germany on the basis of their prior dialectal knowledge of German and extend their repertoire into the standard and, in part, the regional range of German. The present publication is devoted to this process and its consequences.
Sentiment analysis has so far focused on the detection of explicit opinions. However, of late implicit opinions have received broader attention, the key idea being that the evaluation of an event type by a speaker depends on how the participants in the event are valued and how the event itself affects the participants. We present an annotation scheme for adding relevant information, couched in terms of so-called effect functors, to German lexical items. Our scheme synthesizes and extends previous proposals. We report on an inter-annotator agreement study. We also present results of a crowdsourcing experiment to test the utility of some known and some new functors for opinion inference where, unlike in previous work, subjects are asked to reason from event evaluation to participant evaluation.
Converting and Representing Social Media Corpora into TEI: Schema and best practices from CLARIN-D
(2016)
The paper presents results from a curation project within CLARIN-D, in which an existing 1M word corpus of German chat communication has been integrated into the DEREKO and DWDS corpus infrastructures of the CLARIN-D centres at the Institute for the German Language (IDS, Mannheim) and at the Berlin-Brandenburg Academy of Sciences (BBAW, Berlin). The focus is on the solutions developed for converting and representing the corpus in a TEI format.
TripleA is a workshop series founded by linguists from the University of Tübingen and the University of Potsdam. Its aim is to provide a forum for semanticists doing fieldwork on understudied languages, and its focus is on languages from Africa, Asia, Australia and Oceania. The second TripleA workshop was held at the University of Potsdam, June 3-5, 2015.
Sentence and construction types generally have more than one pragmatic function. Impersonal deontic declaratives such as ‘it is necessary to X’ assert the existence of an obligation or necessity without tying it to any particular individual. This family of statements can accomplish a range of functions, including getting another person to act, explaining or justifying the speaker’s own behavior as he or she undertakes to do something, or even justifying the speaker’s behavior while simultaneously getting another person to help. How is an impersonal deontic declarative fit for these different functions? And how do people know which function it has in a given context? The authors address these questions using video recordings of everyday interactions among speakers of Italian and Polish.
Von Gastarbeitern zu Transmigranten. Sprachliche Variation in deutsch-türkischen Lebenswelten
(2016)
The English language has taken advantage of the Digital Revolution to establish itself as the global language; however, only 28.6% of Internet users speak English as their native language. Machine Translation (MT) is a powerful technology that can bridge this gap. In development since the mid-20th century, MT has become available to every Internet user in the last decade, due to free online MT services. This paper aims to discuss the implications that these tools may have for the privacy of their users and how they are addressed by EU data protection law. It examines the data-flows in respect of the initial processing (both from the perspective of the user and the MT service provider) and potential further processing that may be undertaken by the MT service provider.
The Shared Task on Source and Target Extraction from Political Speeches (STEPS) first ran in 2014 and is organized by the Interest Group on German Sentiment Analysis (IGGSA). This volume presents the proceedings of the workshop of the second iteration of the shared task. The workshop was held at KONVENS 2016 at Ruhr-University Bochum on September 22, 2016.
We present an empirical study addressing the question of whether, and to what extent, lexicographic writing aids improve text revision results. German university students were asked to optimise two German texts using (1) no aids at all, (2) highlighted problems, or (3) highlighted problems accompanied by lexicographic resources that could be used to solve the specific problems. We found that participants from the third group corrected the largest number of problems and introduced the fewest semantic distortions during revision. Also, they reached the highest overall score and were most efficient (as measured in points per time). The second group, with highlighted problems, lies between the two other groups in almost every measure we analysed. We discuss these findings in the scope of intelligent writing environments, the effectiveness of writing aids in practical usage situations and teaching dictionary skills.
We present an approach to the new task of opinion holder and target extraction on opinion compounds. Opinion compounds (e.g. user rating or victim support) are noun compounds whose head is an opinion noun. We do not only examine features known to be effective for noun compound analysis, such as paraphrases and semantic classes of heads and modifiers, but also propose novel features tailored to this new task. Among them, we examine paraphrases that jointly consider holders and targets, a verb detour in which noun heads are replaced by related verbs, a global head constraint allowing inferencing between different compounds, and the categorization of the sentiment view that the head conveys.
We examine different features and classifiers for the categorization of opinion words into actor and speaker view. To our knowledge, this is the first comprehensive work to address sentiment views on the word level taking into consideration opinion verbs, nouns and adjectives. We consider many high-level features requiring only a small amount of labeled training data. A detailed feature analysis produces linguistic insights into the nature of sentiment views. We also examine how far global constraints between different opinion words help to increase classification performance. Finally, we show that our (prior) word-level annotation correlates with contextual sentiment views.
Overview of the IGGSA 2016 Shared Task on Source and Target Extraction from Political Speeches
(2016)
We present the second iteration of IGGSA’s Shared Task on Sentiment Analysis for German. It resumes the STEPS task of IGGSA’s 2014 evaluation campaign: Source, Subjective Expression and Target Extraction from Political Speeches. As before, the task is focused on fine-grained sentiment analysis, extracting sources and targets with their associated subjective expressions from a corpus of speeches given in the Swiss parliament. The second iteration exhibits some differences, however; mainly the use of an adjudicated gold standard and the availability of training data. The shared task had 2 participants submitting 7 runs for the full task and 3 runs for each of the subtasks. We evaluate the results and compare them to the baselines provided by the previous iteration. The shared task homepage can be found at http://iggsasharedtask2016.github.io/.
There is increasing interest in recognizing opinion inferences in addition to expressions of explicit sentiment. While different formalisms for representing inferential mechanisms are being developed and lexical resources are being built alongside, we here address the need for deeper investigation of the robustness of various aspects of opinion inference, performing crowdsourcing experiments with constructed stimuli as well as a corpus study of attested data.
This guide presents the Datenbank für Gesprochenes Deutsch (DGD, Database for Spoken German) and in particular the Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK, Research and Teaching Corpus of Spoken German) as instruments for conversation-analytic work. After a brief introductory overview, the resources and tools for systematic corpus- and database-driven searches and analyses are presented and illustrated step by step, using the example of "sprich" as a discourse marker or reformulation indicator.
This guide presents the Datenbank für Gesprochenes Deutsch (DGD, Database for Spoken German) and in particular the Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK, Research and Teaching Corpus of Spoken German) as instruments for conversation-analytic work. After a brief introductory overview, the resources and tools for systematic corpus- and database-driven searches and analyses are presented and illustrated step by step, using the example of metapragmatic modalizations with the adverbs "sozusagen" and "gewissermaßen" and with the formula "in Anführungszeichen/-strichen".
This guide presents the Datenbank für Gesprochenes Deutsch (DGD, Database for Spoken German) and in particular the Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK, Research and Teaching Corpus of Spoken German) as instruments for conversation-analytic work. After a brief introductory overview, the resources and tools for systematic corpus- and database-driven searches and analyses are presented and illustrated step by step, using four different examples.
We introduce our pipeline to integrate CMC and SM corpora into the CLARIN-D corpus infrastructure. The pipeline was developed by transforming an existing CMC corpus, the Dortmund Chat Corpus, into a resource conforming to current technical and legal standards. We describe how the resource has been prepared and restructured in terms of TEI encoding, linguistic annotations, and anonymisation. The output is a CLARIN-conformant resource integrated in the CLARIN-D research infrastructure.
The paper reports the results of the curation project ChatCorpus2CLARIN. The goal of the project was to develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). The paper presents an overview of the resources and practices developed in the project, describes the added value of the resource after its integration and discusses, as an outlook, to what extent these practices can be considered best practices which may be useful for the annotation and representation of other CMC and social media corpora.
The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though there are still many open issues related to building and annotating corpora of that type, there already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped.
Wiktionary is increasingly gaining influence in a wide variety of linguistic fields such as NLP and lexicography, and has great potential to become a serious competitor for publisher-based and academic dictionaries. However, little is known about the "crowd" that is responsible for the content of Wiktionary. In this article, we want to shed some light on selected questions concerning large-scale cooperative work in online dictionaries. To this end, we use quantitative analyses of the complete edit history files of the English and German Wiktionary language editions. Concerning the distribution of revisions over users, we show that — compared to the overall user base — only very few authors are responsible for the vast majority of revisions in the two Wiktionary editions. In the next step, we compare this distribution to the distribution of revisions over all the articles. The articles are subsequently analysed in terms of rigour and diversity, typical revision patterns through time, and novelty (the time since the last revision). We close with an examination of the relationship between corpus frequencies of headwords in articles, the number of article visits, and the number of revisions made to articles.
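The concentration of revisions among a few authors can be illustrated with a minimal sketch. It assumes the edit history has already been reduced to one author name per revision. The function name `top_share`, the parameter `top_fraction` and the toy data are hypothetical and are not taken from the article.

```python
from collections import Counter


def top_share(revision_authors, top_fraction=0.01):
    """Return the share of all revisions made by the most active users.

    revision_authors: iterable with one author identifier per revision.
    top_fraction: fraction of the user base counted as "most active".
    """
    # Revisions per user, most active first.
    counts = sorted(Counter(revision_authors).values(), reverse=True)
    if not counts:
        return 0.0
    # Always include at least one user in the top group.
    k = max(1, round(len(counts) * top_fraction))
    return sum(counts[:k]) / sum(counts)


# Toy data (invented): four users, one of whom made 90 of 100 revisions.
authors = ["a"] * 90 + ["b"] * 5 + ["c"] * 3 + ["d"] * 2
share = top_share(authors, top_fraction=0.25)
```

With this toy input, `share` is the fraction of revisions made by the single most active user. A real analysis would read the author identifiers from the edit history dumps instead.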
KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DEREKO for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: An architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub.
There have been several previous attempts to annotate utterances of verbal feedback in English with communicative functions. Here, we suggest an annotation scheme for verbal and non-verbal feedback utterances in French including the categories base, attitude, previous and visual. The data comprises conversations, maptasks and negotiations from which we extracted ca. 13,000 candidate feedback utterances and gestures. 12 students were recruited for the annotation campaign of ca. 9,500 instances. Each instance was annotated by between 2 and 7 raters. The evaluation of the annotation agreement resulted in an average best-pair kappa of 0.6. While the base category with the values acknowledgement, evaluation, answer, elicit and other achieves good agreement, this is not the case for the other main categories. The data sets, which also include automatic extractions of lexical, positional and acoustic features, are freely available and will further be used for machine learning classification experiments to analyse the form-function relationship of feedback.
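As a reminder of what a kappa value expresses, the following is a minimal sketch of Cohen's kappa for two raters who labelled the same instances. It is not the authors' evaluation procedure: how the "best-pair" average was computed is not specified here, and the function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same instances."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (invented), using two values of the base category.
rater_1 = ["acknowledgement", "acknowledgement", "evaluation", "evaluation"]
rater_2 = ["acknowledgement", "acknowledgement", "evaluation", "acknowledgement"]
kappa = cohen_kappa(rater_1, rater_2)
```

Here the raters agree on three of four items (observed agreement 0.75) while chance agreement is 0.5, so `kappa` is 0.5.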
The present paper describes Corpus Query Lingua Franca (ISO CQLF), a specification designed at ISO Technical Committee 37 Subcommittee 4 “Language resource management” for the purpose of facilitating the comparison of properties of corpus query languages. We give an overview of the motivation for this endeavour and present its aims and its general architecture. CQLF is intended as a multi-part specification; here, we concentrate on the basic metamodel that provides a frame into which the other parts fit.
In order to develop its full potential, global communication needs linguistic support systems such as Machine Translation (MT). In the past decade, free online MT tools have become available to the general public, and the quality of their output is increasing. However, the use of such tools may entail various legal implications, especially as far as processing of personal data is concerned. This is even more evident if we take into account that their business model is largely based on providing translation in exchange for data, which can subsequently be used to improve the translation model, but also for commercial purposes. The purpose of this paper is to examine how free online MT tools fit in the European data protection framework, harmonised by the EU Data Protection Directive. The perspectives of both the user and the MT service provider are taken into account.
This contribution presents the background, design and results of a study of users of three oral corpus platforms in Germany. Roughly 5,000 registered users of the Database for Spoken German (DGD), the GeWiss corpus and the corpora of the Hamburg Centre for Language Corpora (HZSK) were asked to participate in a user survey. This quantitative approach was complemented by qualitative interviews with selected users. We briefly introduce the corpus resources involved in the study in section 2. Section 3 describes the methods employed in the user studies. Section 4 summarizes results of the studies focusing on selected key topics. Section 5 attempts a generalization of these results to larger contexts.
In this paper, we present a GOLD standard of part-of-speech tagged transcripts of spoken German. The GOLD standard data consists of four annotation layers – transcription (modified orthography), normalization (standard orthography), lemmatization and POS tags – all of which have undergone careful manual quality control. It comes with guidelines for the manual POS annotation of transcripts of German spoken data and an extended version of the STTS (Stuttgart Tübingen Tagset) which accounts for phenomena typically found in spontaneous spoken German. The GOLD standard was developed on the basis of the Research and Teaching Corpus of Spoken German, FOLK, and is, to our knowledge, the first such dataset based on a wide variety of spontaneous and authentic interaction types. It can be used as a basis for further development of language technology and corpus linguistic applications for German spoken language.
This paper presents C-WEP, the Collection of Writing Errors by Professional Writers of German. It currently consists of 245 sentences with grammatical errors. All sentences are taken from published texts. All authors are professional writers with high skill levels with respect to German, the genres, and the topics. The purpose of this collection is to provide seeds for more sophisticated writing support tools, as only a very small proportion of those errors can be detected by state-of-the-art checkers. C-WEP is annotated on various levels and freely available.
The CELEX database is one of the standard lexical resources for German. It yields a wealth of data especially for phonological and morphological applications. The morphological part comprises deep-structure morphological analyses of German. However, as it was developed in the nineties, both encoding and spelling are outdated. About one fifth of over 50,000 datasets contain umlauts and signs such as ß. Changes to a modern version cannot be obtained by simple substitution. In this paper, we briefly describe the original content and form of the orthographic and morphological database for German in CELEX. Then we present our work on modernizing the linguistic data. Lemmas and morphological analyses are transferred to a modern standard of encoding by first merging orthographic and morphological information of the lemmas and their entries and then performing a second substitution for the morphs within their morphological analyses. Changes to modern German spelling are performed by substitution rules according to orthographical standards. We show an example of the use of the data for the disambiguation of morphological structures. The discussion describes prospects of future work on this or similar lexicons. The Perl script is publicly available on our website.
German research on collocation(s) focuses on many different aspects. A comprehensive documentation would be impossible in this short report. Accepting that we cannot do justice to all the contributions to this area, we just pick out some influential cornerstones. This selection does not claim to be representative or balanced; rather, it is meant to form the backbone of the story we want to tell: our ‘German’ view of the still ongoing evolution of a notion of ‘collocation’. Although our own work concerns the theoretical background of and the empirical rationale for collocations, lexicography occupies a large space. Some of the recent publications (Wahrig 2008, Häcki Buhofer et al. 2014) represent a turn to the empirical legitimation for the selection of typical expressions. Nevertheless, linking the empirical evidence to the needs of an abstract lexicographic description (or a didactic format) is still an open issue.
Constructing a Corpus
(2016)
Researchers in Natural Language Processing rely on the availability of data and software, ideally under open licenses, but little is done to actively encourage it. In fact, the current copyright framework grants authors exclusive rights to copy their works, make them available to the public and make derivative works (such as annotated language corpora). Moreover, in the EU, databases are protected against unauthorized extraction and re-utilization of their contents. Therefore, proper public licensing plays a crucial role in providing access to research data. A public license is a license that grants certain rights not to one particular user, but to the general public (everybody). Our article presents a tool that we developed to assist the user in the licensing process. As software and data should be licensed under different licenses, the tool is composed of two separate parts: Data and Software. The underlying logic as well as elements of the graphic interface are presented below.
On the basis of a law text corpus which consists of judicial decisions and jurisprudential papers on so-called assisted suicide from 1977 to 2011, agonal centres are determined within the paradigm of corpus-based pragma-semiotic text analysis. Agonal centres are defined as action-guiding concepts that are in conflict with each other concerning the general acceptance of event interpretations, options for actions, claims of validity, contextual knowledge and values. These action-guiding concepts are derived with the help of quantitative and qualitative methods. Discourse linguistic interpretations are thus rendered more objective with the help of semi-automatic methods; furthermore, specific discourse features of the discourse and approaches to interpretation can be derived from (un)expected linguistic significances of occurrence, distribution, frequency etc. at the linguistic surface. Finally, these agonal centres specific to the language of law are compared to agonal centres which are determined on the basis of a media corpus on the same issue. This provides a comparative insight into the constitution of a seemingly identical fact in everyday and special language, which demonstrates the sociopolitical relevance of analysing the constitution of reality as instructed by language.