Refine
Year of publication
Document Type
- Conference Proceeding (317) (remove)
Has Fulltext
- yes (317)
Is part of the Bibliography
- no (317) (remove)
Keywords
- Korpus <Linguistik> (112)
- Deutsch (66)
- Computerlinguistik (57)
- Annotation (35)
- Automatische Sprachanalyse (31)
- Metadaten (21)
- Natürliche Sprache (19)
- Datenmanagement (18)
- Information Extraction (18)
- Fremdsprachenlernen (17)
Publicationstate
- Veröffentlichungsversion (317) (remove)
Reviewstate
- Peer-Review (152)
- (Verlags)-Lektorat (110)
- Review-Status-unbekannt (5)
- Peer-review (2)
- Verlags-Lektorat (1)
Publisher
- Association for Computational Linguistics (33)
- European Language Resources Association (ELRA) (26)
- European Language Resources Association (18)
- Institut für Deutsche Sprache (16)
- International Speech Communication Association (9)
- Leibniz-Institut für Deutsche Sprache (7)
- Extreme Markup Languages Conference (6)
- CSLI Publications (5)
- LiU Electronic Press (5)
- Nisaba (5)
This paper presents a survey on hate speech detection. Given the steadily growing body of social media content, the amount of online hate speech is also increasing. Due to the massive scale of the web, methods that automatically detect hate speech are required. Our survey describes key areas that have been explored to automatically recognize these types of utterances using natural language processing. We also discuss limits of those approaches.
This paper presents a survey on the role of negation in sentiment analysis. Negation is a very common linguistic construction that affects polarity and, therefore, needs to be taken into consideration in sentiment analysis.
We will present various computational approaches modeling negation in sentiment analysis. We will, in particular, focus on aspects such as level of representation used for sentiment analysis, negation word detection and scope of negation. We will also discuss limits and challenges of negation modeling on that task.
The Manatee corpus management system on which the Sketch Engine is built is efficient, but unable to harness the power of today’s multiprocessor machines. We describe a new, compatible implementation of Manatee which we develop in the Go language and report on the performance gains that we obtained.
This paper presents Release 2.0 of the SALSA corpus, a German resource for lexical semantics. The new corpus release provides new annotations for German nouns, complementing the existing annotations of German verbs in Release 1.0. The corpus now includes around 24,000 sentences with more than 36,000 annotated instances. It was designed with an eye towards NLP applications such as semantic role labeling but will also be a useful resource for linguistic studies in lexical semantics.
We describe a simple and efficient Java object model and application programming interface (API) for (possibly multi-modal) annotated natural language corpora. Corpora are represented as elements like Sentences, Turns, Utterances, Words, Gestures and Markables. The API allows linguists to access corpora in terms of these discourse-level elements, i.e. at a conceptual level they are familiar with, with the flexibility offered by a general purpose programming language. It is also a contribution to corpus standardization efforts because it is based on a straightforward and easily extensible data model which can serve as a target for conversion of different corpus formats.
We investigate whether prototypicality or prominence of semantic roles can account for role-related effects in sentence interpretation. We present two acceptability-rating experiments testing three different constructions: active, personal passive and DO-clefts involving the same type of transitive verbs that differ with respect to the agentive role features they select. Our results reveal that there is no cross-constructional advantage for prototypical roles (e.g., agents), hence disconfirming a central tenet of role prototypicality. Rather, acceptability clines depend on the construction under investigation, thereby highlighting different role features. This finding is in line with one core assumption of the prominence account stating that role features are flexibly highlighted depending on the discourse function of the respective construction.
This paper aims to address these problems by dealing with theoretical and methodological questions concerning the national effects of the Bologna Process and the role national factors play in determining the impact of these effects. Altogether the purpose of the paper is to serve as a starting point for future research – both as a guide for systematic and comparative empirical work on higher education, but also for further theoretical and methodological reasoning concerning research on (higher) education policy. As higher education research so far particularly lacks an approach allowing for a competitive and systematic falsification of theoretical arguments by clearly indicating testable and specific hypothesis as well as variables behind the research design (Goedegebuure/Vught 1996) we propose to fall back on neighbouring disciplines, namely social science to improve and enhance the analysis (Slaughter 2001: 398; Altbach 2002: 154; Teichler 1996a: 433, 2005: 448). Several strands of research have to be considered – namely literature on Europeanization as well as insights and approaches of studies dealing with cross-national policy convergence. Taking into account the non-obligatory and mainly intergovernmental character of the Bologna Process the main focus of the paper is on factors related to the effects of transnational communication. The inherent goal is to extend the research agenda on higher education (McLendon 2003: 184ff) and to leave behind the restriction of to analyse only a few cases by striving for a research design that allows for systematic testing and sufficient explanations of cross-national policy convergence at the interface between the Bologna Process and domestic factors.
Annotating Discourse Relations in Spoken Language: A Comparison of the PDTB and CCR Frameworks
(2016)
In discourse relation annotation, there is currently a variety of different frameworks being used, and most of them have been developed and employed mostly on written data. This raises a number of questions regarding interoperability of discourse relation annotation schemes, as well as regarding differences in discourse annotation for written vs. spoken domains. In this paper, we describe ouron annotating two spoken domains from the SPICE Ireland corpus (telephone conversations and broadcast interviews) according todifferent discourse annotation schemes, PDTB 3.0 and CCR. We show that annotations in the two schemes can largely be mappedone another, and discuss differences in operationalisations of discourse relation schemes which present a challenge to automatic mapping. We also observe systematic differences in the prevalence of implicit discourse relations in spoken data compared to written texts,find that there are also differences in the types of causal relations between the domains. Finally, we find that PDTB 3.0 addresses many shortcomings of PDTB 2.0 wrt. the annotation of spoken discourse, and suggest further extensions. The new corpus has roughly theof the CoNLL 2015 Shared Task test set, and we hence hope that it will be a valuable resource for the evaluation of automatic discourse relation labellers.
Feedback utterances are among the most frequent in dialogue. Feedback is also a crucial aspect of all linguistic theories that take social interaction involving language into account. However, determining communicative functions is a notoriously difficult task both for human interpreters and systems. It involves an interpretative process that integrates various sources of information. Existing work on communicative function classification comes from either dialogue act tagging where it is generally coarse grained concerning the feed- back phenomena or it is token-based and does not address the variety of forms that feed- back utterances can take. This paper introduces an annotation framework, the dataset and the related annotation campaign (involving 7 raters to annotate nearly 6000 utterances). We present its evaluation not merely in terms of inter-rater agreement but also in terms of usability of the resulting reference dataset both from a linguistic research perspective and from a more applicative viewpoint.
In this paper, we investigate the practical applicability of Co-Training for the task of building a classifier for reference resolution. We are concerned with the question if Co-Training can significantly reduce the amount of manual labeling work and still produce a classifier with an acceptable performance.
This work exploited coarticulation and loud speech as natural sources of perturbation in order to determine whether articulatory covariation (motor equivalent behavior) can be observed inspeech that is not artificially perturbed. Articulatory analyses of jaw and tongue movement in the production of alveolar consonants by German speakers were performed. The sibilant /s/ shows virtually no articulatory covariation under the influence of natural perturbations, whereas other alveolar consonants show more obvious compensatory behavior. Our conclusion is that an effect of natural sources of perturbation is noticable, but sounds are affected to different degrees.
Our paper describes an experiment aimed to assessment of lexical coverage in web corpora in comparison with the traditional ones for two closely related Slavic languages from the lexicographers’ perspective. The preliminary results show that web corpora should not be considered ― inferior, but rather ― different.
Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.
The effect of manipulation of a speaker’s voice as well as exposure to a native speaker’s utterance was investigated regarding the pronunciation of stops by German learners of French. Three subject groups, a Control (CG), a Manipulation (MG), and a Native Speaker (NG) Group, were recorded on two subsequent days. The MG was presented with a manipulation of their voice on the second day and the NG listened to a native French speaker, while the CG did not receive any feedback. Results show that speakers of the MG and NG were able to extract useful information from the respective feedback and successfully adapted to it. Participants were able to reduce their voice onset time values, although speakers of the NG reduced it to a greater extent.