A Survey on Hate Speech Detection using Natural Language Processing (2017)

This paper presents a survey on hate speech detection. Given the steadily growing body of social media content, the amount of online hate speech is also increasing. Due to the massive scale of the web, methods that automatically detect hate speech are required. Our survey describes key areas that have been explored to automatically recognize these types of utterances using natural language processing. We also discuss limits of those approaches.

Accelerating corpus search using multiple cores (2017)

Rábara, Radoslav ; Rychlý, Pavel ; Herman, Ondřej ; Jakubíček, Miloš

The Manatee corpus management system on which the Sketch Engine is built is efficient, but unable to harness the power of today’s multiprocessor machines. We describe a new, compatible implementation of Manatee which we develop in the Go language and report on the performance gains that we obtained.

Are web corpora inferior? The Case of Czech and Slovak (2017)

Benko, Vladimír

Our paper describes an experiment aimed at assessing lexical coverage in web corpora in comparison with traditional ones for two closely related Slavic languages from the lexicographers’ perspective. The preliminary results show that web corpora should not be considered “inferior”, but rather “different”.

Creating CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes - The National Corpus of Contemporary Welsh) (2017)

Knight, Dawn ; Fitzpatrick, Tess ; Morris, Steve ; Evas, Jeremy ; Rayson, Paul ; Spasić, Irena ; Stonelake, Mark ; Thomas, Enlli Môn ; Neale, Steven ; Needs, Jennifer ; Piao, Scott ; Rees, Mair ; Watkins, Gareth ; Anthony, Laurence ; Cobb, Thomas Michael ; Deuchar, Margaret ; Donnelly, Kevin ; McCarthy, Michael ; Scannell, Kevin

CorCenCC is an interdisciplinary and multi-institutional project that is creating a large-scale, open-source corpus of contemporary Welsh. CorCenCC will be the first ever large-scale corpus to represent spoken, written and electronically mediated Welsh (compiling an initial data set of 10 million Welsh words), with a functional design informed, from the outset, by representatives of all anticipated academic and community user groups.

Data point selection for genre-aware parsing (2017)

Rehbein, Ines ; Bildhauer, Felix

In the NLP literature, adapting a parser to new text with properties different from the training data is commonly referred to as domain adaptation. In practice, however, the differences between texts from different sources often reflect a mixture of domain and genre properties, and it is by no means clear what impact each of those has on statistical parsing. In this paper, we investigate how differences between articles in a newspaper corpus relate to the concepts of genre and domain and how they influence parsing performance of a transition-based dependency parser. We do this by applying various similarity measures for data point selection and testing their adequacy for creating genre-aware parsing models.

From ICE to ICC: The new International Comparable Corpus (2017)

Kirk, John ; Čermáková, Anna

This paper outlines the broad research context and rationale for a new international comparable corpus (ICC). The ICC is to be largely modelled on the text categories and their quantities of the International Corpus of English, with only a few changes. The corpus will begin with nine European languages, but others may join in due course. The paper reports on those and other agreements made at the inaugural planning meeting in Prague on 22-23 June 2017. It also sets out the project’s goals for its first two years.

Implementation of a Latin grammar in grammatical framework (2017)

Lange, Herbert

In this paper we present work in developing a computerized grammar for the Latin language. It demonstrates the principles and challenges in developing a grammar for a natural language in a modern grammar formalism. The grammar presented here provides a useful resource for natural language processing applications in different fields. It can be easily adopted for language learning and use in language technology for Cultural Heritage like translation applications or to support post-correction of document digitization.

Intra-connecting an exemplary literary corpus with semantic web technologies for exploratory literary studies (2017)

Dittrich, Andreas

Many (modernist) works of literature can be understood by their associativeness, be it constructed or “free”. This network-like character of (modernist) literature has often been addressed by terms like “free association”, “connotation”, “context” or “intertext”. This paper proposes an experimental and exemplary approach to intraconnect a literary corpus of the Austrian writer Ilse Aichinger with semantic web technologies to enable interactive explorations of word associations.

Keeping Properties with the Data CL-MetaHeaders - An Open Specification (2017)

Vidler, John ; Wattam, Stephen

Corpus researchers, along with those in many other scientific disciplines, are under continual pressure to show accountability and reproducibility in their work. This is unsurprisingly difficult when the researcher is faced with a wide array of methods and tools through which to do their work; simply tracking the operations done can be problematic, especially when toolchains are often configured by the developers but left largely as a black box to the user. Here we present a scheme for encoding this ‘meta data’ inside the corpus files themselves in a structured data format, along with a proof-of-concept tool to record the operations performed on a file.

Language Independent Named Entity Recognition using Distant Supervision (2017)

Dembowski, Julia ; Wiegand, Michael ; Klakow, Dietrich

While good results have been achieved for named entity recognition (NER) in supervised settings, it remains a problem that for low resource languages and less studied domains little or no labelled data is available. As NER is a crucial preprocessing step for many natural language processing tasks, finding a way to overcome this deficit in data remains of great interest. We propose a distant supervision approach to NER that is both language and domain independent where we automatically generate labelled training data using gazetteers that we previously extracted from Wikipedia. We test our approach on English, German and Estonian data sets and contribute further by introducing several successful methods to reduce the noise in the generated training data. The tested models beat baseline systems and our results show that distant supervision can be a promising approach for NER when no labelled data is available. For the English model we also show that the distant supervision model is better at generalizing within the same domain of news texts by comparing it against a supervised model on a different test set.

Organizing corpora at the Stanford Literary Lab. Balancing simplicity and flexibility in metadata management (2017)

McClure, David ; Algee-Hewitt, Mark ; Douris, Steele ; Fredner, Erik ; Walser, Hannah

This article describes a series of ongoing efforts at the Stanford Literary Lab to manage a large collection of literary corpora (~40 billion words). This work is marked by a tension between two competing requirements – the corpora need to be merged together into higher-order collections that can be analyzed as units; but, at the same time, it’s also necessary to preserve granular access to the original metadata and relational organization of each individual corpus. We describe a set of data management practices that try to accommodate both of these requirements – Apache Spark is used to index data as Parquet tables on an HPC cluster at Stanford. Crucially, the approach distinguishes between what we call “canonical” and “combined” corpora, a variation on the well-established notion of a “virtual corpus” (Kupietz et al., 2014; Jakubíček et al., 2014; van Uytvanck, 2010).

Practice Report. A blended learning approach to teaching NLP for a DH public (2017)

Faaß, Gertrud ; Heid, Ulrich

This paper reports on current practice in a staged approach to the introduction of NLP principles and techniques for students of information science (IIM) and of international communication and translation (ICT) as part of their curricula. As most of these students are unfamiliar with computer science or, in the case of IIM students, linguistics, we see them as comparable with students of the humanities. We follow a blended learning strategy with lectures, online materials, tutorials, and screencasts. In the first two terms, we focus on linguistics and its formalisation; NLP tools and applications are then introduced from the third term on. The lectures are combined with tutorials and, since the summer term 2017, with a set of screencasts.

Removing spam from web corpora through supervised learning using FastText (2017)

Suchomel, Vít

Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer-generated text from web corpora. The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier was able to recognize 71% of web spam documents similar to the training set but lacked both precision and recall when applied to short texts from another data set.

Semantic author name disambiguation with word embeddings (2017)

Müller, Mark-Christoph

We present a supervised machine learning system for author name disambiguation (AND) which tackles semantic similarity between publication titles by means of word embeddings. Word embeddings are integrated as external components, which keeps the model small and efficient, while allowing for easy extensibility and domain adaptation. Initial experiments show that word embeddings can improve the Recall and F score of the binary classification sub-task of AND. Results for the clustering sub-task are less clear, but also promising and overall show the feasibility of the approach.

Taking typography to experimental testing: On the influence of serifs, fonts and justification on eye movements in text reading (2017)

Jarosch, Julian ; Schlesewsky, Matthias ; Füssel, Stephan ; Kretzschmar, Franziska

The two sides of prediction error in reading: on the relationship between eye movements and the N400 in sentence processing (2017)

Kretzschmar, Franziska ; Alday, Phillip M.

Typography and individual experience in digital reading: Do readers’ eye movements adapt to poor justification? (2017)

Jarosch, Julian ; Schlesewsky, Matthias ; Füssel, Stephan ; Kretzschmar, Franziska

Web corpora - the best possible solution for tracking rare phenomena in underresourced languages: clitics in Bosnian, Croatian and Serbian (2017)

Jurkiewicz-Rohrbacher, Edyta ; Kolaković, Zrinka ; Hansen, Björn

Complex linguistic phenomena, such as Clitic Climbing (CC) in Bosnian, Croatian and Serbian (BCS), are often described intuitively, only from the perspective of the main tendency. In this paper, we argue that web corpora currently offer the best source of empirical material for studying CC in BCS. They thus allow the most accurate description of this phenomenon, as less frequent constructions can be tracked only in big, well-annotated data sources. We compare the properties of web corpora for BCS with traditional sources and give examples of studies on CC based on web corpora. Furthermore, we discuss problems related to web corpora and suggest some improvements for the future.

When readers pay attention to the left: A concurrent eyetracking-fMRI investigation on the neuronal correlates of regressive eye movements during reading (2017)

Weiß, Anna Fiona ; Kretzschmar, Franziska ; Nagels, Arne ; Schlesewsky, Matthias ; Bornkessel-Schlesewsky, Ina ; Tune, Sarah
