OPUS 4 | Search

41 search hits

1 to 10

Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus (2021)

Abadji, Julien ; Ortiz Suárez, Pedro Javier ; Romary, Laurent ; Sagot, Benoît

Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these large raw corpora are either available only for English or not available to the general public due to copyright issues. Nevertheless, there are some examples of freely available multilingual corpora for training Deep Learning NLP models, such as the OSCAR and Paracrawl corpora. However, they have quality issues, especially for low-resource languages. Moreover, recreating or updating these corpora is very complex. In this work, we try to reproduce and improve the goclassy pipeline used to create the OSCAR corpus. We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data. Also, unlike OSCAR, the metadata information is at the document level. We release our pipeline under an open source license and publish the corpus under a research-only license.

Preface (2019)

Alfter, David ; Volodina, Elena ; Borin, Lars ; Pilán, Ildikó ; Lange, Herbert

Preface (2020)

Alfter, David ; Volodina, Elena ; Pilán, Ildikó ; Lange, Herbert ; Borin, Lars

The Naproche Project. Controlled Natural Language Proof Checking of Mathematical Texts (2010)

Cramer, Marcos ; Fisseni, Bernhard ; Koepke, Peter ; Kühlwein, Daniel ; Schröder, Bernhard ; Veldman, Jip

This paper discusses the semi-formal language of mathematics and presents the Naproche CNL, a controlled natural language for mathematical authoring. Proof Representation Structures, an adaptation of Discourse Representation Structures, are used to represent the semantics of texts written in the Naproche CNL. We discuss how the Naproche CNL can be used in formal mathematics, and present our prototypical Naproche system, a computer program for parsing texts in the Naproche CNL and checking the proofs in them for logical correctness.

ProofML - eine Annotationssprache für natürlichsprachige mathematische Beweise (2005)

Fisseni, Bernhard

Mathematical texts are written in natural language (possibly with portions in formula notation), not in a formal language. ProofML is a file format that allows such texts to be annotated so that a logical structure is assigned to their natural-language structure.

Are there compensatory effects in natural speech? (1999)

Geumann, Anja ; Kroos, Christian ; Tillmann, Hans G.

This work exploited coarticulation and loud speech as natural sources of perturbation in order to determine whether articulatory covariation (motor equivalent behavior) can be observed in speech that is not artificially perturbed. Articulatory analyses of jaw and tongue movement in the production of alveolar consonants by German speakers were performed. The sibilant /s/ shows virtually no articulatory covariation under the influence of natural perturbations, whereas other alveolar consonants show more obvious compensatory behavior. Our conclusion is that an effect of natural sources of perturbation is noticeable, but sounds are affected to different degrees.

Implementation of a Latin grammar in grammatical framework (2017)

Lange, Herbert

In this paper we present work on developing a computerized grammar for the Latin language. It demonstrates the principles and challenges in developing a grammar for a natural language in a modern grammar formalism. The grammar presented here provides a useful resource for natural language processing applications in different fields. It can easily be adapted for language learning and for use in language technology for Cultural Heritage, such as translation applications or support for post-correction of document digitization.

An API for discourse-level access to XML-encoded corpora (2002)

Müller, Mark-Christoph ; Strube, Michael

We describe a simple and efficient Java object model and application programming interface (API) for (possibly multi-modal) annotated natural language corpora. Corpora are represented as elements like Sentences, Turns, Utterances, Words, Gestures and Markables. The API allows linguists to access corpora in terms of these discourse-level elements, i.e. at a conceptual level they are familiar with, with the flexibility offered by a general purpose programming language. It is also a contribution to corpus standardization efforts because it is based on a straightforward and easily extensible data model which can serve as a target for conversion of different corpus formats.

A New Resource for German Causal Language (2020)

Rehbein, Ines ; Ruppenhofer, Josef

We present a new resource for German causal language, with annotations in context for verbs, nouns and adpositions. Our dataset includes 4,390 annotated instances for more than 150 different triggers. The annotation scheme distinguishes three different types of causal events (CONSEQUENCE, MOTIVATION, PURPOSE). We also provide annotations for semantic roles, i.e. of the cause and effect for the causal event as well as the actor and affected party, if present. In the paper, we present inter-annotator agreement scores for our dataset and discuss problems for annotating causal language. Finally, we present experiments where we frame causal annotation as a sequence labelling problem and report baseline results for the prediction of causal arguments and for predicting different types of causation.

Relation Extraction for the Food Domain without Labeled Training Data – Is Distant Supervision the Best Solution? (2014)

Reiplinger, Melanie ; Wiegand, Michael ; Klakow, Dietrich

We examine the task of relation extraction in the food domain by employing distant supervision. We focus on the extraction of two relations that are not only relevant to product recommendation in the food domain, but that also have significance in other domains, such as the fashion or electronics domain. In order to select suitable training data, we investigate various degrees of freedom. We consider three processing levels: argument level, sentence level and feature level. As external resources, we employ manually created surface patterns and semantic types on all these levels. We also explore to what extent rule-based methods employing the same information are competitive.
