OPUS 4 | Search

135 search hits

1 to 10

Linguistic Landscape und Fremdsprachendidaktik. Perspektiven für die Sprach-, Kultur- und Literaturdidaktik (Linguistic landscape and foreign language didactics. Perspectives for language, cultural and literary didactics). Edited by Camilla Badstübner-Kizik & Věra Janíková. Berlin: Peter Lang, 2018, 359 pp. (Posener Beiträge zur Angewandten Linguistik 10). ISBN: 9783631773543. EUR 56,10/GBP 46,00/USD 67,95. [Review] (2019)

Marten, Heiko F.

Der Mann, wo ich gesehen habe - das relative wo (The man that I saw: the relative particle wo) (2019)

Mösch, Matthias

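Er hängte seinen Mantel an den Haken, und dort hing er den ganzen Tag: schwache und starke Flexion und Bedeutungsunterschiede (He hung his coat on the hook, and there it hung all day: weak and strong inflection and differences in meaning) (from: Grammatik in Fragen und Antworten) (2019)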

Kubczak, Jacqueline

Ceremony in honour of the 90th birthday of Prof. Dr. Dr. h.c. mult. Ulrich Engel (2019)

Taborek, Janusz

Laudation for Christian Fandrych (2019)

Nübling, Damaris

Thilo Weber. 2017. Die TUN-Periphrase im Niederdeutschen. Funktionale und formale Aspekte (Studien zur deutschen Grammatik 94). Tübingen: Stauffenburg. 418 pp. [Review] (2019)

Berg, Kristian

Complex Lexical Units. Compounds and Multi-Word Expressions (2019)

Both compounds and multi-word expressions are complex lexical units, made up of at least two constituents. The most basic difference is that the former are morphological objects and the latter result from syntactic processes. However, the exact demarcation between compounds and multi-word expressions differs greatly from language to language and is often a matter of debate in and across languages. Similarly debated is whether and how these two different kinds of units complement or compete with each other. The volume presents an overview of compounds and multi-word expressions in a variety of European languages. Central questions that are discussed for each language concern the formal distinction between compounds and multi-word expressions, their formation and their status in lexicon and grammar. The volume contains chapters on German, English, Dutch, French, Italian, Spanish, Greek, Russian, Polish, Finnish, and Hungarian as well as a contrastive overview with a focus on German. It brings together insights from word-formation theory, phraseology and theory of grammar and aims to contribute to the understanding of the lexicon, both from a language-specific and cross-linguistic perspective.

Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures (2019)

Ortiz Suárez, Pedro Javier ; Sagot, Benoît ; Romary, Laurent

Common Crawl is a very large, heterogeneous multilingual corpus composed of documents crawled from the internet. It exceeds 20TB of data and is distributed as a set of more than 50 thousand plain text files, each containing many documents written in a wide variety of languages. Even though each document has an associated metadata block, this metadata lacks any information about the language in which the document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.

Modelling large parallel corpora. The Zurich Parallel Corpus Collection (2019)

Graën, Johannes ; Kew, Tannon ; Shaitarova, Anastassia ; Volk, Martin

Text corpora come in many different shapes and sizes and carry heterogeneous annotations, depending on their purpose and design. The true benefit of corpora is rooted in their annotation, and the method by which this data is encoded is an important factor in their interoperability. We have accumulated a large collection of multilingual and parallel corpora and encoded it in a unified format which is compatible with a broad range of NLP tools and corpus linguistic applications. In this paper, we present our corpus collection and describe a data model and the extensions to the popular CoNLL-U format that enable us to encode it.

The Vast and the Focused: On the need for domain-focused web corpora (2019)

Barbaresi, Adrien

As the Web ought to be considered as a series of sources rather than as a source in itself, a problem facing corpus construction resides in meta-information and categorization. In addition, we need focused data to shed light on particular subfields of the digital public sphere. Blogs are relevant to that end, especially if the resulting web texts can be extracted along with metadata and made available in coherent and clearly describable collections.
