OPUS 4 | CMLC-5 + BigNLP / 5th Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing

Are web corpora inferior? The Case of Czech and Slovak (2017)

Our paper describes an experiment that assesses lexical coverage in web corpora compared with traditional corpora for two closely related Slavic languages, from the lexicographers' perspective. The preliminary results show that web corpora should not be considered "inferior", but rather "different".

Removing spam from web corpora through supervised learning using FastText (2017)

Suchomel, Vít

Unlike traditional text corpora collected from trustworthy sources, the content of web-based corpora has to be filtered. This study briefly discusses the impact of web spam on corpus usability and emphasizes the importance of removing computer-generated text from web corpora. The paper also presents a keyword comparison of an unfiltered corpus with the same collection of texts cleaned by a supervised classifier trained using FastText. The classifier recognized 71% of web spam documents similar to the training set, but lacked both precision and recall when applied to short texts from another data set.

Web corpora - the best possible solution for tracking rare phenomena in underresourced languages: clitics in Bosnian, Croatian and Serbian (2017)

Jurkiewicz-Rohrbacher, Edyta ; Kolaković, Zrinka ; Hansen, Björn

Complex linguistic phenomena, such as Clitic Climbing (CC) in Bosnian, Croatian and Serbian (BCS), are often described intuitively, only from the perspective of the main tendency. In this paper, we argue that web corpora currently offer the best source of empirical material for studying CC in BCS. They thus allow the most accurate description of this phenomenon, as less frequent constructions can be tracked only in big, well-annotated data sources. We compare the properties of web corpora for BCS with traditional sources and give examples of studies on CC based on web corpora. Furthermore, we discuss problems related to web corpora and suggest some improvements for the future.
