SdeWaC – A corpus of parsable sentences from the web
For a number of languages, web crawling allows researchers to collect huge text samples for building corpora. However, only part of the material found on the internet is useful for Natural Language Processing: parsers, for example, typically cannot handle lists and tables, or very short or very long sentences. Methods exist (cf. e.g. [3]) for cleaning downloaded data before adding it to a corpus collection, but even after they are applied, not all of the remaining text may be suitable for a given research purpose. This paper describes the methods used to prepare deWaC, a freely available German web corpus of the WaCky project, for automatic processing up to the parsing level. It then discusses ways in which the resulting corpus, called SdeWaC, has been used since its release.
Author: | Gertrud Faaß, Kerstin Eckart |
---|---|
URN: | urn:nbn:de:bsz:mh39-116253 |
DOI: | https://doi.org/10.1007/978-3-642-40722-2_6 |
ISBN: | 978-3-642-40722-2 |
ISSN: | 1611-3349 |
Parent Title (English): | Language Processing and Knowledge in the Web: 25th International Conference, GSCL 2013, Darmstadt |
Series (Serial Number): | Lecture Notes in Artificial Intelligence (8105) |
Publisher: | Springer |
Place of publication: | Berlin/Heidelberg |
Editor: | Iryna Gurevych, Chris Biemann, Torsten Zesch |
Document Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2013 |
Date of Publication (online): | 2023/04/26 |
Reviewstate: | Peer-Review |
GND Keyword: | Computational linguistics; Data collection; Corpus <linguistics>; Parser; Language processing; World Wide Web |
First Page: | 61 |
Last Page: | 68 |
Note: | For copyright reasons, this contribution is not freely accessible online. |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | no |
Linguistics-Classification: | Corpus linguistics |
Licence (German): | Protected by copyright |