
SdeWaC – A corpus of parsable sentences from the web

  • For a number of languages, web crawling allows researchers to collect huge text samples for building corpora. However, only part of the material found on the internet is useful for Natural Language Processing: parsers, for example, typically cannot handle lists and tables, or very short or very long sentences. There are methods (cf. e.g. [3]) for cleaning the downloaded data before adding it to a corpus collection, but even when these are applied, not all of the remaining textual material may be suitable for certain research requirements. This paper describes the methods used to prepare deWaC, a freely available German web corpus of the WaCky project, for automatic processing up to the parsing level. It then discusses ways in which the resulting corpus, called SdeWaC, has been used since its release.
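
The kind of filtering the abstract alludes to can be illustrated with a minimal sketch. It is not the procedure used to build SdeWaC: the function names, the token thresholds and the list/table heuristics below are all assumptions made for illustration, since the abstract gives no concrete values.

```python
import re

# Hypothetical thresholds; the abstract does not state concrete limits.
MIN_TOKENS = 3
MAX_TOKENS = 60

# Lines that start with a bullet or an enumeration marker such as "1." or "2)".
LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s")


def looks_parsable(sentence, min_tokens=MIN_TOKENS, max_tokens=MAX_TOKENS):
    """Heuristically decide whether a sentence is worth passing to a parser."""
    tokens = sentence.split()
    # Drop very short and very long sentences.
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    # Drop list-like lines.
    if LIST_MARKER.match(sentence):
        return False
    # Drop table-like lines (cell separators).
    if "|" in sentence or "\t" in sentence:
        return False
    return True


def filter_sentences(sentences):
    """Keep only the sentences that pass the heuristic check."""
    return [s for s in sentences if looks_parsable(s)]
```

A heuristic like this is deliberately coarse: it would, for instance, also reject a regular sentence that begins with a date such as "1. Januar", so a real pipeline would need more careful rules.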

Download full text files

  • Faass_Eckart_SdeWaC_A_corpus_2013.pdf (English; IDS-internal access only)

Metadata
Author:Gertrud Faaß, Kerstin Eckart
URN:urn:nbn:de:bsz:mh39-116253
DOI:https://doi.org/10.1007/978-3-642-40722-2_6
ISBN:978-3-642-40722-2
ISSN:1611-3349
Parent Title (English):Language Processing and Knowledge in the Web: 25th International Conference, GSCL 2013, Darmstadt
Series (Serial Number):Lecture Notes in Artificial Intelligence (8105)
Publisher:Springer
Place of publication:Berlin/Heidelberg
Editor:Iryna Gurevych, Chris Biemann, Torsten Zesch
Document Type:Conference Proceeding
Language:English
Year of first Publication:2013
Date of Publication (online):2023/04/26
Reviewstate:Peer-Review
GND Keyword:Computational linguistics; Data acquisition; Corpus <Linguistics>; Parser; Language processing; World Wide Web
First Page:61
Last Page:68
Note:
For copyright reasons, this contribution is not freely accessible online.
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:no
Linguistics-Classification:Corpus linguistics
Licence (German):Protected by copyright