TY  - CPAPER
U1  - Conference publication
A1  - Faaß, Gertrud
A1  - Eckart, Kerstin
ED  - Gurevych, Iryna
ED  - Biemann, Chris
ED  - Zesch, Torsten
T1  - SdeWaC – A corpus of parsable sentences from the web
T2  - Language Processing and Knowledge in the Web: 25th International Conference, GSCL 2013, Darmstadt
N2  - For a number of languages, web crawling allows researchers to collect huge text samples to build corpora. However, only part of the material found on the internet is useful for Natural Language Processing, as e.g. parsers typically cannot handle lists and tables, or very short or very long sentences. There are methods (cf. e.g. [3]) for cleaning the downloaded data before adding it to a corpus collection – but even when these are applied, not all remaining textual material might be suitable for certain research requirements. This paper describes methods utilized to prepare deWaC, a freely available German web corpus of the WaCky project, for automatic processing up to the parsing level. It then discusses ways in which this corpus, called SdeWaC, has been used since its release.
T3  - Lecture Notes in Artificial Intelligence - 8105
KW  - Corpus
KW  - Language processing
KW  - Computational linguistics
KW  - Data collection
KW  - Parser
KW  - World Wide Web
Y1  - 2013
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-116253
SN  - 1611-3349
SS  - 1611-3349
SN  - 978-3-642-40722-2
SB  - 978-3-642-40722-2
U6  - https://doi.org/10.1007/978-3-642-40722-2_6
DO  - https://doi.org/10.1007/978-3-642-40722-2_6
N1  - For copyright reasons, this contribution is not freely accessible online.
SP  - 61
EP  - 68
PB  - Springer
CY  - Berlin/Heidelberg
ER  -