Deduplication in large web corpora
- This paper addresses several questions about the deduplication process in large-scale web-crawled corpora. It introduces an experiment based on eight corpora from the Aranea family and presents the first results.
| Author: | Vladimír Benko |
|---|---|
| URN: | urn:nbn:de:bsz:mh39-90221 |
| DOI: | https://doi.org/10.14618/ids-pub-9022 |
| Parent Title (English): | Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019 |
| Publisher: | Leibniz-Institut für Deutsche Sprache |
| Place of publication: | Mannheim |
| Editor: | Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, Caroline Iliadi |
| Document Type: | Conference Proceeding |
| Language: | English |
| Year of first Publication: | 2019 |
| Date of Publication (online): | 2019/07/04 |
| Publication state: | Published version |
| Review state: | Peer-reviewed |
| Tag: | corpus linguistics; corpus processing; deduplication; web corpora |
| GND Keyword: | Korpus <Linguistik> |
| First Page: | 17 |
| Last Page: | 21 |
| DDC classes: | 400 Language / 400 Language, Linguistics |
| Open Access?: | yes |
| Leibniz-Classification: | Language, Linguistics |
| Linguistics-Classification: | Corpus linguistics |
| Conferences, Workshops: | Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019 |
| Licence: | Creative Commons - CC BY - Attribution 4.0 International |


