The best of both worlds: Multi-billion word “dynamic” corpora

Nearly all very large corpora of English are “static”, which allows a wide range of data, such as collocates, to be pre-processed once. The challenge comes with large “dynamic” corpora, which are updated regularly and for which pre-processing is much more difficult. This paper provides an overview of the NOW corpus (News on the Web), which is currently 8.2 billion words in size and grows by about 170 million words each month. We discuss the architecture of NOW and provide many examples that show how data from NOW can (uniquely) be extracted to look at a wide range of ongoing changes in English.

Metadata
Author:	Mark Davies
URN:	urn:nbn:de:bsz:mh39-90234
DOI:	https://doi.org/10.14618/ids-pub-9023
Parent Title (English):	Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Publisher:	Leibniz-Institut für Deutsche Sprache
Place of publication:	Mannheim
Editor:	Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, Caroline Iliadi
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2019
Date of Publication (online):	2019/07/04
Publication state:	Published version
Review state:	Peer review
Tag:	corpus linguistics; corpus processing; web corpora
GND Keyword:	Korpus <Linguistik>
First Page:	23
Last Page:	28
DDC classes:	400 Language / 400 Language, Linguistics
Open Access?:	yes
Leibniz-Classification:	Language, Linguistics
Linguistics-Classification:	Corpus linguistics
Conferences, Workshops:	Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Licence:	Creative Commons - CC BY - Attribution 4.0 International