The Vast and the Focused: On the need for domain-focused web corpora

Because the Web ought to be considered a series of sources rather than a single source in itself, a central problem in corpus construction lies in meta-information and categorization. In addition, focused data are needed to shed light on particular subfields of the digital public sphere. Blogs are relevant to that end, especially if the resulting web texts can be extracted along with their metadata and made available in coherent and clearly describable collections.

Metadata
Author:	Adrien Barbaresi
URN:	urn:nbn:de:bsz:mh39-90257
DOI:	https://doi.org/10.14618/ids-pub-9025
Parent Title (English):	Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Publisher:	Leibniz-Institut für Deutsche Sprache
Place of publication:	Mannheim
Editor:	Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, Caroline Iliadi
Document Type:	Conference Proceeding
Language:	English
Year of first Publication:	2019
Date of Publication (online):	2019/07/04
Publication state:	Published version
Review state:	Peer review
Tag:	corpus linguistics; corpus processing; web corpora
GND Keyword:	Korpus <Linguistik>
First Page:	29
Last Page:	32
DDC classes:	400 Language / 400 Language, Linguistics
Open Access?:	yes
Leibniz-Classification:	Language, Linguistics
Linguistics-Classification:	Corpus linguistics
Conferences, Workshops:	Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019
Licence:	Creative Commons - CC BY - Attribution 4.0 International
