Treebanking User-Generated Content: A Proposal for a Unified Representation in Universal Dependencies
The paper presents a discussion on the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.
| Author: | Manuela Sanguinetti, Cristina Bosco, Lauren Cassidy, Özlem Çetinoğlu, Alessandra Teresa Cignarella, Teresa Lynn, Ines Rehbein, Josef Ruppenhofer, Djamé Seddah, Amir Zeldes |
|---|---|
| URN: | urn:nbn:de:bsz:mh39-98686 |
| URL: | http://www.lrec-conf.org/proceedings/lrec2020/index.html#5240 |
| ISBN: | 979-10-95546-34-4 |
| Parent Title (English): | Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), May 11-16, 2020, Palais du Pharo, Marseille, France |
| Publisher: | European Language Resources Association |
| Place of publication: | Paris |
| Editor: | Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis |
| Document Type: | Conference Proceeding |
| Language: | English |
| Year of first Publication: | 2020 |
| Date of Publication (online): | 2020/06/01 |
| Publication state: | Secondary publication |
| Review state: | Peer-reviewed |
| Tag: | UGC; Universal Dependencies; Web; annotation guidelines; treebanks |
| GND Keyword: | Annotation; Natural language; Social media; Syntax tree (Strukturbaum); User-generated content |
| First Page: | 5240 |
| Last Page: | 5250 |
| DDC classes: | 400 Language / 400 Language, Linguistics |
| Open Access?: | yes |
| Leibniz-Classification: | Language, Linguistics |
| Linguistics-Classification: | Computational linguistics |
| Linguistics-Classification: | Corpus linguistics |
| Program areas: | P2: Spoken Corpora |
| Licence (English): | Creative Commons - Attribution-NonCommercial 4.0 International |


