
Improving Sentence Boundary Detection for Spoken Language Transcripts

  • This paper presents experiments on sentence boundary detection in transcripts of spoken dialogues. Segmenting spoken language into sentence-like units is a challenging task due to disfluencies, ungrammatical or fragmented structures, and the lack of punctuation. In addition, one of the main bottlenecks for many NLP applications for spoken language is the small size of the training data, as transcribing and annotating spoken language is far more time-consuming and labour-intensive than processing written language. We therefore investigate the benefits of data expansion and transfer learning and test different ML architectures for this task. Our results show that data expansion is not straightforward and that even data from the same domain does not always improve results. They also highlight the importance of modelling, i.e. of finding the best architecture and data representation for the task at hand. For the detection of boundaries in spoken language transcripts, we achieve a substantial improvement when framing the boundary detection problem as a sentence pair classification task rather than as a sequence tagging task.
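
The two problem framings named in the abstract differ mainly in how training instances are built from a transcript. The following is a minimal sketch of that difference, not the authors' implementation: the function names, the "B"/"O" label scheme, the context window of three tokens, and the example utterance are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Set, Tuple


def to_tagging_instance(tokens: List[str], boundaries: Set[int]) -> List[Tuple[str, str]]:
    """Sequence tagging view: one label per token.

    A token is labelled "B" if a sentence-like unit ends right after it,
    otherwise "O". (Label names are hypothetical.)
    """
    return [(tok, "B" if i in boundaries else "O") for i, tok in enumerate(tokens)]


def to_pair_instances(
    tokens: List[str], boundaries: Set[int], window: int = 3
) -> List[Tuple[str, str, int]]:
    """Sentence pair classification view: one instance per candidate position.

    Each instance is (left context, right context, label), where the label
    is 1 if a boundary falls between the two contexts and 0 otherwise.
    The window size is an assumed parameter.
    """
    instances = []
    for i in range(len(tokens) - 1):
        left = " ".join(tokens[max(0, i - window + 1): i + 1])
        right = " ".join(tokens[i + 1: i + 1 + window])
        instances.append((left, right, int(i in boundaries)))
    return instances


# Invented example utterance without punctuation; units end after
# token 0 ("yeah") and token 3 ("so").
tokens = "yeah i think so but we could try".split()
boundaries = {0, 3}

tags = to_tagging_instance(tokens, boundaries)
pairs = to_pair_instances(tokens, boundaries)
```

In the tagging view a model labels the whole token sequence in one pass, while in the pair view each candidate position becomes a separate binary decision over two text segments, which is the input shape that pretrained sentence pair classifiers typically expect.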


Author:Ines Rehbein, Josef Ruppenhofer, Thomas Schmidt
Parent Title (English):Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), May 11-16, 2020, Palais du Pharo, Marseille, France
Publisher:European Language Resources Association
Place of publication:Paris
Editor:Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Document Type:Conference Proceeding
Year of first Publication:2020
Date of Publication (online):2020/05/19
Tag:corpus creation; sentence boundary detection; spoken language transcripts
GND Keyword:Automatic speech recognition; Spoken language; Corpus <Linguistics>; Machine learning; Sentence end
First Page:7102
Last Page:7111
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Licence (English):Creative Commons - Attribution-NonCommercial 4.0 International