TY  - CHAP
U1  - Conference publication
A1  - Rehbein, Ines
A1  - Ruppenhofer, Josef
A1  - Schmidt, Thomas
ED  - Calzolari, Nicoletta
ED  - Béchet, Frédéric
ED  - Blache, Philippe
ED  - Choukri, Khalid
ED  - Cieri, Christopher
ED  - Declerck, Thierry
ED  - Goggi, Sara
ED  - Isahara, Hitoshi
ED  - Maegaard, Bente
ED  - Mariani, Joseph
ED  - Mazo, Hélène
ED  - Moreno, Asuncion
ED  - Odijk, Jan
ED  - Piperidis, Stelios
T1  - Improving Sentence Boundary Detection for Spoken Language Transcripts
T2  - Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), May 11-16, 2020, Palais du Pharo, Marseille, France
N2  - This paper presents experiments on sentence boundary detection in transcripts of spoken dialogues. Segmenting spoken language into sentence-like units is a challenging task, due to disfluencies, ungrammatical or fragmented structures and the lack of punctuation. In addition, one of the main bottlenecks for many NLP applications for spoken language is the small size of the training data, as the transcription and annotation of spoken language is far more time-consuming and labour-intensive than processing written language. We therefore investigate the benefits of data expansion and transfer learning and test different ML architectures for this task. Our results show that data expansion is not straightforward and even data from the same domain does not always improve results. They also highlight the importance of modelling, i.e. of finding the best architecture and data representation for the task at hand. For the detection of boundaries in spoken language transcripts, we achieve a substantial improvement when framing the boundary detection problem as a sentence pair classification task, as compared to a sequence tagging approach.
KW  - Automatic speech recognition
KW  - Spoken language
KW  - Corpus
KW  - End of sentence
KW  - spoken language transcripts
KW  - sentence boundary detection
KW  - corpus creation
KW  - Machine learning
Y1  - 2020
U6  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98382
UN  - https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98382
UR  - http://www.lrec-conf.org/proceedings/lrec2020/index.html#7102
SN  - 979-10-95546-34-4
SB  - 979-10-95546-34-4
SP  - 7102
EP  - 7111
PB  - European Language Resources Association
CY  - Paris
ER  -