We present recognizers for four very different types of speech, thought and writing representation (STWR) for German texts. The implementation is based on deep learning with two different customized contextual embeddings, namely FLAIR embeddings and BERT embeddings. This paper gives an evaluation of our recognizers with a particular focus on the differences in performance we observed between those two embeddings. FLAIR performed best for direct STWR (F1=0.85), BERT for indirect (F1=0.76) and free indirect (F1=0.59) STWR. For reported STWR, the comparison was inconclusive, but BERT gave the best average results and best individual model (F1=0.60). Our best recognizers, our customized language embeddings and most of our test and training data are freely available and can be found via www.redewiedergabe.de or at github.com/redewiedergabe.
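The F1 scores reported above combine precision and recall on the recognizer's output. A minimal sketch of how token-level F1 for one STWR category might be computed (the tag sequences and the function are hypothetical illustrations, not the authors' evaluation code):

```python
# Sketch: token-level precision/recall/F1 for one STWR category.
# Gold and predicted tag sequences are invented examples.

def f1_for_category(gold, pred, category):
    """Compute precision, recall and F1 for a single tag category."""
    tp = sum(1 for g, p in zip(gold, pred) if g == category and p == category)
    fp = sum(1 for g, p in zip(gold, pred) if g != category and p == category)
    fn = sum(1 for g, p in zip(gold, pred) if g == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["direct", "direct", "O", "indirect", "direct"]
pred = ["direct", "O", "O", "indirect", "direct"]
p, r, f1 = f1_for_category(gold, pred, "direct")
# Here tp=2, fp=0, fn=1, so precision=1.0, recall≈0.67, F1=0.8.
```

Note that span-level evaluation (exact or partial match of whole STWR passages) would give different numbers than this token-level view; the paper's exact evaluation protocol is not specified in the abstract.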
Corpus REDEWIEDERGABE
(2020)
This article presents the corpus REDEWIEDERGABE, a German-language historical corpus with detailed annotations for speech, thought and writing representation (ST&WR). With approximately 490,000 tokens, it is the largest resource of its kind. It can be used to answer literary and linguistic research questions and serve as training material for machine learning. This paper describes the composition of the corpus and the annotation structure, discusses some methodological decisions and gives basic statistics about the forms of ST&WR found in this corpus.
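The "basic statistics about the forms of ST&WR" mentioned above could be derived from token-level annotations. A sketch assuming a simple two-column TSV layout (token, ST&WR label) — this layout is an assumption for illustration, not the corpus's actual release format:

```python
# Sketch: count ST&WR annotation forms from token-level labels.
# The two-column TSV layout (token <TAB> stwr_label) is an assumed
# format for illustration, not the actual REDEWIEDERGABE file format.
from collections import Counter
import io

sample_tsv = """\
Er\tO
sagte\tO
:\tO
\u00bb\tdirect
Komm\tdirect
!\tdirect
\u00ab\tdirect
Sie\tO
dachte\tO
nach\tO
"""

def count_forms(fileobj):
    """Tally non-'O' ST&WR labels, one token per line."""
    counts = Counter()
    for line in fileobj:
        line = line.rstrip("\n")
        if not line:
            continue
        token, label = line.split("\t")
        if label != "O":
            counts[label] += 1
    return counts

counts = count_forms(io.StringIO(sample_tsv))
# counts now maps each ST&WR form to its token frequency.
```

Run over the full ~490,000-token corpus, the same tally would yield the per-form frequencies the paper reports.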