Standardising language data through the conversion pipeline TEIWorLD
Converting data into a standard format is a crucial step in many research workflows. Standardisation enables data exchange, reuse, and analysis, all of which are essential for advancing knowledge across fields. In this publication, we describe the conversion pipeline TEIWorLD (TEI Workflow for Language Data), which transforms written and spoken language data into standardised formats: I5/TEI P5 XML for written data and ISO/TEI Transcriptions of Spoken Language for spoken data. The pipeline leverages existing tools to convert specific source formats into these standards; written data undergoes an additional transformation step into the archival I5 (short for IDS TEI P5) format used at the Leibniz Institute for the German Language (IDS). We also present two use cases that apply TEIWorLD to a corpus comprising more than one format, demonstrating how standardisation supports language data management and enables researchers to analyse and share their data efficiently.
| Author: | Jennifer Ecker |
|---|---|
| URN: | urn:nbn:de:bsz:mh39-136778 |
| DOI: | https://doi.org/10.21248/idsopen.15.2026.54 |
| ISBN: | 978-3-948831-78-3 |
| ISSN: | 2749-9855 |
| Series (Serial Number): | IDSopen: Online-only Publikationen des Leibniz-Instituts für Deutsche Sprache (15) |
| Publisher: | IDS-Verlag |
| Place of publication: | Mannheim |
| Editor: | Norman Fiedler, Katrin Hein-Antonioli, Siegwalt Lindenfelser, Beata Trawiński |
| Document Type: | Book |
| Language: | German |
| Year of first Publication: | 2026 |
| Date of Publication (online): | 2026/03/03 |
| Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
| Publication state: | Published version |
| Review state: | (Publisher's) editorial review |
| Tag: | Data conversion; IDS TEI P5; Keywords conversion; TEIWorLD |
| GND Keyword: | Data conversion; Spoken language; Leibniz-Institut für Deutsche Sprache (IDS); Pipeline processing; Written language |
| Page Number: | 17 |
| DDC classes: | 400 Language / 400 Language, Linguistics |
| Open Access?: | yes |
| Program areas: | Digital Linguistics |
| Licence (German): | Creative Commons - Attribution-ShareAlike 3.0 Germany |


