Conversion into the archival format I5
The IDS repository aims at the long-term archival of linguistic resources and tools in the field of German studies. This chapter introduces I5, the archival format for textual data in the repository. I5 is a TEI customisation that features a tripartite corpus macrostructure and renames and restructures some TEI elements to fit the corpus holdings compiled at the IDS since 1964. After a brief introduction to I5, the contribution shows how conversion to I5 is achieved for different input formats, using features of concrete corpora in the archive as examples. The two cases covered are the KED corpus of simple German, which was delivered in VRT, a CWB-based XML format, and Wikipedia corpora, which were delivered in Wiki markup; the two were converted to I5 using different strategies. We further show how I5 data are ingested into the archive and what happens when someone wishes to extract such data from the archive for their own subsequent use. Finally, we mention some alternative archival formats used in other archives, such as TEI proper or DTA-Bf.
| Author: | Harald Lüngen, Ines Pisetta |
|---|---|
| URN: | urn:nbn:de:bsz:mh39-135908 |
| DOI: | https://doi.org/10.1515/9783112208212-010 |
| ISBN: | 978-3-11-220821-2 |
| ISSN: | 2751-1286 |
| Parent Title (English): | Harmonizing language data. Standards for linguistic resources |
| Series (Serial Number): | Digital Linguistics (4) |
| Publisher: | de Gruyter |
| Place of publication: | Berlin/Boston |
| Editor: | Piotr Bański, Ulrich Heid, Laura Herzberg |
| Document Type: | Part of a Book |
| Language: | English |
| Year of first Publication: | 2025 |
| Date of Publication (online): | 2025/12/09 |
| Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
| Publication state: | Published version |
| Review state: | (Publisher's) editorial review |
| Tag: | Component Metadata Infrastructure (CMDI); I5; InvenioRDM; Submission Information Package (SIP); TEI; XML; linguistic resources; repository; textual data; verticalised text format (VRT) |
| GND Keyword: | Computerunterstützte Kommunikation; Daten; Germanistik; Korpus <Linguistik>; Langzeitarchivierung; Metadaten; Repository <Informatik>; Text Encoding Initiative; XML |
| First Page: | 229 |
| Last Page: | 249 |
| DDC classes: | 400 Language / 400 Language, Linguistics |
| Open Access?: | yes |
| Linguistics-Classification: | Computational linguistics |
| Linguistics-Classification: | Corpus linguistics |
| Program areas: | Digital Linguistics |
| Licence (English): | Creative Commons - Attribution 4.0 International |


