Conversion into the archival format I5

  • The IDS repository aims at the long-term archiving of linguistic resources and tools in the field of German studies. This chapter introduces I5, the archival format for textual data in the repository. I5 is a TEI customisation that features a tripartite corpus macrostructure and renames and restructures some TEI elements to fit the corpus holdings compiled at the IDS since 1964. After a brief introduction to I5, the contribution shows how conversion to I5 is achieved for various input formats, using features of concrete corpora in the archive as examples. The two cases covered are the KED corpus of simple German, which was delivered in a CWB-based XML format called VRT, and Wikipedia corpora, which were delivered in Wiki markup; the two were converted to I5 using different strategies. We further show how I5 data are ingested into the archive and what happens when someone wishes to extract such data from the archive for their own subsequent use. Finally, we mention some alternative archival formats used in other archives, such as TEI proper or DTA-Bf.

Metadata
Author:Harald Lüngen, Ines Pisetta
URN:urn:nbn:de:bsz:mh39-135908
DOI:https://doi.org/10.1515/9783112208212-010
ISBN:978-3-11-220821-2
ISSN:2751-1286
Parent Title (English):Harmonizing language data. Standards for linguistic resources
Series (Serial Number):Digital Linguistics (4)
Publisher:de Gruyter
Place of publication:Berlin/Boston
Editor:Piotr Bański, Ulrich Heid, Laura Herzberg
Document Type:Part of a Book
Language:English
Year of first Publication:2025
Date of Publication (online):2025/12/09
Publishing Institution:Leibniz-Institut für Deutsche Sprache (IDS)
Publication state:Published version
Review state:(Publisher's) editorial review
Tag:Component Metadata Infrastructure (CMDI); I5; InvenioRDM; Submission Information Package (SIP); TEI; XML; linguistic resources; repository; textual data; verticalised text format (VRT)
GND Keyword:Computerunterstützte Kommunikation; Daten; Germanistik; Korpus <Linguistik>; Langzeitarchivierung; Metadaten; Repository <Informatik>; Text Encoding Initiative; XML
First Page:229
Last Page:249
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Linguistics-Classification:Computational linguistics
Linguistics-Classification:Corpus linguistics
Program areas:Digital Linguistics
Licence (English):Creative Commons - Attribution 4.0 International