Refine
Document Type
- Conference Proceeding (2) (remove)
Language
- English (2)
Has Fulltext
- yes (2)
Keywords
- Computerlinguistik (1)
- Dateiformat (1)
- Datenmanagement (1)
- Korpus <Linguistik> (1)
- Langzeitarchivierung (1)
- Metadaten (1)
- Nutzungsrecht (1)
- Repository <Informatik> (1)
- XML (1)
- format migration (1)
Publicationstate
- Veröffentlichungsversion (2) (remove)
Reviewstate
- Peer-Review (2) (remove)
Publisher
This paper addresses long-term archival for large corpora. Three aspects specific to language resources are focused, namely (1) the removal of resources for legal reasons, (2) versioning of (unchanged) objects in constantly growing resources, especially where objects can be part of multiple releases but also part of different collections, and (3) the conversion of data to new formats for digital preservation. It is motivated why language resources may have to be changed, and why formats may need to be converted. As a solution, the use of an intermediate proxy object called a signpost is suggested. The approach will be exemplified with respect to the corpora of the Leibniz Institute for the German Language in Mannheim, namely the German Reference Corpus (DeReKo) and the Archive for Spoken German (AGD).
XML has been designed for creating structured documents, but the information that is encoded in these structures are, by definition, out of scope for XML. Additional sources, normally not easily interpretable by computers, such as documentation are needed to determine the intention of specific tags in a tag-set. The Component Metadata Infrastructure (CMDI) takes a rather pragmatic approach to foster interoperability between XML instances in the domain of metadata descriptions for language resources. This paper gives an overview of this approach.