Building a historical corpus for Classical Portuguese: some technological aspects
This paper describes the restructuring process of a large corpus of historical documents and the system architecture that is used for accessing it. The initial challenge of this process was to get the most out of existing material, normalizing the legacy markup and harvesting the inherent information using widely available standards. This resulted in a conceptual and technical restructuring of the formerly existing corpus. The development of the standardized markup and techniques allowed the inclusion of important new materials, such as original 16th and 17th century prints and manuscripts, and enlarged the potential user groups. On the technological side, we worked from the premise that open standards are the best way of making sure that the resources will be accessible even after years in an archive. This is a welcome result in view of the additional consequence of the remodeled corpus concept: it serves as a repository for important historical documents, some of which had been preserved for 500 years in paper format. This very rich material can from now on be handled freely for linguistic research goals.
Author: | Maria Clara Paixão de Sousa, Thorsten Trippel |
---|---|
URN: | urn:nbn:de:bsz:mh39-126407 |
URL: | http://www.lrec-conf.org/proceedings/lrec2006/pdf/378_pdf.pdf |
URL: | https://aclanthology.org/L06-1222/ |
Parent Title (English): | Proceedings of the fifth international conference on language resources and evaluation (LREC’06). 22 May - 28 May 2006, Genoa, Italy |
Publisher: | European Language Resources Association (ELRA) |
Place of publication: | Paris |
Editor: | Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, Daniel Tapias |
Document Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2006 |
Date of Publication (online): | 2024/04/18 |
Publishing Institution: | Leibniz-Institut für Deutsche Sprache (IDS) |
Publication state: | Published version |
Review state: | Peer-reviewed |
Tag: | Portuguese; historical corpus; historical documents; standard based archive; standardized markup; web based language resource access |
GND Keyword: | Annotation; Archivierung; Computerlinguistik; Korpus <Linguistik>; Metadaten; Portugiesisch; Sprachdaten |
First Page: | 1831 |
Last Page: | 1836 |
DDC classes: | 400 Language / 400 Language, Linguistics |
Open Access?: | yes |
Linguistics-Classification: | Computational linguistics |
Linguistics-Classification: | Corpus linguistics |
Licence (English): | Creative Commons - Attribution-NonCommercial-ShareAlike 3.0 Unported |