
News from the International Comparable Corpus. First launch of ICC written

The International Comparable Corpus (ICC) (Kirk/Čermáková 2017; Čermáková et al. 2021) is an open initiative that aims to improve the empirical basis for contrastive linguistics by compiling comparable corpora for many languages, making them as freely available as possible, and providing tools with which they can easily be queried and analysed. In this contribution we present the first release of the written-language parts of the ICC, which includes corpora for Chinese, Czech, English, German, Irish (partly), and Norwegian. Each of the released corpora contains 400k words distributed over 14 different text categories according to the ICC specifications. Our poster covers the design basics of the ICC, its TEI encoding, a demonstration of using the ICC via different query tools, and an outlook on future plans.

Similar to the European Reference Corpus EuReCo (Kupietz et al. 2020), the ICC follows the approach of reusing existing linguistic resources wherever possible in order to cover as many languages as possible with realistic effort and in as short a time as possible. In contrast to EuReCo, however, comparable corpus pairs are not defined dynamically in the usage phase; instead, the composition of each corpus is fixed in the ICC design. The two approaches are thus complementary in this respect.

The design principles and composition of the ICC are based on those of the International Corpus of English (ICE) (Greenbaum (ed.) 1996), with the deviation that the ICC includes the additional text category blog post and excludes spoken legal texts (see Čermáková et al. 2021 for details). The ICC's fixed-design approach has the advantage that all single-language corpora have the same composition with respect to the selected text types, which guarantees that the selected broad spectrum of potential influencing variables for linguistic variation is always represented. The disadvantage, however, is that this can only be achieved for quite small corpora and that the generalisability of comparative findings based on the ICC corpora will often need to be checked against larger monolingual corpora or translation corpora (Čermáková/Ebeling/Oksefjell Ebeling forthcoming). Arguing that such issues with comparability and representativeness are inevitable in one way or another and need to be dealt with, our poster will discuss and exemplify the text selections in more detail.


Author: Marc Kupietz, Adrien Barbaresi, Anna Čermáková, Małgorzata Czachor, Nils Diewald, Jarle Ebeling, Rafał L. Górski, Eliza Margaretha, John Kirk, Michal Křen, Harald Lüngen, Signe Oksefjell Ebeling, Mícheál Ó Meachair, Ines Pisetta, Elaine Uí Dhonnchadha, Friedemann Vogel, Rebecca Wilm, Jiajin Xu, Rameela Yaddehige
Parent Title (English): 10th International Contrastive Linguistics Conference (ICLC-10), 18-21 July, 2023, Mannheim, Germany
Publisher: IDS-Verlag; Leibniz-Institut für Deutsche Sprache (IDS)
Place of publication: Mannheim
Editor: Beata Trawinski, Marc Kupietz, Kristel Proost, Jörg Zinken
Document Type: Part of a Book
Year of first Publication: 2023
Date of Publication (online): 2023/10/19
Tag: comparable corpora; contrastive linguistics; corpus linguistics; international comparable corpus; linguistic research software
First Page: 45
Last Page: 48
DDC classes: 400 Language / 400 Language, Linguistics
Open Access?: yes
Leibniz-Classification: Language, Linguistics
Linguistics-Classification: Contrastive Linguistics
Program areas: Digital Linguistics
Licence (German): Creative Commons - Attribution-ShareAlike 3.0 Germany