Refine
Document Type
- Part of a Book (3)
- Conference Proceeding (2)
- Article (1)
Language
- English (6) (remove)
Keywords
Publicationstate
Reviewstate
- (Verlags)-Lektorat (1)
- Peer-review (1)
Publisher
In this paper we present an approach to faceted search in large language resource repositories. This kind of search which enables users to browse through the repository by choosing their personal sequence of facets heavily relies on the availability of descriptive metadata for the objects in the repository. This approach therefore informs the collection of a minimal set of metatdata for language resources. The work described in this paper has been funded by the EC within the ESFRI infrastructure project CLARIN.
The authors describe two data sets submitted to the database of MWE evaluation resources: (1) cranberry expressions in English and (2) cranberry expressions in German. The first package contains a collection of 444 cranberry words in German (CWde.txt) and a collection of the corresponding cranberry expressions (CCde.txt). The second package consists of a collection of 77 cranberry words in English (CWen.txt) and a collection of the corresponding cranberry expressions (CCen.txt). The data included in these packages was extracted from the Collection of Distributionally Idiosyncratic Items (CoDII), an electronic linguistic resource of lexical items with idiosyncratic occurrence patterns. Each package contains a readme file, and can be downloaded from multiword.wiki.sourceforge.net/Resources.
The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability – with respect to the structural representation of CMC genres, linguistic annotations metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to partake in a discussion about standards for CMC corpora and for the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects which are currently underway all over Europe, there is a great window of opportunity for the creation of standards in a bottom-up approach.
The paper presents an XML schema for the representation of genres of computer-mediated communication (CMC) that is compliant with the encoding framework defined by the TEI. It was designed for the annotation of CMC documents in the project Deutsches Referenzkorpus zur internetbasierten Kommunikation (DeRiK), which aims at building a corpus on language use in the most popular CMC genres on the German-speaking Internet. The focus of the schema is on those CMC genres which are written and dialogic―such as forums, bulletin boards, chats, instant messaging, wiki and weblog discussions, microblogging on Twitter, and conversation on “social network” sites.
The schema provides a representation format for the main structural features of CMC discourse as well as elements for the annotation of those units regarded as “typical” for language use on the Internet. The schema introduces an element <posting>, which describes stretches of text that are sent to the server by a user at a certain point in time. Postings are the main constituting elements of threads and logfiles, which, in our schema, are the two main types of CMC macrostructures. For the microlevel of CMC documents (that is, the structure of the <posting> content), the schema introduces elements for selected features of Internet jargon such as emoticons, interaction words and addressing terms. It allows for easy anonymization of CMC data for purposes in which the annotated data are made publicly available and includes metadata which are necessary for referencing random excerpts from the data as references in dictionary entries or as results of corpus queries.
Documentation of the schema as well as encoding examples can be retrieved from the web at http://www.empirikom.net/bin/view/Themen/CmcTEI. The schema is meant to be a core model for representing CMC that can be modified and extended by others according to their own specific perspectives on CMC data. It could be a first step towards an integration of features for the representation of CMC genres into a future new version of the TEI Guidelines.