This paper describes a method for extracting collocation data from text corpora based on a formal definition of syntactic structures. The method takes into account not only the POS-tagging level of annotation but also syntactic parsing (syntactic treebank model), and it introduces the possibility of controlling the canonical form of extracted collocations based on statistical data on forms with different properties in the corpus. Specifically, we describe the results of extraction from the syntactically tagged Gigafida 2.1 corpus. Using the new method, 4,002,918 collocation candidates in 81 syntactic structures were extracted. We evaluate the extracted data sample in more detail, mainly in relation to properties that affect the extraction of canonical forms: definiteness in adjectival collocations, grammatical number in noun collocations, comparison in adjectival and adverbial collocations, and letter case (uppercase and lowercase) in canonical forms. The conclusion highlights the potential of the methodology for the grammatical description of collocation and phrasal syntax, and the possibilities for improving the model in the process of compiling a digital dictionary database for Slovene.
In this paper we present Trendi, a monitor corpus of written Slovene, which has been compiled recently as part of the SLED (Monitor corpus and related resources) project. The methodology and the contents of the corpus are presented, as well as the findings of the survey that aimed to identify the needs of potential users related to topical language use. The Trendi corpus currently contains news articles and other web content from 110 different sources, with the texts being collected and linguistically annotated on a daily basis. The corpus complements Gigafida 2.0, a 1.13-billion-word reference corpus of standard written Slovene. Also discussed are the ways in which the corpus will be integrated into various lexicographic projects, helping not only in the identification of neologisms but also in monitoring changes in already identified language phenomena.
This article aims to show that it is only by comparing different multilingual communities that a typologically relevant description of such communities is made possible. An example (Brix, 1982) is presented to demonstrate that a usable description of such regions is implicitly based on comparison and what factors are important in this respect. It is shown that only a model of variables which can comprise the analogous traits of the situations as well as the differences between them enables an adequate comparison to be made. The examples of the Croats in the Burgenland and of the Slovenes in Carinthia show what the consequences for the description of the situations are and what difficulties arise with such a description. The domains of the semi-official use of language are examined to find out what parallels and differences in the factors which are used for the description of multilingual communities (e.g. economic situation, legislation, historical development, sociolinguistic situation) can be related to the peculiarities of language use in the two situations. In this way typological similarities and certain idiosyncratic characteristics of the two regions can be understood.