"Kaum [...] da, wird' ich gedisst!" Funktionale Aspekte des Banter-Prinzips auf dem Online-Prüfstand
(2016)
The article is to be considered as an attempt to enrich the theoretical approach of the Banter-Principle (Leech 1983) with an online point of view. Examples from Teamspeak conversations and comments on the social network site Facebook reveal different user practices regarding the identifiability of the Banter-Principle: nonverbal elements or emoticons used to make sure that Banter is understood correctly in written language on the one hand; coping with assigned roles depending on dynamic group-internal hierarchies in oral communication on the other hand. Nevertheless, one question remains: why should one disguise a cordial message rudely? My analysis shows two functions of Online Banter: firstly, to maximize the entertainment value of a conversation, and secondly, to establish an accepted online identity.
'Faction' im Fernsehen - Produktionsbeobachtung des Scripted Reality-Formats mieten, kaufen, wohnen
(2016)
The present investigation targets the phenomenon commonly called control. Many languages including German and Polish employ non-finite clauses (besides finite clauses) as propositional complements. The subject of these complement clauses is left unexpressed and must generally be interpreted co-referentially with the subject or object of the matrix clause (subject or object control). However, there are also infinitive-selecting verbs that do not allow for a co-referential interpretation of the embedded subject; semantically, the embedded infinitives of these anti-control verbs are thus less dependent on or less unifiable with the matrix proposition. In Polish anti-control constructions, non-finite complements are overtly marked with the complementizer żeby, suggesting that they are structurally more complex (namely, containing a C-projection) than the non-finite complements in control constructions lacking żeby (modulo special contexts, viz. 'control switch'). In a comparative perspective, the paper brings corpus-linguistic and experimental evidence to bear on the question whether, surface appearances notwithstanding, the infinitival complements of anti-control verbs in German should similarly be analyzed as truly sentential, i.e., C-headed structures.
The paper reports the results of the curation project ChatCorpus2CLARIN. The goal of the project was to develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). The paper presents an overview of the resources and practices developed in the project, describes the added value of the resource after its integration and discusses, as an outlook, to what extent these practices can be considered best practices which may be useful for the annotation and representation of other CMC and social media corpora.
A comparison between morphological complexity measures: typological data vs. language corpora
(2016)
Language complexity is an intriguing phenomenon argued to play an important role in both language learning and processing. The need to compare languages with regard to their complexity resulted in a multitude of approaches and methods, ranging from accounts targeting specific structural features to global quantification of variation more generally. In this paper, we investigate the degree to which morphological complexity measures are mutually correlated in a sample of more than 500 languages of 101 language families. We use human expert judgements from the World Atlas of Language Structures (WALS), and compare them to four quantitative measures automatically calculated from language corpora. These consist of three previously defined corpus-derived measures, which are all monolingual, and one new measure based on automatic word-alignment across pairs of languages. We find strong correlations between all the measures, illustrating that both expert judgements and automated approaches converge to similar complexity ratings, and can be used interchangeably.
There have been several attempts to annotate communicative functions to utterances of verbal feedback in English previously. Here, we suggest an annotation scheme for verbal and non-verbal feedback utterances in French including the categories base, attitude, previous and visual. The data comprises conversations, maptasks and negotiations from which we extracted ca. 13,000 candidate feedback utterances and gestures. 12 students were recruited for the annotation campaign of ca. 9,500 instances. Each instance was annotated by between 2 and 7 raters. The evaluation of the annotation agreement resulted in an average best-pair kappa of 0.6. While the base category with the values acknowledgement, evaluation, answer, elicit and other achieves good agreement, this is not the case for the other main categories. The data sets, which also include automatic extractions of lexical, positional and acoustic features, are freely available and will further be used for machine learning classification experiments to analyse the form-function relationship of feedback.
The present paper reports the first results of the compilation and annotation of a blog corpus for German. The main aim of the project is the representation of the blog discourse structure and relations between its elements (blog posts, comments) and participants (bloggers, commentators). The data included in the corpus were manually collected from the scientific blog portal SciLogs. The feature catalogue for the corpus annotation includes three types of information which are directly or indirectly provided in the blog or can be construed by means of statistical analysis or computational tools. At this point, only directly available information (e.g. title of the blog post, name of the blogger etc.) has been annotated. We believe our blog corpus can be of interest for the general study of blog structure or related research questions as well as for the development of NLP methods and techniques (e.g. for authorship detection).
The paper deals with the use of ICH WEIß NICHT (‘I don’t know’) in German talk-in-interaction. Pursuing an Interactional Linguistics approach, we identify different interactional uses of ICH WEIß NICHT and discuss their relationship to variation in argument structure (SV (O), (O)VS, V-only). After ICH WEIß NICHT with full complementation, speakers emphasize their lack of knowledge or display reluctance to answer. After variants without an object complement, in contrast, speakers display uncertainty about the truth of the following proposition or about its sufficiency as an answer. Thus, while uses with both subject and object tend to close a sequence or display lack of knowledge, responses without an object function as a prepositioned epistemic hedge or a pragmatic marker framing the following TCU. When ICH WEIß NICHT is used in response to a statement, it indexes disagreement (independently of all complementation patterns).
This study investigates high vowel laxing in the Louisiana French of the Lafourche Basin. Unlike Canadian French, in which the high vowels /i, y, u/ are traditionally described as undergoing laxing (to [ɪ, ʏ, ʊ]) in word-final syllables closed by any consonant other than a voiced fricative (see Poliquin 2006), Oukada (1977) states that in the Louisiana French of Lafourche Parish, any coda consonant will trigger high vowel laxing of /i/; he excludes both /y/ and /u/ from his discussion of high vowel laxing. The current study analyzes tokens of /i, y, u/ from pre-recorded interviews with three older male speakers from Terrebonne Parish. We measured the first and second formants and duration for high vowel tokens produced in four phonetic environments, crossing syllable type (open vs. closed) by consonant type (voiced fricative vs. any consonant other than a voiced fricative). Results of the acoustic analysis show optional laxing for /i/ and /y/ and corroborate the finding that high vowels undergo laxing in word-final closed syllables, regardless of consonant type. Data for /u/ show that the results vary widely by speaker, with the dominant pattern (shown by two out of three speakers) that of lowering and backing in the vowel space of closed syllable tokens. Duration data prove inconclusive, likely due to the effects of stress. The formant data published here constitute the first acoustic description of high vowels for any variety of Louisiana French and lay the groundwork for future study on these endangered varieties.
American English and German AI, AU observed in cognates such as Wein, wine, Haus, house are usually treated on a par, represented with the same initial vowel (cf. [ai], [au] for Am. Engl. and German [1]). Yet, acoustic measurements indicate differences, as the relevant trajectories characteristically cross in Am. Engl. but not in German. These data may indicate consistency with the same initial target for these diphthongs in German, supporting the choice of the same symbol /a/ in phonemic representation, as opposed to distinct targets (and distinct initial phonemes) in American English.
Aktuelle Änderungen des Rats für deutsche Rechtschreibung 2016 - Hintergründe und Begründungen
(2016)
The English language has taken advantage of the Digital Revolution to establish itself as the global language; however, only 28.6% of Internet users speak English as their native language. Machine Translation (MT) is a powerful technology that can bridge this gap. In development since the mid-20th century, MT has become available to every Internet user in the last decade, due to free online MT services. This paper aims to discuss the implications that these tools may have for the privacy of their users and how they are addressed by EU data protection law. It examines the data-flows in respect of the initial processing (both from the perspective of the user and the MT service provider) and potential further processing that may be undertaken by the MT service provider.
A model of grammar needs to reconcile the undesirability inherent to allomorphy, the apparent extra burden on learning and memory, with its occurrence and possible stability. OT approaches this task by positing an anti-allomorphy constraint, henceforth referred to as "OO-correspondence", which requires leveling (i.e. sameness of sound structure) in related word forms (Benua 1997). The occurrence of allomorphy then indicates crucial domination of OO-correspondence by other constraints. To assess the adequacy of this proposal it is necessary to establish the level of abstractness at which OO-correspondence applies and to examine the consequences of this decision for ranking order. While proponents of OT tacitly assume the level in question to be rather concrete, the notion of allomorphy as originally envisioned in Structuralism was defined by distinctness at a more abstract level referred to as "phonemic" (Harris 1942; Nida 1944). The basic intuition here is that the defining property of subphonemic sound properties, their conditionedness by context, entails that whatever burden they put on learning and memory is of a fundamentally different nature than that entailed by phonemic distinctness. The evidence from German supports that intuition in that leveling can be shown to target phonemic sound structure to the exclusion of subphonemic properties. Allomorphy, defined by phonemic alternation, tends to serve phonological optimization in closed class items (function words, affixes) while serving to express morphological distinctions in open class items. The key to demonstrating the correlations in question lies in the discernment of phonemic structure, which is therefore at the core of the article.
This article describes an English Zulu learners’ dictionary that is part of a larger set of information tools, namely an online Zulu course, an e-dictionary of possessives (which was implemented earlier) accompanied by training software offering translation tasks on several levels, and an ontology of morphemic items categorizing and describing all parts of speech of Zulu. The underlying lexicographic database contains the usual type of lexicographic data, such as translation equivalents and their respective morphosyntactic data, but its entries have been extended with data related to the lessons of the online course in order to enable the learner to link both tools autonomously. The ‘outer matter’ is integrated into the website in the form of several texts on additional web pages (how-to-use, typical outputs, grammar tables, information on morphosyntactic rules, etc.). The dictionary comprises a modular system, where each module fulfils one of the necessary functions.
Analepses with topic drop are high-frequency linguistic structures in interaction. Alongside the interactional-linguistic investigation of the discourse functions, conditions, and restrictions of analepses, this work centres on discourse-semantic perspectives and questions, in particular the detailed description of the semantic relations between analepses and their preceding context. The resolution of analepses must be explained in a situated manner, since understanding analepses depends on contextual embedding as well as on grammatical, semantic, and pragmatic features of the utterance.
It is shown that cognitive ascriptions regarding the participants in an interaction are also possible with interactional-linguistic methods. The study further demonstrates that combining qualitative and quantitative methods is a productive way to work out specific usage preferences of analeptic as compared with anaphoric utterances.
This thesis consists of the following three papers that all have been published in international peer-reviewed journals:
Chapter 3: Koplenig, Alexander (2015c). The Impact of Lacking Metadata for the Measurement of Cultural and Linguistic Change Using the Google Ngram Data Sets—Reconstructing the Composition of the German Corpus in Times of WWII. Published in: Digital Scholarship in the Humanities. Oxford: Oxford University Press. [doi:10.1093/llc/fqv037]
Chapter 4: Koplenig, Alexander (2015b). Why the quantitative analysis of diachronic corpora that does not consider the temporal aspect of time-series can lead to wrong conclusions. Published in: Digital Scholarship in the Humanities. Oxford: Oxford University Press. [doi:10.1093/llc/fqv030]
Chapter 5: Koplenig, Alexander (2015a). Using the parameters of the Zipf–Mandelbrot law to measure diachronic lexical, syntactical and stylistic changes – a large-scale corpus analysis. Published in: Corpus Linguistics and Linguistic Theory. Berlin/Boston: de Gruyter. [doi:10.1515/cllt-2014-0049]
Chapter 1 introduces the topic by describing and discussing several basic concepts relevant to the statistical analysis of corpus linguistic data. Chapter 2 presents a method to analyze diachronic corpus data and a summary of the three publications. Chapters 3 to 5 each represent one of the three publications. All papers are printed in this thesis with the permission of the publishers.
Annotating Discourse Relations in Spoken Language: A Comparison of the PDTB and CCR Frameworks
(2016)
In discourse relation annotation, there is currently a variety of different frameworks being used, and most of them have been developed and employed mostly on written data. This raises a number of questions regarding interoperability of discourse relation annotation schemes, as well as regarding differences in discourse annotation for written vs. spoken domains. In this paper, we describe our experiments on annotating two spoken domains from the SPICE Ireland corpus (telephone conversations and broadcast interviews) according to two different discourse annotation schemes, PDTB 3.0 and CCR. We show that annotations in the two schemes can largely be mapped onto one another, and discuss differences in operationalisations of discourse relation schemes which present a challenge to automatic mapping. We also observe systematic differences in the prevalence of implicit discourse relations in spoken data compared to written texts, and find that there are also differences in the types of causal relations between the domains. Finally, we find that PDTB 3.0 addresses many shortcomings of PDTB 2.0 with respect to the annotation of spoken discourse, and suggest further extensions. The new corpus has roughly the size of the CoNLL 2015 Shared Task test set, and we hence hope that it will be a valuable resource for the evaluation of automatic discourse relation labellers.
In the course of the events in the Arab world since 2011, the term Arabischer Frühling ('Arab Spring') gained in importance and advanced to become the key expression of the discourse. The article examines how the term Arabischer Frühling was and is linguistically realized in the German-speaking public sphere, with which linguistic means it was constructed, and with which events (at times also catastrophes) it was and is identified. It addresses the symbolic function of spring both from the historical perspective of the Vormärz period and from today's point of view. The investigation also considers the season terms Winter, Herbst and Sommer ('winter', 'autumn', 'summer') and their symbolic relationship to the Arab revolutions.
In their analysis of methods that participants use to manage the realization of practical courses of action, Kendrick and Drew (2016/this issue) focus on cases of assistance, where the need to be addressed is Self’s, and Other lends a helping hand. In our commentary, we point to other forms of cooperative engagement that are ubiquitously recruited in interaction. Imperative requests characteristically expect compliance on the grounds of Other’s already established commitment to a wider and shared course of actions. Established commitments can also provide the engine behind recruitment sequences that proceed nonverbally. And forms of cooperative engagement that are well glossed as assistance can nevertheless be demonstrably oriented to established commitments. In sum, we find commitment to shared courses of action to be an important element in the design and progression of certain recruitment sequences, where the involvement of Other is best defined as contribution. The commentary highlights the importance of interdependent orientations in the organization of cooperation. Data are in German, Italian, and Polish.
Wiegand’s opus magnum „Wörterbuchforschung“ ends with a chapter on the state and the relevant tasks for research into dictionary use in the middle of the 1990s. This article aims at reflecting on the tasks and the relevance of dictionary usage research 20 years later. I will argue that the fundamentally changed lexicographic landscape makes it necessary to shift the focus of research. In my view, the most important aim of research into dictionary use can no longer be limited to improving dictionaries. Research into dictionary use should also raise more awareness for user-orientation in general and should provide methodological reflection to enlighten the increasingly important usage statistics for online dictionaries. Another goal should be to look behind the scenes of collaborative dictionaries in order to provide background data to classify their relevance in relation to dictionaries elaborated by lexicographic experts. The crisis of lexicography also makes it necessary to broaden our view and concentrate on situations in which linguistic questions arise. In this context, we could examine in which of these situations the consultation of lexicographic data helps. In summary, the aim of research into dictionary use is to identify the fields where sound lexicographic work is really helpful for potential users.
In this paper, we describe preliminary results from an ongoing experiment wherein we classify two large unstructured text corpora—a web corpus and a newspaper corpus—by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with meta data required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross validations. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
Smiling individuals are usually perceived more favorably than non-smiling ones—they are judged as happier, more attractive, competent, and friendly. These seemingly clear and obvious consequences of smiling are assumed to be culturally universal; however, most of the psychological research is carried out in WEIRD societies (Western, Educated, Industrialized, Rich, and Democratic), and the influence of culture on social perception of nonverbal behavior is still understudied. Here we show that a smiling individual may be judged as less intelligent than the same non-smiling individual in cultures low on the GLOBE’s uncertainty avoidance dimension. Furthermore, we show that corruption at the societal level may undermine the prosocial perception of smiling—in societies with high corruption indicators, trust toward smiling individuals is reduced. This research fosters understanding of the cultural framework surrounding nonverbal communication processes and reveals that in some cultures smiling may lead to negative attributions.
Bericht über die 19. Arbeitstagung zur Gesprächsforschung vom 16. bis 18. März 2016 in Mannheim
(2016)
Image macros, also known as memes, are popular internet phenomena that, in the course of the comprehensive multimodalization of media communication, are distributed and commented on as entertainment on Facebook. This article examines these multimodal communicative artefacts, which consist of a combination of image and text, from a genre-analytic and conversation-analytic perspective, since image macros appear to exhibit consolidated patterns both in their formal and semantic design and in their interactive reception in the form of comments and replies. In this mediated interaction, various communicative patterns have emerged both at the structural level of the interaction sequences and within individual interaction units analysed at the sequence-external and sequence-internal levels. Social processes such as face-work and identity construction influence the interactive negotiation of the communicative artefact.
Brown clustering has been used to help increase parsing performance for morphologically rich languages. However, much of the work has focused on using clustering techniques to replace terminal nodes or as a feature for parsing. Instead, we choose to examine how effective Brown clustering is for unlexicalized parsing by creating data-driven POS tagsets which are then used with the Berkeley parser. We investigate cluster sizes as well as which information (e.g. words vs. lemmas) to cluster on to yield the best parser performance. Our results approach the current state-of-the-art results for the German TüBa-D/Z treebank when using parser-internal tagging.
This paper presents C-WEP, the Collection of Writing Errors by Professional Writers of German. It currently consists of 245 sentences with grammatical errors. All sentences are taken from published texts. All authors are professional writers with high skill levels with respect to German, the genres, and the topics. The purpose of this collection is to provide seeds for more sophisticated writing support tools, as only a very small proportion of those errors can be detected by state-of-the-art checkers. C-WEP is annotated on various levels and freely available.
German research on collocation(s) focuses on many different aspects. A comprehensive documentation would be impossible in this short report. Accepting that we cannot do justice to all the contributions to this area, we just pick out some influential cornerstones. This selection does not claim to be representative or balanced, but it follows the idea of constituting the backbone of the story we want to tell: our ‘German’ view of the still ongoing evolution of a notion of ‘collocation’. Although our own work concerns the theoretical background of and the empirical rationale for collocations, lexicography occupies a large space. Some of the recent publications (Wahrig 2008, Häcki Buhofer et al. 2014) represent a turn to the empirical legitimation for the selection of typical expressions. Nevertheless, linking the empirical evidence to the needs of an abstract lexicographic description (or a didactic format) is still an open issue.
Comparaison de deux marqueurs d’affirmation dans des séquences de co-construction: voilà et genau
(2016)
This contribution investigates the German response particle genau and the French response particle voilà within collaborative turn sequences in videotaped ordinary conversations. Adopting a conversation analytic approach to cross-linguistic comparison, I will show that the basic epistemic value of both particles allows them to be used in similar sequential environments. When a co-participant formulates a candidate conclusion in environments where it can be easily inferred from previous talk, first speakers may confirm the adequacy of the pre-emptive completion by voilà or genau. These particles may then also be followed by self- or other-repeats. The analyses aim to illustrate that participants rely on a variety of practices in order to positively assess a pre-emptive completion, and to refute a supposed binary opposition of refusal vs. acceptance in the receipt slot.
Constructing a Corpus
(2016)
This paper is about the workflow for construction and dissemination of FOLK (Forschungs- und Lehrkorpus Gesprochenes Deutsch – Research and Teaching Corpus of Spoken German), a large corpus of authentic spoken interaction data, recorded on audio and video. Section 2 describes in detail the tools used in the individual steps of transcription, anonymization, orthographic normalization, lemmatization and POS tagging of the data, as well as some utilities used for corpus management. Section 3 deals with the DGD (Datenbank für Gesprochenes Deutsch – Database of Spoken German) as a tool for distributing completed data sets and making them available for qualitative and quantitative analysis. In section 4, some plans for further development are sketched.
Converting and Representing Social Media Corpora into TEI: Schema and best practices from CLARIN-D
(2016)
The paper presents results from a curation project within CLARIN-D, in which an existing 1M word corpus of German chat communication has been integrated into the DEREKO and DWDS corpus infrastructures of the CLARIN-D centres at the Institute for the German Language (IDS, Mannheim) and at the Berlin-Brandenburg Academy of Sciences (BBAW, Berlin). The focus is on the solutions developed for converting and representing the corpus in a TEI format.
The present paper describes Corpus Query Lingua Franca (ISO CQLF), a specification designed at ISO Technical Committee 37 Subcommittee 4 “Language resource management” for the purpose of facilitating the comparison of properties of corpus query languages. We overview the motivation for this endeavour and present its aims and its general architecture. CQLF is intended as a multi-part specification; here, we concentrate on the basic metamodel that provides a frame that the other parts fit in.
This paper presents our model of ‘MultiWord Patterns’ (MWPs). MWPs are defined as recurrent frozen schemes with fixed lexical components and productive slots that have a holistic – but not necessarily idiomatic – meaning and/or function, sometimes only on an abstract level. These patterns can only be reconstructed with corpus-driven, iterative (qualitative-quantitative) methods. This methodology includes complex phrase searches, collocation analysis that detects not only significant word pairs but also significant syntagmatic cotext patterns, and slot analysis with our UWV Tool. This tool allows us to bundle KWICs in order to detect the nature of lexical fillers and to visualize MWP hierarchies.