The article "Traitement de l’information: Spinfo, HKI et humanités numériques - l’expérience de Cologne" ("Information processing: Spinfo, HKI and digital humanities - the Cologne experience") presents the history of the development of digital humanities at the University of Cologne. The institutionalization of digital humanities began at a time when, in the German-speaking world, the scope of the discipline was still being defined by the work of a few pioneers. Among them, the role of Elisabeth Burr, active notably in Tübingen, Duisburg, Bremen and Leipzig, deserves emphasis. The article traces the development of digital humanities in Cologne from their beginnings in the 1960s, through their consolidation in the 1990s, to the last two decades, in which Cologne has become an important centre of the discipline. The process illustrates how a new scientific discipline can become institutionalized within a German university. The article describes the perspective of two founding fields, linguistic information processing (German: Sprachliche Informationsverarbeitung, Spinfo) and historical-cultural information processing (German: Historisch Kulturwissenschaftliche Informationsverarbeitung, HKI), and their synthesis, which led in 2017 to the creation of the Institute for Digital Humanities. Today the institute is, internally, a component of the Faculty of Philosophy of the University of Cologne and, externally, an integral part of the international digital humanities community.
“Die Sprach-Checker” (Eng. “Language Checkers”) are young citizen scientists from Mannheim’s highly diverse district Neckarstadt-West. Together with linguists, they investigate a tremendous treasure: their own multilingualism. They are exploring and (re)discovering their own languages and the other languages used in their environment while documenting and reflecting on their everyday experiences in and with different linguistic practices. Our aim is to raise awareness of their strengths and to promote appreciation for their language biographies, thus fostering a sense of identification with one’s own linguistic surroundings. Such a joint research endeavour offers empirical opportunities to address (linguistic) issues of societal relevance by collecting authentic data from the multicultural district and involving its residents and local stakeholders. In this paper, we will provide insights regarding the project’s background, conception, and outcomes. We address everyone who is planning or conducting a citizen science project with young people, especially children and adolescents, or who works at the interface between science and society.
This paper presents the IVK-Ler corpus, a longitudinal, annotated learner corpus of weekly writings produced by a group of 18 adolescents in a preparatory class. The corpus consists of 117 student texts collected between 2020 and 2021 and has a structure layered by student and text number. It includes metadata that enables researchers to analyze and track individual student progress in terms of syntactic competence and literacy. The annotation schema, manual and automatic annotation processes, and corpus representation are described in detail. The corpus currently includes target hypotheses and gold standard part-of-speech tags. Future work could include additional annotation layers for topological fields and dependency relations, as well as semantic and discourse annotations to make the corpus usable for tasks beyond syntactic evaluations.
Morphophonological asymmetries in affixation concern systematic correlations between morphological properties of affixes (e.g. combination with bound versus free stems, position relative to stem (suffixes versus prefixes)) and their phonological properties (e.g. stress behaviour). The arguably most insightful approach to capturing relevant asymmetries invokes a notion of affix coherence, first introduced by Dixon in connection with his work on Yidiɲ, a nearly extinct language spoken in Northern Australia. This notion is based on a categorical division of affixes into ones that integrate into the phonological word of the stem and ones that do not. The integration of affixes is envisioned as being fully determined by phonological and morphological structure in a given language and verifiable by diagnostics relevant to phonological word domains (primarily the syllable and the foot structure). The assumption of two types of prosodic domains characterized by integrated versus non-integrated affixes is manifest in consistent asymmetries that pertain to morphophonological, phonological, and phonetic rules. This consistency constitutes compelling evidence for the structure-based analysis of the impact of various affixes on derived words, as opposed to alternative approaches to capturing these effects by associating affixes with diacritics (morpheme versus word boundary, class 1 versus class 2, stratum 1 versus stratum 2). The present entry aims to demonstrate, mostly on the basis of data from Germanic languages, the breadth of the empirical evidence in support of a fundamental role of affix coherence. Moreover, it aims to draw attention to the various implications of affix coherence for modeling relevant generalizations, in particular the necessary reference to a level of phonological representation characterized by a specific degree of abstractness (‘phonemic’).
Conventional terminology resources reach their limits when it comes to automatic content classification of texts in the domain of expert-layperson communication. This can be attributed to the fact that (non-normalized) language usage does not necessarily reflect the terminological elements stored in such resources. We present several strategies to extend a terminological resource with term-related elements in order to optimize automatic content classification of expert-layperson texts.
In this chapter, we will investigate smartphone-based showing sequences in everyday social encounters, that is, moments in which a personal mobile device is used for presenting (audio-)visual content to co-present participants. Despite a growing interest in object-centred sequences and mundane technology use, detailed accounts of the sequential, multimodal, and material dimensions of showing sequences are lacking. Based on video data of social interactions in different languages and on the framework of multimodal interaction analysis, this chapter will explore the link between mobile device use and social practices. We will analyse how smartphone showers and their recipients coordinate the manipulation of a technological object with multiple courses of action, and reflect upon the fundamental complexity of this by-now routine joint activity.
We argue that properties with a nominal origin get transferred regularly in certain German particle verb constructions to properties that are propositional insofar as they characterize the temporal structure of eventualities, understood to be described by propositional (= truth-assessable) representations of state changes. Accordingly, the oft-noted perfectivizing function of certain verbal particles like ein- in einfahren ('pull in', cf. Kühnhold 1972) is the effect of redressing a conflict at the syntax-semantics interface: On the one hand, constructions like in [die Grube]acc einfahren ('pull into the mine') exhibit transitive syntax (Gehrke 2008), requiring that the syntactic arguments be mapped onto well-distinguished or DIFFERENT referents in the semantics (Kemmer 1993). On the other hand, in/ein codes a spatio-temporal inclusion relation between its relata, contradicting the requirement imposed by the transitive syntax. Following Brandt (2019), we submit that the interface executes a manoeuvre that delays the interpretation of part of the contradiction-inducing DIFFERENCE feature. It is not locally interpreted (semantically represented) in toto but in part passed on to the next syntactic-semantic computational cycle. Here, the passed-on meaning is interpreted in the locally customary terms, in the case at hand, as a temporal index where the post-state of the depicted eventuality does not hold.
Introduction
(2023)
This paper presents an extended annotation and analysis of interpretative reply relations focusing on a comparison of reply relation types and targets between conflictual pages and neutral pages of German Wikipedia (WP) talk pages. We briefly present the different categories identified for interpretative reply relations to analyze the relationship between WP postings as well as linguistic cues for each category. We investigate referencing strategies of WP authors in discussion page postings, illustrated by means of reply relation types and targets taking into account the degree of disagreement displayed on a WP talk page. We provide richly annotated data that can be used for further analyses such as the identification of interactional relations on higher levels, or for training tasks in machine learning algorithms.
Picnick and Sauerkraut: German–English intra-writer variation in script and language (1867–1900)
(2023)
Intra-writer variation is a wide-spread phenomenon that nevertheless has received only limited research attention so far. Different addressees, bi- and multilingualism, or changing life phases are among the factors that contribute to such variation. In a study of diary entries by one writer covering three decades (1867–1900), this chapter investigates patterns of intra-writer variation between German and English (language and script) in nineteenth-century Canada, with a special focus on single word borrowings, person reference and place names. The long-term perspective provides a unique insight into the dynamics of a bilingual writer’s emerging sociolinguistic competence as reflected by the flexible yet structured use of his resources within the social space of a bilingual community.
National Socialism, one could argue, was all about belonging: belonging to the ‘Volk’ or the ‘Volksgemeinschaft’, belonging to the ‘Aryan’ or ‘Non-Aryan race’, belonging to the National Socialist ‘movement’, and so on. These categories of belonging worked in both inclusionary and exclusionary ways, and they were constituted, proclaimed and enacted in large part through language. What is more, they had to be performed through communicative acts. For the normative side of National Socialist propaganda and legislation, this seems rather obvious and one-directional. On the side of the general population, however, this entailed a mixture of the communicative need to position oneself vis-à-vis National Socialism (mostly in affirmative ways) and the urge to do so willingly. When we look at the language use of ‘ordinary people’ in different communicative situations and texts during National Socialism, we have to focus on these dimensions of discursive collusion, co-constitution and appropriation. People during National Socialism, such is our hypothesis, navigated through discourses of belonging and thereby made them real and effective. Besides diaries, war letters and autobiographical writings, one way to grasp this phenomenon is to analyse petitions, i.e., letters of complaint and request sent in large numbers by ‘ordinary people’ to public authorities of the party and the state. As I will show by some examples, letter-writers tried to inscribe themselves within (what they took for) National Socialist discourses of belonging in order to legitimate their claims. By doing so, they co-constituted and co-created the discursive realm of National Socialism.
This article details the process of creating the Nottinghamer Korpus deutscher YouTube-Sprache ('The Nottingham German YouTube Language Corpus', or NottDeuYTSch corpus) and outlines potential research opportunities. The corpus was compiled to analyse the online language produced by young German speakers and offers significant opportunity for in-depth research across several linguistic fields, including lexis, morphology, syntax, orthography, and conversational and discursive analysis. The NottDeuYTSch corpus contains over 33 million words taken from approximately 3 million YouTube comments on videos published between 2008 and 2018 and targeted at a young, German-speaking demographic; it represents an authentic language snapshot of young German speakers. The corpus was proportionally sampled based on video category and year from a database of 112 popular German-speaking YouTube channels in the DACH region for optimal representativeness and balance, and it contains a considerable amount of associated metadata for each comment, which enables further longitudinal cross-sectional analyses. The NottDeuYTSch corpus is available for analysis as part of the German Reference Corpus (DeReKo).
In many European languages, propositional arguments (PAs) can be realized as different types of structures. Cross-linguistically, complex structures with PAs show a systematic correlation between the strength of the semantic bond and the syntactic union (cf. Givón 2001; Wurmbrand/Lohninger 2023). Also, different languages show similarities with respect to the (lexical) licensing of different PAs (cf. Noonan 1985; Givón 2001; Cristofaro 2003 on different predicate types). However, on a more fine-grained level, a variation across languages can be observed both with respect to the syntactic-semantic properties of PAs as well as to their licensing and usage. This presentation takes a multi-contrastive view of different types of PAs as syntactic subjects and objects by looking at five European languages: EN, DE, IT, PL and HU. Our goal is to identify the parameters of variation in the clausal domain with PAs and by this to contribute to a better understanding of the individual language systems on the one hand and the nature of the linguistic variation in the clausal domain on the other hand. Phenomena and Methodology: We investigate the following types of PAs: direct object (DO) clauses (1), prepositional object (PO) clauses (2), subject clauses (3), and nominalizations (4, 5). Additionally, we discuss clause union phenomena (6, 7). The analyzed parameters include among others finiteness, linear position of the PA, (non) presence of a correlative element, (non) presence of a complementizer, lexical-semantic class of the embedding verb. The phenomena are analyzed based on corpus data (using mono- and multilingual corpora), experimental data (acceptability judgement surveys) or introspective data.
The idea of this article is to take the immaterial and somehow ethereal nature of aesthetic concepts seriously by asking how aesthetic concepts are negotiated and thus formed in communication. My examples come from theatrical production, where aesthetic decisions naturally play a major role. In the given case, an aesthetic concept is introduced with which only the director, but none of the actors, is familiar at the beginning of the rehearsals. The concept, Wabi Sabi, comes from Japanese culture. As the whole rehearsal process was video recorded, it is possible to track how the concept is negotiated and acquired over time. So, instead of defining criteria for what Wabi Sabi as an aesthetic concept “consists of,” this article seeks to show how the concept is introduced, explained and “used” within a practical context, in this case a theater rehearsal. In contrast to conventional models of aesthetic experience, I am interested in the ways in which an aesthetic concept is configured in and through socially organized interaction, and, vice versa, how that interaction contributes to the situational accomplishment of the same concept. In short: I am interested in the “doing” of aesthetic concepts, especially in “doing Wabi Sabi.”
Recent typological studies have shown that socio-linguistic factors have a substantial effect on at least certain structures of language. However, we are still far from understanding how such factors should be operationalized and how they interact with other factors in shaping grammar. To address both questions, this study examines the influence of socio-linguistic factors on the number of dedicated conditional constructions in a sample of 374 languages. We test the number of speakers, the degree of multilingualism, the availability of a literature tradition, the use of writing, and the use of the language in the education system. At the same time, we control for genealogical, contact, and bibliographical biases. Our results suggest that the number of speakers is the most informative predictor. However, we find that the association between the number of speakers and the number of dedicated conditional constructions is much weaker than assumed, once genealogical and contact biases are controlled for.
‘Can’ and ‘must’-type modal verbs in the direct sanctioning of misconduct across European languages
(2023)
Deontic meanings of obligation and permissibility have mostly been studied in relation to modal verbs, even though researchers are aware that such meanings can be conveyed in other ways (consider, for example, the contributions to Nuyts/van der Auwera (eds.) 2016). This presentation reports on an ongoing project that examines deontic meaning but takes as its starting point not a type of linguistic structure but a particular kind of social moment that presumably attracts deontic talk: the management of potentially ‘unacceptable’ or untoward actions (taking the last bread roll at breakfast, making a disallowed move during a board game, etc.). Data come from a multi-language parallel video corpus of everyday social interaction in English, German, Italian, and Polish. Here, we focus on moments in which one person sanctions another’s behavior as unacceptable. Using interactional-linguistic methods (Couper-Kuhlen/Selting 2018), we examine similarities and differences across these four languages in the use of modal verbs as part of such sanctioning attempts. First results suggest that modal verbs are not as common in the sanctioning of misconduct as one might expect. Across the four languages, only between 10% and 20% of relevant sequences involve a modal verb. Most of the time, in this context, speakers achieve deontic meaning in other ways (e.g., infinitives such as German nicht so schmatzen, ‘no smacking’). This raises the question of what exactly modal verbs, on those relatively rare occasions when they are used, contribute to the accomplishment of deontic meaning. The reported study pursues this question in two ways: 1) by considering similarities across languages in the ways that modal verbs interact with other (verbal) means in the sanctioning of misconduct; 2) by considering differences across languages in the use of modal verbs.
Here, we find that the relevant modal verbs are used similarly in some activity contexts (enforcing rules during board games), but less so in other activity contexts (mundane situations with no codified rules). In sum, the presented study adds to cross-linguistically grounded knowledge about deontic meaning and its relationships to linguistic structures.
This presentation deals with collaborative turn-sequences (Lerner 2004), a syntactically coherent unit of talk that is jointly formulated by at least two speakers, in Czech and German everyday conversations. Based on conversation analysis (e.g., Schegloff 2007) and a multimodal approach to social interaction (e.g., Deppermann/Streeck 2018), we aim at comparing recurrent patterns and action types within co-constructional sequences in both languages. The practice of co-constructing turns-at-talk has been described for typologically different languages, especially for English (e.g., Lerner 1996, 2004), but also for languages such as Japanese (Hayashi 2003) or Finnish (Helasvuo 2004). For German, various forms and functions of co-constructions have already been investigated (e.g., Brenning 2015); for Czech, a detailed, interactionally based description is still pending (but see some initial observations in, e.g., Hoffmannová/Homoláč/Mrázková (eds.) 2019). Although the existence of co-constructions in different languages points to a cross-linguistic conversational practice, few explicitly comparative studies exist (see, e.g., Lerner/Takagi 1999, for English and Japanese). The language pair Czech-German has mainly been studied with respect to language contact and without specifically considering spoken language or complex conversational sequences (e.g., Nekula/Šichová/Valdrová 2013). Therefore, our second aim is to sketch out a first comparison of co-constructional sequences in German and Czech, thereby contributing to the growing field of comparative and cross-linguistic studies within conversation analysis (e.g., Betz et al. (eds.) 2021; Dingemanse/Enfield 2015; Sidnell (ed.) 2009). More specifically, we will present three main sequential patterns of co-constructional sequences, focusing on the type of action a second speaker carries out by completing a first speaker’s possibly incomplete turn-at-talk, and on how the initial speaker then responds to this suggested completion (Lerner 2004). Excerpts from video recordings of Czech and German ordinary conversations will illustrate these recurrent co-constructional sequence types, i.e., offering help during word searches (see example 1 above), displaying understanding, or claiming independent knowledge. The third objective of this paper is to underline the participants’ orientation to similar interactional problems, solved by specific syntactic and/or lexical formats in Czech and German. Considering the more recent focus on the embodied dimension of co-constructional practices (e.g., Dressel 2020), we will also investigate the multimodal formatting of a started utterance as more or less “permeable” (Lerner 1996) for co-participant completion, the participants’ mutual embodied orientation, and possible embodied responses to others’ turn-completions (such as head nods or eyebrow flashes, cf. De Stefani 2021). More generally, this contribution reflects on the possibilities and challenges of a cross-linguistic comparison of complex multimodal sequences.
In this presentation I show first results from an ongoing study about syntactic complexity of sanctioning turns in spoken language. This study is part of a larger project on sanctioning of misconduct in social interaction in different European languages (English, German, Italian and Polish). For the study I use video recordings of different everyday settings (family breakfasts, board game interactions and car rides) with three or four participants. These data come from the Parallel European Corpus of Informal Interaction (Kornfeld/Küttner/Zinken 2023; Küttner et al. submitted). I focus on sanctioning turns with more than one turn-constructional unit (see among others for TCUs: Sacks/Schegloff/Jefferson 1974; Clayman 2013). The study asks how often TCUs are linked to each other in the different languages, for what function, and how language diversity enters into this. Note that complex sanctioning turns do not always come as complex sentences.
The International Comparable Corpus (ICC) (Kirk/Čermáková 2017; Čermáková et al. 2021) is an open initiative which aims to improve the empirical basis for contrastive linguistics by compiling comparable corpora for many languages and making them as freely available as possible as well as providing tools with which they can easily be queried and analysed. In this contribution we present the first release of written language parts of the ICC which includes corpora for Chinese, Czech, English, German, Irish (partly), and Norwegian. Each of the released corpora contains 400k words distributed over 14 different text categories according to the ICC specifications. Our poster covers the design basics of the ICC, its TEI encoding, a demonstration of using the ICC via different query tools, and an outlook on future plans.
Similar to the European Reference Corpus EuReCo (Kupietz et al. 2020), ICC follows the approach of reusing existing linguistic resources wherever possible in order to cover as many languages as possible with realistic effort in as short a time as possible. In contrast to EuReCo, however, comparable corpus pairs are not defined dynamically in the usage phase, but the compositions of the corpora are fixed in the ICC design. The approaches are thus complementary in this respect. The design principles and composition of the ICC are based on those of the International Corpus of English (ICE) (Greenbaum (ed.) 1996), with the deviation that the ICC includes the additional text category blog post and excludes spoken legal texts (see Čermáková et al. 2021 for details). ICC’s fixed-design approach has the advantage that all single-language corpora in the ICC have the same composition with respect to the selected text types and that this guarantees that the selected broad spectrum of potential influencing variables for linguistic variation is always represented. The disadvantage, however, is that this can only be achieved for quite small corpora and that the generalisability of comparative findings based on the ICC corpora will often need to be checked on larger monolingual corpora or translation corpora (Čermáková/Ebeling/Oksefjell Ebeling forthcoming). Arguing that such issues with comparability and representativeness are inevitable, in one way or the other, and need to be dealt with, our poster will discuss and exemplify the text selections in more detail.
The issue: We discuss (declarative) prepositional object clauses (PO-clauses) in the West Germanic languages Dutch (NL), German (DE), and English (EN). In Dutch and German, PO-clauses occur with a prepositional proform (=PPF, Dutch: ervan, erover, etc.; German: drauf/darauf, drüber/darüber, etc.). This proform is optional with some verbs (1). In English, by contrast, P embeds a clausal complement in the case of gerunds or indirect questions (2); P is obligatorily absent, however, when the embedded CP is a that-clause in its base position (3a). When the that-clause is passivized or topicalized, the stranded P is obligatory (3b). Given this scenario, we will address the following questions: i) Are there structural differences between PO-clauses with a P/PPF and those in which the P/PPF is optionally or obligatorily omitted? ii) In particular, do PO-clauses without P/PPF structurally coincide with direct object (=DO) clauses? iii) To what extent are case and nominal properties of clauses relevant? We use wh-extraction as a relevant test for such differences.
Previous research: Based on pronominalization and topicalization data in German and Dutch, PO-clauses are different from DO-clauses independent of the presence of the PPF (see, e.g., Breindl 1989; Zifonun/Hoffmann/Strecker 1997; Berman 2003; Broekhuis/Corver 2015 and references therein) (4,5). English pronominalization and topicalization data (3b) appear to point in the same direction (Fischer 1997; Berman 2003; Delicado Cantero 2013). However, the obligatory absence of P before that-clauses in base position indicates a convergence with DO-clauses.
Experimental evidence: To provide further evidence to these questions we tested PO-clauses in all three languages for long wh-extraction, which is usually possible for DO-clauses in English and Dutch, and in German for southern regional varieties. For German and Dutch we conducted rating studies using the thermometer method (Featherston 2008). Each study contained two sets of sentences: the first set tested long wh-extraction with regular DO-clauses (6). The second set tested wh-extraction from PO-clauses with and without PPFs (7), respectively. The results show no significant difference in extraction with PO-clauses whether or not the PPF was present even for those speakers who otherwise accept long-distance extraction in German. This supports a uniform analysis of PO-clauses with and without the PPF in contrast to DO-clauses. For English we tested extraction with verbs that select for PP-objects in two configurations: V+that-clause and V+P-gerund (8) in comparison to sentences without extraction. Participants rated sentences on a scale of 1 (unnatural) to 7 (natural). We included the gerund for English as this is a regular alternative for such objects. The results show that extraction is licit in both configurations. This suggests that English PO-clauses are different from German and Dutch PO-clauses: They rather behave as DO-clauses allowing for extraction. Note though, that the availability of extraction from P+gerund also shows that PPs are not islands for extraction in English. Overall, this shows that there is a split between English vs. German/Dutch PO-clauses when the P/PPF is absent. While these clauses behave like PO-clauses in the latter languages, extraction does not show a difference between DO- and PO-clauses in English. We will discuss the results in relation to the questions i)–iii) above.
This conversation analytic study compares the use of negation particles in spoken German and Persian, namely nein/nee and na. While these particles have a range of functions in both languages (Ghaderi 2022; Imo 2017), their use in response to news remains understudied. We focus on nein/nee and na in two sequential contexts: (i) after prior disconfirmations (Extract (a)) and (ii) in response to either solicited or unsolicited informings (see Extracts (b) and (c), respectively). In both contexts, nein/nee and na mark unexpectedness and open up an opportunity space for more, but they do so in different ways and with different outcomes. Nein/nee- and na-turns after disconfirming, often minimal responses to first-position confirmable turns mark the prior as unexpected (or even contrasting with the nein/nee/na-speaker’s expectations) and thus as expandable/accountable (cf. Ford 2001; Gubina/Betz 2021). Nein/nee/na-turns after informings (e.g., announcements that display a story teller’s negative emotional stance) differ not only in sequential position but also in prosodic realization. They can be either falling or rising, but all are characterized by marked prosody, i.e., lengthening, very low onset, smiling or breathy voice, or high overall pitch. Through position and turn design features, such nein/nee- and na-turns not only mark a prior turn as counter to (normative) expectations, but may also display the speaker’s affective stance and affiliate with the affective stance of the prior interactant. By comparing the use of nein/nee and na in German and Persian in the two functions illustrated in Extracts (a) and (b/c), we will show (i) how nein/nee- and na-turns shape interactional trajectories after responsive actions and (ii) what role the particles play in managing news and stance-taking as well as epistemic and affective positioning. 
Apart from revealing similarities in the use of German and Persian negation particles, the results of our crosslinguistic comparison will demonstrate that even if different languages have similar practices for specific actions, the use of these practices is language- and culture-specific. This means that even similar practices in different languages have their own “collateral effects” (Sidnell/Enfield 2012), linguistic and prosodic characteristic features, and, at least sometimes, consequences for social actions accomplished in the specific language (e.g., Dingemanse/Blythe/Dirksmeyer 2014; Evans/Levinson 2009; Floyd/Rossi/Enfield (eds.) 2020; Fox et al. 2009). Our study uses the method of Conversation Analysis (Sidnell/Stivers (eds.) 2013) and draws on more than 80 hours of audio and video recordings of spontaneous interactions (co-present, via video link, and on the telephone) in everyday and institutional contexts.
Any bilingual dictionary is contrastive by nature, as it documents linguistic information between language pairs. However, most bilingual dictionaries are designed and compiled as little more than lists of lexical or semantic equivalents. In internet forums, one can observe a huge interest in acquiring more comprehensive knowledge about specific lexical items or pairs that invite comparison because they pose lexical-semantic challenges. In particular, these often concern easily confused pairs (e.g. false friends or paronyms) and new terms increasingly travelling between languages in news and social media (Šetka-Čilić/Ilić Plauc 2021). With regard to English and German, either the fundamental comparative principles upon which contrastive guides should be built are absent, or specialised contrastive dictionaries simply do not exist, e.g. comprehensive descriptive resources for false friends, paronyms, protologisms or neologisms (see Gouws/Prinsloo/de Schryver 2004). As a result, users turn to electronic resources such as Google Translate, blogs and language forums for help. For example, English words such as muscular have two German translation options.
These are the two confusables muskulär and muskulös, which exhibit different semantic profiles. German sensitiv/sensibel and their English formal counterparts sensitive/sensible are false friends. However, these terms are highly polysemous in both languages and have semantic features in common. Bilingual dictionaries hardly capture their full meaning spectrum in a way that allows for a full comparison. Translating protologisms such as German Doppelwumms as well as more established new words is one of the most challenging problems. Currently, online tools translate German neologisms such as Klimakleber as climate glue (instead of climate activist glueing him-/herself onto objects), causing mistakes and contextual distortion. Most challenges users face today are well-known (e.g. Rets 2016). New terms are often unregistered in dictionaries, and it is often impossible to make appropriate choices between two or more (commonly misused) words across two languages (e.g. Benzehra 2007). These are all relevant problems for translators and language learners alike (e.g. González Ribao 2019).
This paper calls for the integration of insights from contrastive lexicology into modern bilingual lexicography. To turn dictionaries into valuable resources and to create productive strategies in a learning environment, the practice of writing dictionaries requires a critical re-assessment. Furthermore, the full potential of electronic contrastive resources needs to be recognised and put into practice. After all, monolingual German lexicography has started to reflect on how users’ needs can be accounted for in specific comparative linguistic situations. Some of these ideas can be comfortably extended to bilingual reference guides. On the one hand, this paper will deliver a critical account of some English-German/German-English dictionaries and touch on the shortcomings of contemporary bilingual lexicography. On the other hand, with the help of fictitious resources I will demonstrate contrastive structures as focal points of consultations which answer some of the more frequent language questions more reliably. Among other things, I will explain how user-friendly dictionaries need to be built so that false friends or easily confusable words can be translated efficiently from the source language into the target language. With regard to neologisms, I will show how more elaborate discursive descriptions and definitions can help language learners acquire the necessary extra-linguistic knowledge. Overall, this could improve the role of specialised dictionaries in the teaching or translating process (cf. Miliç/Sadri/Glušac 2019).
The ubiquity of smartphones has been recognised within conversation analysis as having an impact on conversational structures and on the participants’ interactional involvement. However, most of the previous studies have relied exclusively on video recordings of overall encounters and have not systematically considered what is taking place on the device. Due to the personal nature of smartphones and their small displays, onscreen activities are of limited visibility and are thus potentially opaque for both the co-present participants (“participant opacity”) and the researchers (“analytical opacity”). While opacity can be an inherent feature of smartphones in general, analytical opacity might not be desirable for research purposes. This chapter discusses how a recording set-up consisting of static cameras, wearable cameras and dynamic screen captures allowed us to address the analytical opacity of mobile devices. Excerpts from multi-source video data of everyday encounters will illustrate how the combination of multiple perspectives can increase the visibility of interactional phenomena, reveal new analytical objects and improve analytical granularity. More specifically, these examples will emphasise the analytical advantages and challenges of a combined recording set-up with regard to smartphone use as multiactivity, the role of the affordances of the mobile device, and the prototypicality and “naturalness” of the recorded practices.
In G, E, I, and H there are constructions with accusative NPs being the external argument of an infinitival, (1) to (4). In P these accusative NPs can only co-occur with an adjectival participle, (5), a construction also occurring in E, (6). The talk compares the syntactic and semantic structure of these constructions focussing on the syntactic category of the nonfinite clause, the status of the accusative NP, the status of the infinitive, restructuring effects, and embedding predicates (including aspect).
i. As to G, E, I, and H, the infinitival clause is regarded as a TP, i.e., a small clause. Its accusative NP and infinitival predicate form a unit – [4], [12], [8]. The AcI denotes, according to [4], an eventuality, which prevents it from being negated. Its subject is case marked by the matrix predicate, either by ECM or subject-to-object raising – [9] and [10]. AcI-constructions can show clause union effects, (7). H additionally allows Dative subjects in infinitive clauses, the latter only being licensed by impersonal predicates and co-occurring with an agreeing infinitive, (8a), – [3]. In case there is no agreeing infinitive, the Dative NP is the experiencer of the matrix clause, (8b). As for Italian, it allows Nominative subject NPs in the infinitive clause, (9a, b).
ii. As to P, small clause constructions differ structurally from E, G, I and H ones – [6], [7]. P small clauses are realizable by copula constructions with verbal być ‘be’, pronominal to ‘it’, (10), or “dual” copula elements (co-occurrence of a pronominal and a verbal element, [1]), varying with respect to selectional restrictions (part of speech or case within complement phrases, extraction possibilities, [1]). The P counterpart to the AcI-constructions is the secondary predication over an accusative object via an adjectival present participle, (5), (11) and (12). The adjectival participle construction is systematically paraphrasable via clauses introduced by jak ‘how’ (11’) and (12’). In P, adjectival phrases like recytującego wiersz ‘reciting’, (11), and wracającego z podróży ‘returning’, (12), clearly function as adjuncts of the accusative object go ‘him’. In our talk, we will compare this P view to languages with typical AcI-constructions, where the AcI-clause is standardly analyzed as a complement of a matrix verb.
A central goal of linguistics is to understand the diverse ways in which human language can be organized (Gibson et al. 2019; Lupyan/Dale 2016). In our contribution, we present results of a large-scale quantitative cross-linguistic analysis of the statistical structure of written language (Koplenig/Wolfer/Meyer 2023), approaching this question from an information-theoretic perspective. To this end, we have trained a language model on more than 6,500 different documents as represented in 41 parallel/multilingual text collections, so-called corpora, consisting of ~3.5 billion words or ~9.0 billion characters and covering 2,069 different languages, which are spoken as a native language by more than 90% of the world population and represent ~46% of all languages that have a standardized written representation. Figure 1 shows that our database covers a large variety of different text types, e.g. religious texts, legalese texts, subtitles for various movies and talks, newspaper texts, web crawls, Wikipedia articles, or translated example sentences from a free collaborative online database. Furthermore, we use word frequency information from the Crúbadán project, which aims at creating text corpora for a large number of (especially under-resourced) languages (Scannell 2007). We statistically infer the entropy rate of each language model as an information-theoretic index of (un)predictability/complexity (Schürmann/Grassberger 1996; Takahira/Tanaka-Ishii/Dębowski 2016).
Equipped with this database and information-theoretic estimation framework, we first evaluate the so-called ‘equi-complexity hypothesis’, the idea that all languages are equally complex (Sampson 2009). We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. This constitutes evidence against the equi-complexity hypothesis from an information-theoretic perspective. We then present, discuss and evaluate evidence for a complexity-efficiency trade-off that unexpectedly emerged when we analysed our database: high-entropy languages tend to need fewer symbols to encode messages and vice versa. Given that, from an information theoretic point of view, the message length quantifies efficiency – the shorter the encoded message the higher the efficiency (Gibson et al. 2019) – this indicates that human languages trade off efficiency against complexity. More explicitly, a higher average amount of choice/uncertainty per produced/received symbol is compensated by a shorter average message length. Finally, we present results that could point toward the idea that the absolute amount of information in parallel texts is invariant across different languages.
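The trade-off between per-symbol uncertainty and message length can be illustrated with a minimal sketch. It is not the estimation procedure of the study, which infers entropy rates from a trained language model; as a deliberately simplified stand-in, the code below uses the Shannon entropy of the character distribution, and the two toy "parallel messages" are hypothetical strings chosen only for illustration.

```python
import math
from collections import Counter


def unigram_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution.

    ASSUMPTION: a crude proxy for the entropy rate; it ignores all
    dependencies between symbols, unlike a trained language model.
    """
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def total_information(text: str) -> float:
    """Entropy per symbol multiplied by message length (in bits)."""
    return unigram_entropy(text) * len(text)


# Hypothetical toy messages standing in for two encodings of the same content.
verbose = "aabbaabbaabbaabb"  # 16 symbols, 2 symbol types
compact = "abcdabcd"          # 8 symbols, 4 symbol types

for name, msg in [("verbose", verbose), ("compact", compact)]:
    print(name, len(msg), unigram_entropy(msg), total_information(msg))
```

In this toy case the message with the higher per-symbol entropy is half as long, so the product of entropy and length is the same for both, which mirrors the idea that more choice per symbol can be compensated by a shorter message.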
Interactants who encounter co-participant conduct which they find to be socio-normatively problematic or troublesome are faced with a range of choices. First and foremost, this includes the issue of whether to directly address it, or to simply ‘let it pass’ (at least for now) (Emerson/Messinger 1977). In the case of the former, the issue then becomes how to address it. Across the various ways in which participants can pragmatically engage with what they perceive to be transgressive or untoward behavior (e.g., Pomerantz 1978; Schegloff 1988b; Dersley/Wootton 2000; Günthner 2000; Bolden/Robinson 2011; Potter/Hepburn 2020; see also Rodriguez 2022), they sometimes meta-pragmatically formulate the co-participant’s doings in terms of specific actions. Such action descriptions are necessarily selective (Sacks 1963; Schegloff 1972, 1988a; Sidnell/Barnes 2013): They foreground certain aspects of the co-participant’s conduct, while backgrounding others, and thus contribute to publicly construing the formulated conduct in particular ways (Jayyusi 1993), viz. as socio-normatively problematic, transgressive or untoward, and interactionally accountable (Robinson 2016; Sidnell 2017).
It is well known that the distribution of lexical and grammatical patterns is size- and register-sensitive (Biber 1986, and later publications). This fact alone presents a challenge to many corpus-oriented linguistic studies focusing on a single language. When it comes to cross-linguistic studies using corpora, the challenge becomes even greater due to the lack of high-quality multilingual corpora (Kupietz et al. 2020; Kupietz/Trawiński 2022), which are comparable with respect to the size and the register. That was the motivation for the creation of the European Reference Corpus EuReCo, an initiative started in 2013 at the Leibniz Institute for the German Language (IDS) together with several European partners (Kupietz et al. 2020). EuReCo is an emerging federated corpus, with large virtual comparable corpora across various languages and with an infrastructure supporting contrastive research. The core of the infrastructure is KorAP (Diewald et al. 2016), a scalable open-source platform supporting the analysis and visualisation of properties of texts annotated by multiple and potentially conflicting information layers, and supporting several corpus query languages. Until recently, EuReCo consisted of three monolingual subparts: the German Reference Corpus DeReKo (Kupietz et al. 2018), the Reference Corpus of Contemporary Romanian Language (Barbu Mititelu/Tufiş/Irimia 2018), and the Hungarian National Corpus (Váradi 2002). The goal of the present submission is twofold. On the one hand, it reports about the new component of EuReCo: a sample of the National Corpus of Polish (Przepiórkowski et al. 2010). On the other hand, it presents the results of a new pilot study using the newly extended EuReCo. This pilot study investigates selected Polish collocations involving light verbs and their prepositional / nominal complements (Fig. 1) and extends the collocation analyses of German, Romanian and Hungarian (Fig. 2) discussed in Kupietz/Trawiński (2022).
It is a ubiquitous phenomenon of everyday interaction that participants confront their co-participants for behaviour that they assess as undesirable or in some other way untoward. In a set of video data of informal interaction from the PECII corpus (Parallel European Corpus of Informal Interaction), cases of such sanctions have been collected in English, German, Italian and Polish data. This study presents work in progress and focuses on interrogatively formatted sanctions, in particular on non-polar interrogatives. It has already been shown that interrogatives can do much more than ask questions (Huddleston 1994). They can also function as directives (Lindström et al. 2017) or, more specifically, as requests (Curl/Drew 2008), as invitations (Margutti/Galatolo 2018) or reproaches (Klattenberg 2021), among others. What makes them interesting for cross-linguistic comparison is that the four languages under consideration provide different morphological and (morpho-)syntactic resources for the realization of interrogative phrases. For example, German provides the option of building in the modal particle denn, which reveals a previous lack of clarity and obliges the co-participant(s) to deliver the missing information (Deppermann 2009). Of course, the other three languages have modal particles, too (e.g. allora in Italian or though in English), but they do not seem to convey the same semantic and interactional qualities as denn. From an interactional point of view, one could think that interrogatives are a typical and effective way of soliciting accounts, since formally they open up a conditionally relevant space for an answer or a reaction. But as the data show, this does not guarantee that they are actually responded to. Another relevant aspect in the context of sanctions is that the interrogative format seems to carry a certain ‘openness’ that might be seen as a mitigating effect and thus provides an interesting point of comparison with other mitigating devices. This study uses the methods of conversation analysis and interactional linguistics. It is based on a collection of 148 interrogative sanctions (out of which 84 are non-polar interrogatives) covering the four languages. I draw on coded data from roughly 1000 cases to get a first overall idea of how the interrogative format might differ from other formats, and how it might interrelate with specific features – for example, whether an account is subsequently delivered. Going into more depth, the interrogative sanctions will then be analyzed with respect to their formal design (e.g. polar questions vs. content questions vs. tag questions, Rossano 2010; Hayano 2013) and to their pragmatic implications. I also analyze reactions to such sanctions – both formally (cf. Enfield et al. 2019, 279) and, again, from an interactional perspective (e.g. acceptance/compliance vs. challenging/defiance; Kent 2012; Cekaite 2020). A more detailed look at the sequential unfolding of some particularly interesting instances of sanctioning interrogatives will make the picture complete.
Contrastive analysis of climate-related neologisms registered in German and French Wikipedia
(2023)
Neologisms represent new social norms, tendencies, controversies and attitudes. They denote new or changed concepts which are constantly being negotiated between different members of the discourse community (Wodak 2022 and Catalano/Waugh (eds.) 2020). Neologisms help to identify new communicative patterns and narratives which illustrate different strings of discourse in everyday life. In recent years, many neologisms relating to the subject of the environment and climate have been emerging around the world, mainly due to dominant discussions on climate change and the movement “Fridays for Future”. In German, for example, neologisms such as Klimakleber, klimaresilient and globaler Streik and in French neologisms such as éco-anxiété, justice climatique and écocitoyen could be observed. These neologisms occur in many domains of life, for example in politics, media and also in advertising, which means that “l’importance croissante des enjeux environnementaux dans les discours politiques, médiatiques et publicitaires” [the growing importance of environmental issues in political, media and advertising discourse] (Balnat/Gérard 2022, p. 22) can be identified. However, it is not only the occurrence of environment- or climate-related topics that is increasing, but also the polarisation of the public debate. The polarisation within public discourse is based on the fact that there are opposing positions which are represented by new or recently relevant terms such as activistes du climat (or Klimaaktivisten) and climatosceptiques (or Klimaskeptiker) (Balnat/Gérard 2022, p. 22). Due to different identifications with one or the other side, one can also speak of an “affrontement idéologique” [ideological confrontation] (Balnat/Gérard 2022, p. 23). The explosive nature and the high complexity of the debate on climate and environmental issues mean that many words are naturally unfamiliar to people. This is especially true with regard to neologisms. In addition, it is often not only the new word itself but also the signified concept that is initially unknown.
When people then look up words, they often do so on the Internet. Wikipedia as a “free encyclopedia” (Wikipedia 2023) is particularly well suited as an object of study with regard to neologisms, since factual knowledge is given special attention there. Furthermore, this reference guide is perceived as a regular source of agreed and common knowledge on all sorts of subjects. Hence, the descriptions found here represent social agreement on controversial terms and discussions to some degree. In this paper, German and French neologisms from the subject area of climate and environment will be examined primarily in Wikipedia, but also in the neighbouring resource Wiktionary, which is “a collaborative project to produce a free-content multilingual dictionary” (Wiktionary 2023). Since Wikipedia and Wiktionary are available in French and in German, both are equally suitable for the contrastive analysis. Thus, Wikipedia articles which are accessible in both languages (e.g. Klimanotstand and État d’urgence climatique) or Wikipedia articles about similar events and phenomena (e.g. Letzte Generation and Dernière Rénovation) will be compared. In addition, we will have a closer look at other new terms specifying different thematic aspects of the discourse of climate and environment. We will mainly refer to those lexical items which can be found in the respective articles in both languages. Special emphasis will be on overlaps and differences, thematic foci, speakers’ positions and evaluative terms.
10. International Contrastive Linguistics Conference (ICLC)
Our everyday lives in any social community are shaped by rules (e.g., Roughley 2019; Schmidt/Rakoczy 2019). Rules (in a broad sense) are interactionally negotiated, monitored, enforced, and serve as an ‘orientation value’ in social life. If someone’s behavior is treated as norm-violating or problematic in a certain way, it may therefore be confronted. Confronting interlocutors can immediately stop, modify, or retrospectively reprimand the misconduct of others in a moralizing manner. Such confrontations of problem behavior occur commonly in informal interactions. On the basis of our corpus, specifically in informal interactions at the table, I observed that, for example, in Polish, German and British English, direct confrontations occur on average at least once every three minutes. Participants design these actions in a variety of ways, but like everything in interaction, the design is not arbitrary (Sacks 1984; Enfield/Sidnell 2019). A recurrent feature of such turns is connecting misconduct to some more general concepts. It is evident from the data that e.g. speakers of German and Polish use ‘generally valid statements’ in problematic moments (cf. Küttner/Vatanen/Zinken 2022) to reach the closure of the problem sequence, also specifically dealing there with the distribution of deontic and epistemic rights (Rogowska in prep.). I ask when and for what purpose generality, that is, abstracting from a concrete behaviour, is used as a tool while confronting others. The focus is on sequential and linguistic features of abstracting in confronting moments in language comparison. What are the methods to achieve abstraction: i) defocusing the confronted, specific agent (cf. Zinken et al. 2021; Siewierska 2008), e.g. nur derjenige der dran ist der darf die bedingungen für den handel stellen (only the one whose turn it is may set the conditions for the trade); ii) using extreme case formulations (Pomerantz 1986), e.g. na siostrę zawsze można liczyć (you can always count on a sister); iii) referring to stable character traits, e.g. Matylda bardzo chetne by podala. (.) Ona jest taka skora do pomocy (Matylda would be very happy to pass (it to you). (.) She is so eager to help); or iv) broader categorizing of the given referent, e.g. do not build (.) do do not build do not build swastikas (when a) German guy is filming us? Sometimes, even several loci of abstraction are combined in the same turn. Can we identify language-specific and cross-linguistic patterns? What are the interactional consequences: enforcing compliant behavior in the future, eliciting an apology or cognitively simplifying complex problems? From a comparative perspective, I ask whether going beyond the here-and-now while confronting others is a practice that unites speakers across languages and is thus a human cognitive strategy to display normativity. This ongoing study is based on new comparable data from four European languages from informal interaction during activities around the table (Kornfeld/Küttner/Zinken 2023; Küttner et al. in prep.). The phenomenon was coded systematically in each of the four languages as part of a larger, quantitatively oriented study with different questions (Küttner et al. submitted). In the talk, I will present exemplary Polish and German evidence. I use the methods of Conversation Analysis (Sidnell/Stivers (eds.) 2012) and Interactional Linguistics (Imo/Lanwer 2019).
Tense, aspect, and mood are grammatical categories concerned with different notional facets of the event or situation conveyed by a given clause. They are prototypically expressed by the verbal system. Tense can be defined as a category that relates points or intervals in time to one another; in a most basic model, those include the time of the event or situation referred to and the speech time. The former may precede the latter (“past”), follow it (“future”), or be simultaneous with it (or at least overlap with it; “present”). Aspect is concerned with the internal temporal constituency of the event or situation, which may be viewed as a single whole (“perfective”) or with particular reference to its internal structure (“imperfective”), including its being ongoing at a certain point in time (“progressive”). Mood, in a narrow, morphological sense, refers to the inflectional realization of modality, with modality encompassing a large and varying set of sub-concepts such as possibility, necessity, probability, obligation, permission, ability, and volition. In the domain of tense, all Germanic languages make a distinction between non-past and past. In most languages, the opposition can be expressed inflectionally, namely, by the present and preterite (indicative). All modern languages also have a periphrastic perfect as well as periphrastic forms that can be used to refer to future events. Aspect is characteristically absent as a morphological category across the entire family, but most, if not all, modern languages have periphrastic forms for the expression of aspectual categories such as progressiveness. Regarding mood, Germanic languages are commonly described as distinguishing up to three such form paradigms, namely, indicative, imperative, and a third one referred to here as subjunctive. Morphologically distinct subjunctive forms are, however, more typical of earlier stages of Germanic than they are of most present-day languages.
This paper has two distinct but interdependent goals. The empirical and analytical primary goal is to present a detailed overview of the patterns of (syntactico-semantic) argument structure and (morpho-syntactic) argument realization found with clause-embedding predicates in German. In particular, it will elucidate the observable relationships and dependencies between them, with a special focus on prepositional object clauses. The methodological secondary goal is to demonstrate the recently published ZAS Database of Clause-Embedding Predicates and illustrate its usefulness in approaching a concrete research agenda. The goals are aligned with each other because the data on patterns of argument structure and realization were collected using the database, and indeed the relevant questions could not have been investigated in such a thorough and efficient way without it. We will begin in Part 1 with an introduction to the database, its structure, and why and how it was created, before moving in Part 2 to the presentation of the data and analysis of argument structure and argument realization.
The present article proposes a syntactic and semantic analysis of assertive clauses that comprises their truth-conditional aspects and their speech act potential in communication. What is commonly called “illocutionary force” is differentiated into three structurally and functionally distinct layers: a judgement phrase, representing subjective epistemic and evidential attitudes; a commitment phrase, representing the social commitment related to assertions; and an act phrase, representing the relation to the common ground of the conversation. The article provides several pieces of evidence for this structure: from the interpretation and syntactic position of various classes of epistemic, evidential, affirmative and speech act-related operators, from clausal complements embedded by different types of predicates, from embedded root clauses, and from anaphora referring to different clausal projections. The syntactic assumptions are phrased within X-bar theory, and the semantic interpretation makes use of dynamic update of common ground, differentiating between informative and performative updates. The object language is German, with particular reference to verb final and verb second structure.
This paper first argues that the distinction between Propositions and States-of-Affairs is significant for understanding a number of linguistic contrasts, including contrasts between nominalizations, complement clauses, readings of modal infinitives, raising constructions, illocutions and moods, relative clauses, and nouns. Subsequently, the paper outlines a cognitive linguistic model of the distinction, according to which Propositions and States-of-Affairs differ in terms of construal. Both prompt Langackerian “processes”, but only Propositions prompt a construal of these processes as referential. The paper argues that this model has a number of advantages over a traditional, denotational understanding of the distinction.
When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.
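The speed criterion mentioned above can be made concrete with a small sketch. It does not reproduce the paper's evaluation or any of the tools it compares; the regex tokenizer and the helper `tokens_per_second` are hypothetical stand-ins, assumed here only to show how throughput might be measured alongside the output.

```python
import re
import time

# ASSUMPTION: a naive stand-in tokenizer (word characters or single
# punctuation marks); real tokenizers for German handle far more cases.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into word tokens and individual punctuation tokens."""
    return TOKEN_RE.findall(text)


def tokens_per_second(tokenizer, text: str, repeats: int = 5) -> float:
    """Throughput based on the fastest of several timed runs."""
    best = float("inf")
    n_tokens = 0
    for _ in range(repeats):
        start = time.perf_counter()
        n_tokens = len(tokenizer(text))
        best = min(best, time.perf_counter() - start)
    return n_tokens / best if best > 0 else float("inf")


sample = "Das Wörterbuch ist gut, oder? " * 10_000
print(tokenize("Das Wörterbuch ist gut, oder?"))
print(f"{tokens_per_second(tokenize, sample):.0f} tokens/s")
```

Taking the fastest of several runs reduces the influence of background load; a fuller comparison would also score each tool's output against a gold-standard tokenization.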
Words and their usages are in many cases closely related to or embedded in social, cultural, technical and ideological contexts. This applies not only to individual words and specific senses, but to many vocabulary zones as well. Moreover, the development of words is often related to aspects of socio-cultural evolution in a broad sense. In this paper I will look at traditional dictionaries and digital lexical systems, focussing on the question of how they deal with socio-cultural and discourse-related aspects of word usage. I will also make a number of suggestions for how future digital lexical systems might be enriched in this respect.
The public as linguistic authority: Why users turn to internet forums to differentiate between words
(2022)
This paper addresses the question of why we face unsatisfactory German dictionary entries when looking up and comparing two similar lexical terms that are loan words, new words, (near) synonyms, or confusables. It explains how users are aware of existing reference works but still search or post on language forums, often after consulting a dictionary and experiencing a range of dictionary-based problems. Firstly, these dictionary-based difficulties will be scrutinised in more detail with respect to content, function, presentation, and the language of definitions. Entries documenting loan words and commonly confused pairs from different lexical reference resources serve as examples to show the shortcomings. Secondly, I will explain why learning about your target group involves studying discussion forums. Forums are a valuable source for detailed user studies, enabling the examination of different communicative needs, concrete linguistic questions, speakers’ intuitions, and people’s reactions to posts and comments. Thirdly, with the help of two examples I will describe how the study of chats and forums had a major impact on the development of a recently compiled German dictionary of confusables. Finally, that same problem-solving approach is applied to the idea of a future dictionary of neologisms and their synonyms.
This paper first attempts a state-of-the-art overview of what is known about women in the history of lexicography up to the early twentieth century. It then focusses more closely on the German and German-English lexicographical traditions to 1900, examining them from three different perspectives (following Russell’s 2018 study of women in English lexicography): women as users and dedicatees of dictionaries; women as contributors to and compilers of lexicographical works; and (in a very preliminary way) women and female sexuality as represented in German/English bilingual dictionaries of the eighteenth and early nineteenth centuries. Russell (2018) was able to identify some 24 dictionaries invoking women as patrons, dedicatees or potential users before 1700, and some 150 works in English lexicography by women between 1500 and 1900, besides the contribution of hundreds of women as supporters and helpers, not least as unpaid readers and sub-editors for the Oxford English Dictionary. Equivalent research in other languages is lacking, but this paper presents some of the known examples of women as lexicographers. The evidence tends to support Russell’s finding for English, that women were more likely to find a place in lexicography outside the mainstream: sometimes in a more private sphere (like Hester Piozzi); often in bilingual lexicography (such as Margrethe Thiele, working on a Danish-French dictionary), including missionary and/or colonizing activity (such as Cinie Louw in Africa, Daisy Bates in Australia); and in dialect description (Coronedi Berti in Italy, Luisa Lacal and María Moliner in Spain). Within the German-speaking context, women who participated in lexicographical work themselves are hard to identify before the late nineteenth century, though those few women who did have access to education were often engaged in language learning, including translation activity, and they were likely users of bilingual and multilingual dictionaries.
Christian Ludwig’s (1706) English-German dictionary – the first of its kind – was dedicated to the Electoral Princess Sophia of Hanover. Elizabeth Weir may have been the first named female compiler of a German dictionary, with her bilingual New German Dictionary (1888). Rather better known are the cases of Agathe Lasch and Luise Berthold, who, as pioneering women in the field of German linguistics, ultimately led major lexicographical projects documenting German regional varieties in the first half of the twentieth century (Middle Low German and Hamburgish in the case of Lasch; the Hessen-Nassau dialect dictionary in the case of Berthold). In the light of existing research on gender and sexuality in the history of English lexicography (e. g. Iamartino 2010; Turton 2019), I conclude with a preliminary exploration of how women and sexuality have been represented in dictionaries of German and English, taking the words Hure and woman in bilingual German-English dictionaries of the eighteenth and nineteenth centuries as my case studies.
This paper focuses on the treatment of culture-bound lexical items in a novel type of online learner’s dictionary model, the Phrase-Based Active Dictionary (PAD). A PAD has a strong phraseological orientation: each meaning of a word is exclusively defined in a typical phraseological context. After introducing the relevant theory of realia in translation studies, we develop a broader notion of culture-specific lexical items which is more apt to serve the purposes of learner’s lexicography and thus to satisfy the needs of a larger and often undefined target group. We discuss the treatment of such words and expressions in common English learner’s dictionaries and then present various excerpts from PAD entries in English, German, and Italian which display different strategies for coping with cultural contents in the lexicon. Our aim is to demonstrate that the phraseological approach at the core of the PAD model turns out to be extremely important for conveying cultural knowledge in a way that lets users fully grasp cultural implications in language.
In foreign language teaching the use of dictionaries, especially bilingual, has always been related to the hypotheses concerning the relationship between the native language (L1) and second language acquisition method. If the bilingual dictionary was an obvious tool in the grammar-translation method, it was banned from the classroom in the direct, audiolingual and audiovisual methods. Also in the communicative method, foreign language learners are discouraged from using a dictionary. Its use should not obstruct the goals of communicatively oriented foreign language learning – a view still held by many foreign language teachers. Nevertheless, the reality has been different: Foreign language learners have always used dictionaries, even if they no longer possess a print dictionary and mainly use online resources and applications. Dictionaries and online resources will continue to play an important role in the future. In the Council of Europe’s language policy, with its emphasis on multilingualism and lifelong learning, the adequate use of reference tools as a strategic skill is highlighted. In several European countries, educational guidelines refer to the use of dictionaries in the context of media literacy, both in mother tongue and foreign language teaching. Not only is their adequate use important, but so too is the comparison, assessment and evaluation of the information presented, in order to develop Language Awareness and Language Learning Awareness. This is good news. However, does this mean that dictionaries are actually used in class? What role do dictionaries play in foreign language teaching in schools and universities? Are foreign language learners in the digital era really competent users? And how competent are their teachers? Are they familiar with the current (online) dictionary landscape? Can they support their students? 
After a more in-depth study of the status quo of dictionary use by foreign language learners and teachers and the gap between their needs and the reality, this contribution discusses the challenges facing lexicographers and meta-lexicographers and what educational policy measures are necessary to make their efforts worthwhile in turning foreign language learners – and their teachers – into competent users in a multilingual and digital world.
The aim of this paper is to show how lexicographical choices reflect ideological thinking, which Eagleton (2007) divides into the strategies of rationalizing, legitimating, action orienting, unifying, naturalizing and universalizing. This is done by examining two twenty-first-century editions of each of the five English monolingual learner’s dictionaries published by Cambridge, Collins, Longman, Macmillan, and Oxford. The synchronic and diachronic analyses of the dictionaries and their different editions at the macrostructural level (the wordlists) and at the microstructural level (the definitional styles) will show how the reduction and change of data, derived from heterogeneous social and cultural contexts of language use, to abstract essential forms, involves decisions about the central and peripheral aspects of the lexicon and the meaning of words.
Applying terminological methods to lexicography helps lexicographers deal with the terms occurring in general language dictionaries, especially when it comes to writing the definitions of concepts belonging to special fields. In the context of the lexicographic work of the Dicionário da Língua Portuguesa, an updated digital version of the last Academia das Ciências de Lisboa dictionary, published in 2001, we have assumed that terminology – in its dual dimension, both linguistic and conceptual – and lexicography are complementary in their methodological approaches. Both disciplines deal with lexical items, which can be lexical units or terms. In this paper, we apply terminological methods to improve the treatment of terms in general language dictionaries and to write definitions as a form of achieving more precision and accuracy, and also to specify the domains to which they belong. Additionally, we highlight the consistent modelling of lexicographic components, namely the hierarchy of domain labels, as they are term identification markers instead of a flat list of domains. The need to create and make available structured, organised and interoperable lexicographic resources has led us to follow a path in which the application of standards and best practices of treating and representing specialised lexicographic content are fundamental requirements.
In a multilingual and multicultural society, dictionaries play an important role in enhancing interlingual communication. A diversity of languages and different levels of dictionary culture demand innovative lexicographic approaches to establish a dictionary landscape that responds to the needs of the various speech communities. Focusing on the South African situation, this paper discusses some aspects of a few dictionaries that contributed to an improvement of the local dictionary landscape. Using the metaphors of bridges, dykes and sluice gates, it is shown how lexicographers need a balanced approach in their lemma selection and treatment. Whilst an overly prescriptive approach can be to the detriment of the macrostructural selection, a lack of regulatory criteria could easily lead to a data overload. The lexicographer should strive to give a reflection of the actual language use and enable the users to retrieve the information that can satisfy their specific communication and cognitive needs. Such lexicographic products will enrich and improve the dictionary landscape.
Phonesthemes (Firth 1930) are sublexical constructions that have an effect on the lexico-grammatical continuum: they are recurring form-meaning associations that occur more often than by chance but not systematically (Abramova/Fernandez/Sangati 2013). Phonesthemes have been shown (Bergen 2004) to affect psycholinguistic language processing; they organise the mental lexicon. Phonesthemes appear over time to emerge as driven by language use as indexical rather than purely iconic constructions in the lexicon (Smith 2016; Bergen 2004; Flaksman 2020). Phonesthemes are acknowledged in construction morphology (Audring/Booij/Jackendoff 2017) as motivational schemas. Some phonesthemes also tend to have lexicographic acknowledgment, as shown by etymologist Liberman (2010), although this relevance and cohesion appears to be highly variable as we will show in this paper.
This paper describes a method for extracting collocation data from text corpora based on a formal definition of syntactic structures, which takes into account not only the POS-tagging level of annotation but also syntactic parsing (syntactic treebank model) and introduces the possibility of controlling the canonical form of extracted collocations based on statistical data on forms with different properties in the corpus. Specifically, we describe the results of extraction from the syntactically tagged Gigafida 2.1 corpus. Using the new method, 4,002,918 collocation candidates in 81 syntactic structures were extracted. We evaluate the extracted data sample in more detail, mainly in relation to properties that affect the extraction of canonical forms: definiteness in adjectival collocations, grammatical number in noun collocations, comparison in adjectival and adverbial collocations, and letter case (uppercase and lowercase) in canonical forms. The conclusion highlights the potential of the methodology used for the grammatical description of collocation and phrasal syntax and the possibilities for improving the model in the process of compilation of a digital dictionary database for Slovene.
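The idea of controlling the canonical form through corpus statistics can be illustrated with a minimal Python sketch. It is not the method used in the paper: the function name, the 0.8 threshold, and the toy data are all assumptions made for illustration. The sketch keeps the lemma-based form unless one surface form clearly dominates the observed occurrences.

```python
from collections import Counter


def canonical_form(observed_forms, lemma_form, min_share=0.8):
    """Pick a canonical form for one collocation candidate.

    observed_forms: surface forms of the collocation as found in the corpus.
    lemma_form: fallback form built from the lemmas of the components.
    min_share: assumed threshold; a surface form must account for at least
        this share of all occurrences to override the lemma-based form.
    """
    counts = Counter(observed_forms)
    if not counts:
        return lemma_form
    form, freq = counts.most_common(1)[0]
    share = freq / sum(counts.values())
    return form if share >= min_share else lemma_form


# Toy data (invented): a noun collocation that occurs mostly in the plural.
forms = ["human rights"] * 9 + ["human right"]
chosen = canonical_form(forms, lemma_form="human right")
```

With these invented counts the plural form accounts for 90% of occurrences, so it replaces the singular lemma-based form. The same comparison could be run per property (number, definiteness, degree of comparison, letter case) instead of per whole surface form.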
Recent years have seen a growing interest in linguistic phenomena that challenge the received division of labour between lexicon and grammar, and hence often fall through the cracks of traditional dictionaries and grammars. Such phenomena call for novel, pattern-based types of linguistic reference works (see various papers in Herbst 2019). The present paper introduces one such resource: MAP (“Musterbank argumentmarkierender Präpositionen”), a web-based, corpus-linguistic patternbank of prepositional argument structure constructions in German. The paper gives an overview of the design and functionality of the MAP prototype currently developed at the Leibniz Institute for the German Language in Mannheim. We give a brief account of the data and our analytic workflow, illustrate the descriptions that make up the resource and sketch available options for querying it for specific lexical, semantic and structural properties of the data.
In this paper, we present LexMeta, a metadata model for the description of human-readable and computational lexical resources in catalogues. Our initial motivation is the extension of the LexBib knowledge graph with the addition of metadata for dictionaries, making it a catalogue of and about lexicographical works. The scope of the proposed model, however, is broader, aiming at the exchange of metadata with catalogues of Language Resources and Technologies and addressing a wider community of researchers besides lexicographers. For the definition of the LexMeta core classes and properties, we deploy widely used RDF vocabularies, mainly Meta-Share, a metadata model for Language Resources and Technologies, and FRBR, a model for bibliographic records.
This paper presents a multilingual dictionary project of discourse markers. During its first stage, consisting of collecting the list of headwords, we used a parallel corpus to automatically extract units from texts written in Spanish, Catalan, English, French and German. We also applied a method to create a taxonomy structure for automatically organising the markers in clusters. As a result, we obtain an extensive, corpus-driven list of headwords. We present a prototype of the microstructure of the dictionary in the form of a standard XML database and describe the procedure to automatically fill in most of its fields (e.g., the type of DM, the equivalents in other languages, etc.), before human intervention.
In this paper we present Trendi, a monitor corpus of written Slovene, which has been compiled recently as part of the SLED (Monitor corpus and related resources) project. The methodology and the contents of the corpus are presented, as well as the findings of the survey that aimed to identify the needs of potential users related to topical language use. The Trendi corpus currently contains news articles and other web content from 110 different sources, with the texts being collected and linguistically annotated on a daily basis. The corpus complements Gigafida 2.0, a 1.13-billion-word reference corpus of standard written Slovene. Also discussed are the ways in which the corpus will be integrated into various lexicographic projects, helping not only in the identification of neologisms but also in monitoring changes in already identified language phenomena.
In this paper, we propose a controlled language for authoring technical documents and report the status of its development, while maintaining a specific focus on the Japanese automotive domain. To reduce writing variations, our controlled language not only defines approved and unapproved lexical elements but also prescribes their preferred location in a sentence. It consists of components of a) case frames, b) case elements, c) adverbial modifiers, d) sentence-ending functions, and e) connectives, which have been developed based on the thorough analyses of a large-scale text corpus of automobile repair manuals. We also present our prototype of a writing assistant tool that implements word substitution and reordering functions, incorporating the constructed controlled language.
Many European languages have undergone considerable changes in orthography over the last 150 years. This hampers the application of modern computer-based analysers to older text, and hence computer-based annotation and studies of text collections spanning a long period. As a step towards a functional analyser for Norwegian texts (Nynorsk standard) from the 19th century, funding was granted in 2020 for creating a full form generator for all inflected forms of headwords found in Ivar Aasen’s dictionary published in 1873 (Aasen 1873) and his grammar from 1864 (Aasen 1864). Creating this word bank led to new insights into Aasen (1873), its structure, internal organisation, and ambition level as well as its link to Aasen (1864). As a test, the full form list generated from this new word bank was used to analyse the word inventory of texts by Aa. O. Vinje, written in the period 1850–1870. The Vinje texts were also analysed using a full form list of modern standard Norwegian, to study the differences in applicability and see how Vinje’s language relates to the written standard of modern Norwegian.
The paper presents the process of developing the AirFrame database, a specialized lexical resource in which aviation terminology is defined in the form of semantic frames, following the methodology of the Berkeley FrameNet (FN). First, the structure of the database is presented, and then the methodology applied in developing and populating the database is described. The link between specialized aviation frames and general language semantic frames, of which frames defining entities, processes, attributes and events are particularly relevant, is discussed on the example of the semantic frame of Flight and its related frames. The paper ends with discussing possibilities of using AirFrame as a model for further developing resources in which general and specialized knowledge are linked.
In the ongoing retro-digitization of Serbian dialectal dictionaries, the biggest obstacle is the lack of machine-readable versions of the paper editions. One essential step, lacking until now, is therefore needed before venturing into the dictionary-making process in the digital environment: OCRing the pages with the highest possible accuracy. OCR processing is not a new technology, as many open-source and commercial software solutions can reliably convert scanned images of paper documents into digital documents. Available software solutions are usually efficient enough to process scanned contracts, invoices, financial statements, newspapers, and books. In cases where it is necessary to process documents that contain accented text and precisely extract each character with diacritics, such software solutions are not efficient enough. This paper presents the OCR software called “SCyDia”, developed to overcome this issue. We demonstrate the organizational structure of the OCR software “SCyDia” and the first results. “SCyDia” is a web-based software solution that relies on the open-source software “Tesseract” in the background. “SCyDia” also contains a module for semi-automatic text correction. We have already processed over 15,000 pages, 13 dialectal dictionaries, and five dialectal monographs. At this point in our project, we have analyzed the accuracy of “SCyDia” by processing 13 dialectal dictionaries. The results were analyzed manually by an expert who examined a number of randomly selected pages from each dictionary. The preliminary results show great promise, with accuracy ranging from 97.19% to 99.87%.
Wortgeschichte digital (Digital Word History) is an emerging historical dictionary of the German language that focuses on describing semantic shifts from about 1600 through today. This article provides deeper insight into the dictionary’s “cross-reference clusters,” one of its software tools that performs visualization of its reference network. Hence, the clusters are a part of the project’s macrostructure. They serve as both a means for users to find entries of interest and a tool to elucidate relations among dictionary entries. Rather than delve into technical aspects, this article focuses on the applied logics of the software and discusses the approach in light of the dictionary’s microstructure. The article concludes with some considerations about the clusters’ advantages and limitations.
While there was arguably a need for multi-authored, multi-volume metalexicographic handbooks three decades ago – when the field of metalexicography was still ‘young’ – it is a bit puzzling to make sense of the current output flurry in this field. Is it simply a matter of ‘every publisher trying to fill its shelves’? Or is there really a need in the scientific community for more and (continuously) updated reference works? And once available, are such works also consulted? Which parts? By whom? How often? For what purposes? In this paper we look at an ongoing, real-world metalexicographic handbook project to answer these questions.
This paper presents the project “The first Romanian bilingual dictionaries (17th century). Digitally annotated and aligned corpus” (eRomLex), which deals with the editing of the first bilingual Romanian dictionaries. The aim of the project is to compile an electronic corpus comprising six Slavonic-Romanian lexicons dating from the 17th century, based on their relatedness and the fact that they follow a common model, in order to highlight the characteristics of this lexicographical network (the affiliations between the lexicons, the way they relate to the source, the innovations towards it, their potential uses) and to facilitate access to their content. A digital edition allows exhaustive data extraction and comparison as well as linking with other digitized resources for old Romanian or Church Slavonic, including dictionaries. After presenting the corpus, we point to the necessary stages in achieving this project, the techniques used to access the material and the challenges and obstacles we encountered along the way. We describe how the corpus was created, stored and indexed, and how it can be searched; we will also present and discuss some statistical analyses highlighting relations between the Romanian lexicons and their Slavonic-Ruthenian source.
This think-aloud study charts the use of online resources by five final-year MA students in Nordic and Literacy Studies based on the analysis of screen and audio recordings of an error-correction task. The article briefly presents some linguistic features of Norwegian Nynorsk that are not common in the context of other European languages, that is, norm optionality with regard to inflection and spelling. While performing the task, the participants were allowed to use all digital aids. This article examines their resource consultation behavior, and it makes use of Laporte/Gilquin’s (2018) annotation protocol. The following research questions are posed: What online resources are used by the students? What characterizes the use? Are online resources helpful? This study provides new insights into an as yet little explored topic within the Norwegian context. The findings demonstrate that the participants relied heavily on the official monolingual dictionary Nynorskordboka. Indeed, the dictionary was helpful in the vast majority of the searches, either resulting in error improvement or the validation of a word; that is, many of the searches concerned correct words. The findings suggest severe norm insecurity and emphasize the need to improve norm knowledge and metalinguistic knowledge as prerequisites for better utilization of aids. The study also suggests including necessary information on norm optionality and other commonly queried issues in the dictionary architecture.
This paper describes a method for automatic identification of sentences in the Gigafida corpus containing multi-word expressions (MWEs) from the list of 5,242 phraseological units, which was developed on the basis of several existing open-access lexical resources for Slovene. The method is based on a definition of MWEs, which includes information on two levels of corpus annotation: syntax (dependency parsing) and morphology (POS tagging), together with some additional statistical parameters. The resulting lexicon contains 12,358 sentences containing MWEs extracted from the corpus. The extracted sentences were analysed from the lexicographic point of view with the aim of establishing canonical forms of MWEs and semantic relations between them in terms of variation, synonymy, and antonymy.
The paper presents the results of a survey on lexicographic practices and lexicographers’ needs across Europe that was conducted in the context of the Horizon 2020 project European Lexicographic Infrastructure (ELEXIS) among the observer institutions of the project. The survey is a revised and upgraded version of the survey which was originally conducted among ELEXIS lexicographic partner institutions in 2018 (Kallas et al. 2019a). The main goal of this new survey was to complement the data from the ELEXIS lexicographic partner institutions in order to get a more complete picture of lexicographic practices both for born-digital and retro-digitised resources in Europe. The results offer a detailed insight into many aspects of the lexicographic process at European institutions, such as funding, training, staff, lexicographic expertise, software and tools. In addition, the survey reflects on current trends in lexicography and reveals what institutions see as the most important emerging trends that will affect lexicography in the short-term and long-term future. Overall, the results provide valuable input informing the development of tools, resources, guidelines and training materials within ELEXIS.
Given the relevance of interoperability, born-digital lexicographic resources as well as legacy retro-digitised dictionaries have been using structured formats to encode their data, following guidelines such as the Text Encoding Initiative or the newest TEI Lex-0. While this new standard is being defined in a stricter approach than the original TEI dictionary schema, its reuse of element names for several types of annotation as well as the highly detailed structure makes it difficult for lexicographers to efficiently edit resources and focus on the real content. In this paper, we present the approach designed within LeXmart to facilitate the editing of TEI Lex-0 encoded resources, guaranteeing consistency through all editing processes.
The long road to a historical dictionary of Lower Sorbian. Towards a lexical information system
(2022)
The Sorbian Institute has been taking preparatory steps for a historical-documentary vocabulary information system for Lower Sorbian for about 10 years. To this end, the entire extant written material (16th–21st centuries) of this strongly endangered European minority language is to be systematically evaluated. An attempt made a few years ago to organise and finance the project as a long-term scientific project was not successful in the end. Therefore, it can only be advanced step by step and via some detours. The article informs about the interim status of the project, especially with respect to the creation of a reliable database.
This paper reports on the restructuring of a bilingual (Greek Sign Language, GSL – Modern Greek) lexicographic database with the use of the WordNet semantic and lexical database. The relevant research was carried out by the Institute for Language and Speech Processing (ILSP) / Athena R.C. team within the framework of the European project Easier. The project will produce a framework for intelligent machine translation to bring down language barriers among several spoken/written and sign languages. This paper describes the experience of the ILSP team in contributing to a multilingual repository of signs and their corresponding translations and in organizing and enhancing a bilingual dictionary (GSL – Modern Greek) as a result of this mapping; the latter is the main focus of this paper. The methodology followed relies on the use of WordNet and, more specifically, the Open Multilingual WordNet (OMW) tool to map content in GSL to WordNet synsets.
There is a growing interest in pedagogical lexicography, and more specifically in the study of dictionary users’ abilities and strategies (Prichard 2008; Gavriilidou 2010, 2011; Gavriilidou/Mavrommatidou/Markos 2020; Gavriilidou/Konstantinidou 2021; Chatjipapa et al. 2020). The purpose of this presentation is to investigate dictionary use strategy and the effect of an explicit and integrated dictionary awareness intervention program on upper elementary pupils’ dictionary use strategies according to gender and type of school. A total of 150 students from mainstream and intercultural schools, aged 10–12 years old, participated in the study. Data were collected before and after the intervention through the Strategy Inventory for Dictionary Use (SIDU) (Gavriilidou 2013). The results showed a significant effect of the intervention program on Dictionary Use Strategies employed by the experimental group and support the claim that increased dictionary use can be the outcome of explicit strategy instruction. In addition, the effective application of the program suggests that a direct and clear presentation of DUS is likely to be more successful than an implicit presentation. The present study contributes to the discussion concerning both the ‘teachability’ of dictionary use strategies and skills and the effective forms of intervention programs raising dictionary use awareness and culture.
Wortgeschichte digital (‘digital word history’) is a new historical dictionary of New High German, the most recent period of German reaching from approximately 1600 AD up to the present. By contrast to many historical dictionaries, Wortgeschichte digital has a narrated text – a “word history” – at the core of its entries. The motivation for choosing this format rather than traditional microstructures is briefly outlined. Special emphasis is put on the way these word histories interact with other components of the dictionary, notably with the quotation section. As Wortgeschichte digital is an online-only project, visualizations play an important role for the design of the dictionary. Two examples are presented: first, the “quotation navigator” which is relevant for the microstructure of the entries, and, second, a timeline (“Zeitstrahl”) which is part of the macrostructure as it gives access to the lemma inventory from a diachronic point of view.
This paper discusses an investigation of how senses are ordered across eight dictionaries. A dataset of 75 words was used for this purpose, and two senses were examined for each word. The words are divided into three groups of 25 words each according to the relationship between the senses: Homonymy, Metaphor, and Systematic Polysemy. The primary finding is that WordNet differs from the other dictionaries in terms of Metaphor. The order of the senses was more often figurative/literal, and it had the highest percentage of figurative senses that were not found. We discuss leveraging another dictionary, COBUILD, to re-order the senses according to frequency.
Head alignment in German compounds: Implications for prosodic constituency and morphological parsing
(2022)
The notion of head alignment was introduced to account for the observation that in a word with multiple feet, one is more prominent than the others. In particular, this notion is meant to capture the characteristic edge-orientation of main stress by requiring the (left or right) word boundary and the respective (left or right) boundary of the head foot to coincide (McCarthy & Prince 1993). In the present paper the notion of head alignment will be applied to compounds, which are also characterized by the property that one of their members, located in a margin position, is most prominent.
The adequacy of an analysis in terms of head alignment hinges on the question of whether observable prominence peaks associate with the boundaries of independently motivated constituents. It will be argued that such links exist for German compounds, indicating reference to at least three distinct compound categories established on morphological grounds: copulative, phrasal, and a default class of “regular” compounds. The evidence for the relevant distinctions sheds light on morphological parsing, indicating that compound categories can be – and often are – determined by properties pertaining to their complete form, rather than by conditions affecting their (original) construction.
The digital environment represents a qualitatively new level of service for research work with linguistic information presented in dictionary form. This applies first of all to index systems. By dictionary indexing we mean a set of formalized rules and procedures on the basis of which it is possible to obtain information about certain linguistic facts recorded in the dictionary. These rules are implemented in the form of user interfaces. However, the automatic construction of index schemes for a digital dictionary is effective only in a sufficiently formalized environment. This article describes the method and technology of indexing the Etymological Dictionary of the Ukrainian Language (EDUL). For the language indexing of the dictionary, a special computer instrumental system (VLL – virtual lexicographic laboratory) was developed, adapted to the structure of the EDUL, and focused on the creation of indexes in automatic mode. The digital implementation of the EDUL made it possible to access the entire corpus of the dictionary text regardless of the time of publication of the corresponding volume and opened up opportunities for various digital interpretations of etymological information.
Lexical data API
(2022)
This API provides data from various dictionary resources of K Dictionaries across 50 languages. It is used by language service providers, app developers, and researchers, and returns data as JSON documents. A basic search result consists of an object containing partial lexical information on entries that match the search criteria, but further in-depth information is also available. Basic search parameters include the source resource, source language, and text (lemma), and the entries are returned as objects within the results array. It is possible to look for words with specific syntactic criteria, specifying the part of speech, grammatical number, gender and subcategorization, monosemous or polysemous entries. When searching by parameters, each entry result contains a unique entry ID, and each sense has its own unique sense ID. Using these IDs, it is possible to obtain more data – such as syntactic and semantic information, multiword expressions, examples of usage, translations, etc. – of a single entry or sense. The software demonstration includes a brief overview of the API with practical examples of its operation.
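The two-step pattern described above (search first, then use entry and sense IDs to request more detail) can be sketched as follows. This is only an illustration: the JSON shape, the field names (`results`, `id`, `headword`, `pos`, `senses`) and the sample IDs are assumptions made for the example, not the documented schema of the API.

```python
import json

# Hypothetical search response. Field names and IDs are illustrative
# assumptions, not the API's documented schema.
SAMPLE_RESPONSE = json.loads("""
{
  "results": [
    {"id": "EN_0001", "headword": "bank", "pos": "noun",
     "senses": [{"id": "EN_0001_s1"}, {"id": "EN_0001_s2"}]},
    {"id": "EN_0002", "headword": "bank", "pos": "verb",
     "senses": [{"id": "EN_0002_s1"}]}
  ]
}
""")


def collect_ids(response, pos=None):
    """Map each entry ID to its sense IDs, optionally filtering by part of speech."""
    ids = {}
    for entry in response.get("results", []):
        if pos is not None and entry.get("pos") != pos:
            continue
        ids[entry["id"]] = [sense["id"] for sense in entry.get("senses", [])]
    return ids


def is_polysemous(entry):
    """An entry with more than one sense counts as polysemous."""
    return len(entry.get("senses", [])) > 1


noun_ids = collect_ids(SAMPLE_RESPONSE, pos="noun")
polysemous = [e["id"] for e in SAMPLE_RESPONSE["results"] if is_polysemous(e)]
```

The collected IDs would then be used in follow-up requests for a single entry or sense; how those requests are formed depends on the API's actual documentation.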
Word Families in Diachrony. An epoch-spanning structure for the word families of older German
(2022)
The ‘Word Families in Diachrony’ project (WoDia), for which a funding application to the DFG is in preparation, aims to provide a database driven online research environment that will enable processes of change in the entire historical vocabulary of German to be investigated by focusing on the changes in word families and the individual means of word formation. WoDia will embed the vocabularies of Old High German (OHG), Middle High German (MHG), Old Saxon (OS), and Middle Low German (MLG) in a database, resulting in a word-family structure for High and Low German from the beginnings up to the 15th century (for High German) and up to the 17th century (for Low German). The basis of the vocabulary is provided by reference dictionaries of the four historical varieties, whereas the word families’ historical structure is based on the word-family dictionary of OHG by Jochen Splett (1992). Each lemma in the database will be assigned, where appropriate, to a word family. The individual word-formation elements and the word-formation hierarchy will be mapped in a structural formula. The etymologically corresponding lemmas and word families of the different periods/varieties of older German will be linked so that an analysis across the varieties will also be possible. The annotations of word families in the database (e. g., relating to word structure) will be supplemented by linking their lemmas to the online dictionaries and to the reference corpora of Old German (OS and OHG), MHG, and MLG.
The purpose of this paper is to present the lexicographic protocol and to report on the progress of compilation of Mikaela_Lex, a free online monolingual Greek school dictionary for upper elementary students with visual impairments, comprising 4,000 lemmata. The dictionary is equipped with new digital tools, such as a “Braille-system” keyboard, a “speech-to-text” tool, and a “text-to-speech” tool, and also offers QWERTY accessibility for visually non-impaired students.
Looking up an unknown word is the most frequent use of a dictionary. For languages that are both agglutinative and inflectional, such as Georgian, this can be quite challenging because an inflected form can be very far from the lemmas used by the target dictionary. In addition, there is no consensus among Georgian lexicographers on which lemmas represent a verb in dictionaries, which further complicates dictionary access. Kartu-Verbs is a base of inflected forms of Georgian verbs accessible by a logical information system. It currently contains more than 5 million inflected forms related to more than 16,000 verbs for 11 tenses; each form can have 11 properties; there are more than 80 million links in the base. This demonstration shows how, from any inflected form, we can find the relevant lemma to access any dictionary. Kartu-Verbs can thus be used as a front-end to any Georgian dictionary.
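The idea of going from an inflected form to whichever lemma a given dictionary uses can be illustrated with a simple inverted index. This is a minimal sketch, not the logical information system behind Kartu-Verbs: the record layout, the property names and the two lemma conventions (`masdar`, `3sg_present`) are assumptions chosen for the example, and the two records are toy data.

```python
from collections import defaultdict

# Toy records: (inflected form, lemma per convention, grammatical properties).
# The layout and convention names are hypothetical.
RECORDS = [
    ("ვწერ", {"masdar": "წერა", "3sg_present": "წერს"},
     {"person": 1, "number": "sg", "tense": "present"}),
    ("წერს", {"masdar": "წერა", "3sg_present": "წერს"},
     {"person": 3, "number": "sg", "tense": "present"}),
]


def build_index(records):
    """Index every inflected form by its surface string."""
    index = defaultdict(list)
    for form, lemmas, props in records:
        index[form].append({"lemmas": lemmas, "props": props})
    return index


def lemmas_for(index, form, convention):
    """Return the lemma(s) of a form under the convention a target dictionary uses."""
    hits = index.get(form, [])
    return sorted({hit["lemmas"][convention] for hit in hits})


INDEX = build_index(RECORDS)
masdar_lemmas = lemmas_for(INDEX, "ვწერ", "masdar")
finite_lemmas = lemmas_for(INDEX, "ვწერ", "3sg_present")
```

Storing one lemma per convention is one way to cope with the lack of consensus among dictionaries: the same inflected form resolves to a different headword depending on which dictionary is being consulted.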
We describe the status of work aimed at including sign language lexical data within the OntoLex-Lemon framework. Our general goal is to provide a multimodal extension to this framework, which was originally conceived for covering only the written and phonetic representation of lexical data. In the longer term, our aim is to achieve the same type of semantic interoperability between sign language lexical data as is achieved for their spoken or written counterparts. We also want to achieve this goal across modalities: between sign language lexical data and spoken/written lexical data.
Lexicographers working with minority languages face many challenges. When the language in question is also a sign language, circumstances specific to the visual-spatial modality have to be taken into consideration as well. In this paper, we aim to show and discuss which challenges we encounter while compiling the Digitales Wörterbuch der Deutschen Gebärdensprache (DW-DGS), the first corpus-based dictionary of German Sign Language (DGS). Some parallel the challenges minority language lexicographers of spoken languages encounter, e. g. few resources, no written tradition, and having to create one dictionary for all potential user groups, while others are specific to sign languages, e. g. representation of visual-spatial language and creating access structures for the dictionary.
This paper consists of a short analysis of the sources and the treatment of the legal lexicon in the first dictionary published by the Spanish Royal Academy (1726–1739), followed by a longer commentary on the representation and the treatment of the concept of judge, focusing on how extralinguistic factors are reflected in the definitions. The results highlight the relevance of the legal context of that era for the treatment of the lexicon related to the legal domain, but they also demonstrate the pattern in which the lexicographic data displays peculiarities of legal matters.
Basnage’s revision (1701) of Furetiere’s Dictionnaire universel is profoundly different from Furetiere’s work in several regards. One of the most noticeable features of the dictionary lies in his increased use of usage labels. Although Furetiere already made use of usage labels (see Rey 1990), Basnage gives them a prominent role. As he states in the preface to his edition, a dictionary that aspires to the title of “universal” should teach how to speak in a polite way (“poliment”), right (“juste”) and making use of specific terminology for each art. He specifies, lemma by lemma, the diaphasic dimension by indicating the word’s register and context of use, the diastratic one by noting the differences in the use of the language within the social strata, the diachronic evolution by indicating both archaisms and neologisms, the diamesic aspect by highlighting the gaps between oral and written language, the diatopic one by specifying either foreign borrowings or regionalisms.
After extracting the entries containing formulas such as “ce mot est...”, “ce terme est...” and similar ones, we compare the number of entries and the type of information provided by the two lexicographers. In this paper, we will focus on Basnage’s innovative contribution. Furthermore, we will try to identify the lexicographer’s sources, i. e. we will try to establish on which grammars, collections of linguistic remarks or contemporary dictionaries Basnage bases his judgements.
This paper examines a certain subset of the vocabulary of Modern Icelandic, namely those words that are labelled as ‘ancient’ in the Dictionary of Contemporary Icelandic (DCI). The words were analysed and grouped into two main categories: 1) words with only ‘ancient’ sense(s) and 2) words that have a modern as well as an obsolete older sense. Several subgroups were identified as well as some lexical characteristics. The words in question were then analysed in two other sources, the Dictionary of Old Norse Prose (ONP) and the Icelandic Gigaword Corpus (IGC). The results show that the words belong to several semantic domains that reflect the types of texts that have survived until modern times. Most of the words are robustly attested in Old Norse sources, although there are a few exceptions. A large majority of the words can be found in Modern Icelandic texts, but to a varying degree. The limits of the corpus material make it difficult to analyse some of the words. The results indicate that the words labelled ‘ancient’ can be divided into three main groups: a) words that are poorly attested and should perhaps not be included in the lexicographic description of Modern Icelandic; b) words that are likely to occur sometimes in Modern Icelandic; c) words that function as other inherited Old Norse words and perhaps do not require a special label or should have an additional sense in the DCI.
An ongoing academic and research program, the “Vocabula Grammatica” lexicon, implemented by the Centre for the Greek Language (Thessaloniki, Greece), aims at lemmatizing all the philological, grammatical, rhetorical, and metrical terms in the written texts of scholars (philologists and scholiasts) who curated the ancient Greek literature from the beginning of the Hellenistic period (4th/3rd c. BC) until the end of the Byzantine era (15th c. AD). In particular, it aspires to fill serious gaps (a) in the study of ancient Greek scholarship and (b) in the lexicography of the ancient Greek language and literature. By providing specific examples, we will highlight the typical and methodological features of the forthcoming dictionary.
This paper focuses on the first Slavonic-Romanian lexicons, compiled in the second half of the 17th century, and their use(rs), proposing a method of investigating the manner in which lexical information available in the above corpus relates, if at all, to the vocabulary of texts from the same period. We chose to investigate their relation to an anonymous Old Testament translation made from Church Slavonic, also from the second half of the 17th century, which was supposed to be produced in the same geographical area, in the same Church Slavonic school or even by the same author as the lexicons. After applying a lemmatizer to both the Biblical text (Books of Genesis and Daniel) and the Romanian material from the lexicons, we analyse the results and complement the statistical analysis with a series of case studies, focusing on some common lexemes that might be an indicator of the relatedness of the texts. Even if the analysis points out that the lexicons might not have been compiled as a tool for the translation of religious texts, it proves to be a useful method that reveals interesting data and provides the basis for more extensive approaches.
This paper aims to verify whether the most important online Brazilian Portuguese dictionaries include some of the neologisms identified in texts published from the 1990s to the 2000s, formed with the elements ciber-, e-, bio-, eco- and narco-, which we refer to as fractomorphemes / fracto-morphèmes. Three online dictionaries were analyzed (Aulete, Houaiss and Michaelis), as well as Vocabulário Ortográfico da Língua Portuguesa (VOLP). We were able to conclude that all three dictionaries and VOLP include neologisms with these elements; Michaelis and VOLP do not include separate entries for bound morphemes, whereas Houaiss includes entries for all of them and Aulete includes entries for bio-, eco- and narco-. Aulete also describes the neological meaning of eco- and narco-, whereas Houaiss does not.
In the etymological information for a word in a dictionary, the first question to be answered is whether the word is a borrowing or the result of word formation. Here, we consider this question for internationalisms ending in -ation in German and in -ácia in Slovak. In German, -ation is a suffix that attaches to verbs in -ieren. For these verbs, it is in competition with -ung. In Slovak, -ácia is a suffix that attaches to bases of Latin or Greek origin. The corresponding verbs are often backformations. Most Slovak verbs also have a nominalization in -nie. In order to investigate to what extent the nouns in -ation or -ácia are borrowings or derived from the corresponding verbs in German and Slovak, we took a random sample of English nouns in -ation for which OED gives a corresponding verb. For this sample, we checked whether the cognate noun in -ation or -ácia is attested in standard dictionaries and in corpora. Then we did the same for the corresponding verbs and the nouns in -ung or -nie. Finally, we checked the frequency of these words in DeReKo for German and SNK for Slovak. On this basis, we found evidence that -ation in German has a slightly different status to -ácia in Slovak. This status affects the relationship to the corresponding verbs and to the nouns in -ung or -nie. Such generalizations are important as background information for specifying etymological information in dictionaries, especially for languages where first attestation dates are not readily available.
Thesauri have long been recognized as valuable structured resources aiding Information Retrieval systems. A thesaurus provides a precise and controlled vocabulary which serves to coordinate data indexing and retrieval. The paper presents a bilingual Greek and English specialized thesaurus that is being developed as the backbone of a platform aimed at enhancing and enriching the cultural experiences of visitors in Eastern Macedonia and Thrace, Greece. The cultural component of the intended platform comprises textual data, images of artifacts and living entities (animals and plants in the area), as well as audio and video. The thesaurus covers the domains of Archaeology, Literature, Mythology, and Travel; therefore, it can be viewed as a set of inter-linked thesauri. Where applicable, terms and names in the database are also geo-referenced.
This paper looks at whether, after two decades of corpus building for the Bantu languages, the time is ripe to begin using monitor corpora. As a proof-of-concept, the usefulness of a Lusoga monitor corpus for lexicographic purposes, in casu for the detection of neologisms, both in terms of new words and new meanings, is investigated and found useful.
This study examines a list of 3,413 neologisms containing one or more borrowed items, which was compiled using the databases built by the Korean Neologism Investigation Project. Etymological aspects and morphological aspects are taken into consideration to show that, besides the overwhelming prevalence of English-based neologisms, particular loans from particular languages play a significant role in the prolific formation of Korean neologisms. Aspects of the lexicographic inclusion of loan-based neologisms demonstrate the need for Korean neologism and lexicography research to broaden its scope in terms of methodology and attitudes, while also providing a glimpse of changes.
The EMLex Dictionary of Lexicography (= EMLexDictoL) is a plurilingual subject field dictionary (in German, English, Afrikaans, Galician, Italian, Polish and Spanish) that contains the basic subject field terminology of lexicography and dictionary research, in which the dictionary article texts are presented in a sophisticated but comprehensible form. The articles are supplemented by a complex cross-referencing system and the current subject field literature of the respective national languages. Following the lemma position, the dictionary articles contain items regarding morphology, synonymy, the position of the definiens, additional explanations, the cross-reference position, the position for literature, the equivalent terms in the other six languages of the dictionary as well as the names of the authors.
Action ascription can be understood from two broad perspectives. On one view, it refers to the ways in which actions constitute categories by which members make sense of their world, and forms a key foundation for holding others accountable for their conduct. On another view, it refers to the ways in which we accountably respond to the actions of others, thereby accomplishing sequential versions of meaningful social experience. In short, action ascription can be understood as a matter of categorisation of prior actions or responding in ways that are sequentially fitted to prior actions, or both. In this chapter, we review different theoretical approaches to action ascription that have developed in the field, as well as the key constituents and resources of action ascription that have been identified in conversation analytic research, before going on to discuss how action ascription can itself be considered a form of social action.
Action ascription is an emergent process of mutual displays of understanding. Usually, the kind of action that is ascribed to a prior turn by a next action remains implicit. Sometimes, however, actions are overtly ascribed, for example, when speakers expose the use of strategies. This happens particularly in conflictual interaction, such as public debates or mediation talks. In these interactional settings, one of the speakers’ goals is to discredit their opponents in front of other participants or an overhearing audience. This chapter investigates different types of overt strategy ascriptions in a public mediation: exposing the opponent’s use of rhetorical devices, exposing the opponent’s use of false premises, and exposing that an opponent is telling only a half-truth. This chapter shows how speakers use ascriptions of acting strategically as accusations to disclose their opponents’ intentions and ‘truths’ that the opponents allegedly conceal and that are detrimental to their position.
Meta-communicative practices are generally reflexive in a fairly obvious sense: Inasmuch as speakers use them to talk about or comment on earlier/subsequent talk, they use language self-reflexively. In this paper, we explore a practice that is reflexive not only in this meta-communicative sense but also in a sequential-interactional one: Prefacing a conversational turn with I was gonna say. We show that the I was gonna say-preface furnishes the following general semantic-pragmatic affordances: (1) It retroactively relates the speaker’s subsequent talk to preceding talk from a co-participant, (2) it embodies a claim to prior, now-preempted, communicative intent with regard to what their co-participant has (just) said/done, (3) it therefore displays its speaker’s orientation to the relevance or the appropriate placement of the action(s) done in their own subsequent talk at an earlier moment in the interaction, and (4) it reflexively re-invokes, or retrieves, this earlier moment as the relevant sequential context for their action(s). We then go on to illustrate how speakers draw on these sequentially reflexive affordances for managing recurrent interactional contingencies in specific sequential environments. The paper ends with a discussion of the role that reflexivity plays in and for the deployment of this practice.
Dictionaries have been part and parcel of literate societies for many centuries. They assist in communication, particularly across different languages, to aid in understanding, creating, and translating texts. Communication problems arise whenever a native speaker of one language comes into contact with a speaker of another language. At the same time, English has established itself as a lingua franca of international communication. This marked tendency gives lexicography of English a particular significance, as English dictionaries are used intensively and extensively by huge numbers of people worldwide.
The normative layer of CLARIN is, alongside the organizational and technical layers, an essential part of the infrastructure. It consists of the regulatory framework (statutory law, case law, authoritative guidelines, etc.), the contractual framework (licenses, terms of service, etc.), and ethical norms. Navigating the normative layer requires expertise, experience, and qualified effort. In order to advise the Board of Directors, a standing committee dedicated to legal and ethical issues, the CLIC, was created. Since its establishment in 2012, the CLIC has made considerable efforts to provide not only the BoD but also the general public with information and guidance. It has published many articles (both in proceedings of CLARIN conferences and in its own White Paper Series) and developed several LegalTech tools. It also runs a Legal Information Platform, where accessible information on various issues affecting language resources can be found.
This chapter explores possibilities and methods for conducting digital discourse analyses of National Socialist source texts. Digital technology is treated as a heuristic tool with which language use during National Socialism can be examined across larger corpora of sources. A theoretical section argues in principle for combining hermeneutic interpretation with broad corpus-based queries during the analysis process. The approach is illustrated with two empirical examples: using a corpus of speeches by Hitler and Goebbels, the chapter traces the emergence and discursive elaboration of the National Socialist concept of “Lebensraum”. Step by step, it sets out which analytical paths can be pursued by querying key texts, keywords, concordances, and collocations. The second example uses petitions that members of the public addressed to state and party authorities to show how such sources can be manually annotated with a digital tool so that they can then be evaluated for patterns in language use.
This paper presents a compositional annotation scheme to capture the clusivity properties of personal pronouns in context, that is their ability to construct and manage in-groups and out-groups by including/excluding the audience and/or non-speech act participants in reference to groups that also include the speaker. We apply and test our schema on pronoun instances in speeches taken from the German parliament. The speeches cover a time period from 2017-2021 and comprise manual annotations for 3,126 sentences. We achieve high inter-annotator agreement for our new schema, with a Cohen’s κ in the range of 89.7-93.2 and a percentage agreement of > 96%. Our exploratory analysis of in/exclusive pronoun use in the parliamentary setting provides some face validity for our new schema. Finally, we present baseline experiments for automatically predicting clusivity in political debates, with promising results for many referential constellations, yielding an overall 84.9% micro F1 for all pronouns.
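For readers unfamiliar with the two measures reported above, the following sketch computes Cohen’s κ for two annotators and a micro-averaged F1 for single-label predictions. It is a generic illustration with made-up toy labels (`incl`, `excl`), not the authors’ evaluation code; note that it returns values on a 0 to 1 scale, whereas the abstract reports them as percentages.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def micro_f1(gold, predicted):
    """Micro-averaged F1: pool TP, FP and FN over all labels before computing F1."""
    tp = fp = fn = 0
    for label in set(gold) | set(predicted):
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1
            elif g == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy data, invented for illustration only.
annotator_1 = ["incl", "incl", "excl", "excl"]
annotator_2 = ["incl", "incl", "excl", "incl"]
kappa = cohen_kappa(annotator_1, annotator_2)
f1 = micro_f1(annotator_1, annotator_2)
```

When every item carries exactly one label, micro F1 coincides with plain accuracy, which is why it is often reported alongside per-class scores rather than instead of them.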
Standards in CLARIN
(2022)
This chapter looks at a fragment of the ongoing work of the CLARIN Standards Committee (CSC) on producing a shared set of recommendations on standards, formats, and related best practices supported by the CLARIN infrastructure and its participating centres. What might at first glance seem to be a straightforward goal has over the years proven to be rather complex, reflecting the robustness and heterogeneity of the emerging distributed digital research infrastructure and the various disciplines and research traditions of the language-based humanities that it serves and represents, and therefore part of the chapter reviews the various initiatives and proposals that strove to produce helpful standards-related guidance. The focus turns next to a subtask initiated in late 2019, its scope narrowed to one of the core activities and responsibilities of CLARIN backbone centres, namely the provision of data deposition services. Centres are obligated to publish their recommendations concerning the repertoire of data formats that are best suited for their research profiles. We look at how this requirement has been met by the particular centres and suggest that having centres maintain their information in the Standards Information System (SIS) is the way to improve on the current state of affairs.
This chapter will present lessons learned from CLARIN-D, the German CLARIN national consortium. Members of the CLARIN-D communities and of the CLARIN-D consortium have been engaged in innovative, data-driven, and community-based research, using language resources and tools in the humanities and neighbouring disciplines. We will present different use cases and users’ stories that demonstrate the innovative research potential of large digital corpora and lexical resources for the study of language change and variation, for language documentation, for literary studies, and for the social sciences. We will emphasize the added value of making language resources and tools available in the CLARIN distributed research infrastructure and will discuss legal and ethical issues that need to be addressed in the use of such an infrastructure. Innovative technical solutions for accessing digital materials still under copyright and for data mining such materials will be presented. We will outline the need for close interaction with communities of interest in the areas of curriculum development, data management, and training the next generation of digital humanities scholars. The importance of community-supported standards for encoding language resources and the practice of community-based quality control for digital research data will be presented as a crucial step toward the provisioning of high quality research data. The chapter will conclude with a discussion of important directions for innovative research and for supporting infrastructure development over the next decade and beyond.
CLARIN stands for “Common Language Resources and Technology Infrastructure”. In 2012 CLARIN ERIC was established as a legal entity with the mission to create and maintain a digital infrastructure to support the sharing, use, and sustainability of language data (in written, spoken, or multimodal form) available through repositories from all over Europe, in support of research in the humanities and social sciences and beyond. Since 2016 CLARIN has had the status of Landmark research infrastructure and currently it provides easy and sustainable access to digital language data and also offers advanced tools to discover, explore, exploit, annotate, analyse, or combine such datasets, wherever they are located. This is enabled through a networked federation of centres: language data repositories, service centres, and knowledge centres with single sign-on access for all members of the academic community in all participating countries. In addition, CLARIN offers open access facilities for other interested communities of use, both inside and outside of academia. Tools and data from different centres are interoperable, so that data collections can be combined and tools from different sources can be chained to perform operations at different levels of complexity. The strategic agenda adopted by CLARIN and the activities undertaken are rooted in a strong commitment to the Open Science paradigm and the FAIR data principles. This also enables CLARIN to express its added value for the European Research Area and to act as a key driver of innovation and contributor to the increasing number of industry programmes running on data-driven processes and the digitalization of society at large.
Enabling appropriate access to linguistic research data, both for many researchers and for innovative research applications, is a challenging task. In this chapter, we describe how we address this challenge in the context of the German Reference Corpus DeReKo and the corpus analysis platform KorAP. The core of our approach, which is based on and tightly integrated into the CLARIN infrastructure, is to offer access at different levels. The graduated access levels make it possible to find a low-loss compromise between the possibilities opened up and the costs incurred by users and providers for each individual use case, so that, viewed over many applications, the ratio between effort and results achieved can be effectively optimized. We also report on experiences with the current state of this approach.
Preface
(2022)
Ethical issues in Language Resources and Language Technology are often invoked, but rarely discussed. This is at least partly because little work has been done to systematize ethical issues and principles applicable in the fields of Language Resources and Language Technology. This paper provides an overview of ethical issues that arise at different stages of Language Resources and Language Technology development, from the conception phase through the construction phase to the use phase. Based on this overview, the authors propose a tentative taxonomy of ethical issues in Language Resources and Language Technology, built around five principles: Privacy, Property, Equality, Transparency and Freedom. The authors hope that this tentative taxonomy will facilitate ethical assessment of projects in the field of Language Resources and Language Technology, and structure the discussion on ethical issues in this domain, which may eventually lead to the adoption of a universally accepted Code of Ethics of the Language Resources and Language Technology community.
In this paper, we address two problems in indexing and querying spoken language corpora with overlapping speaker contributions. First, we look into how token distance and token precedence can be measured when multiple primary data streams are available and when transcriptions happen to be tokenized, but are not synchronized with the sound at the level of individual tokens. We propose and experiment with a speaker based search mode that enables any speaker’s transcription tier to be the basic tokenization layer whereby the contributions of other speakers are mapped to this given tier. Secondly, we address two distinct methods of how speaker overlaps can be captured in the TEI based ISO Standard for Spoken Language Transcriptions (ISO 24624:2016) and how they can be queried by MTAS – an open source Lucene-based search engine for querying text with multilevel annotations. We illustrate the problems, introduce possible solutions and discuss their benefits and drawbacks.