This edited collection provides an overview of linguistic diversity, societal discourses and interaction between majorities and minorities in the Baltic States. It presents a wide range of methods and research paradigms including folk linguistics, discourse analysis, narrative analyses, code alternation, ethnographic observations, language learning motivation, languages in education and language acquisition. Grouped thematically, its chapters examine regional varieties and minority languages (Latgalian, Võro, urban dialects in Lithuania, Polish in Lithuania); the integration of the Russian language and its speakers; and the role of international languages like English in Baltic societies. The editors’ introductory and concluding chapters provide a comparative perspective that situates these issues within the particular history of the region and broader debates on language and nationalism at a time of both increased globalization and ethno-regionalism. This book will appeal in particular to students and scholars of multilingualism, sociolinguistics, language discourses and language policy, and provide a valuable resource for researchers focusing on Baltic States, Northern Europe and the post-Soviet world in the related fields of history, political science, sociology and anthropology.
Both compounds and multi-word expressions are complex lexical units, made up of at least two constituents. The most basic difference is that the former are morphological objects and the latter result from syntactic processes. However, the exact demarcation between compounds and multi-word expressions differs greatly from language to language and is often a matter of debate in and across languages. Similarly debated is whether and how these two different kinds of units complement or compete with each other.
The volume presents an overview of compounds and multi-word expressions in a variety of European languages. Central questions that are discussed for each language concern the formal distinction between compounds and multi-word expressions, their formation and their status in lexicon and grammar.
The volume contains chapters on German, English, Dutch, French, Italian, Spanish, Greek, Russian, Polish, Finnish, and Hungarian as well as a contrastive overview with a focus on German. It brings together insights from word-formation theory, phraseology and theory of grammar and aims to contribute to the understanding of the lexicon, both from a language-specific and cross-linguistic perspective.
Common Crawl is a very large, heterogeneous multilingual corpus composed of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files, each of which contains many documents written in a wide variety of languages. Even though each document has a metadata block associated with it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.
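The classify-by-language idea in the abstract above can be sketched roughly as follows. This is only a toy illustration, not the authors' pipeline: the stopword lists, the function names (`guess_language`, `classify_shard`, `classify_corpus`) and the in-memory "shards" are all assumptions made for the example, and a real system would use a trained language-identification model and stream files from disk. Threads are used here because the abstract names I/O as the main constraint.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stopword lists standing in for a real language-identification model.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "und", "ist", "die", "das"},
    "fr": {"le", "et", "est", "la", "les"},
}


def guess_language(line):
    """Return the language whose stopwords overlap most with the line, else 'unknown'."""
    tokens = set(line.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def classify_shard(lines):
    """Clean one shard (drop blank lines) and group its lines by guessed language."""
    buckets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:
            buckets[guess_language(line)].append(line)
    return dict(buckets)


def classify_corpus(shards, workers=4):
    """Classify shards concurrently and merge the per-shard buckets by language."""
    merged = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order the shards were submitted.
        for result in pool.map(classify_shard, shards):
            for lang, lines in result.items():
                merged[lang].extend(lines)
    return dict(merged)


# Two tiny in-memory shards stand in for the plain text files of the corpus.
shards = [
    ["The cat is on the mat.", "", "Der Hund ist im Garten."],
    ["Le chat est sur la table.", "12345"],
]
by_language = classify_corpus(shards)
```

Each shard is processed independently, so the number of workers can be tuned to the available infrastructure without changing the classification logic.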
Text corpora come in many different shapes and sizes and carry heterogeneous annotations, depending on their purpose and design. The true benefit of corpora is rooted in their annotation and the method by which this data is encoded is an important factor in their interoperability. We have accumulated a large collection of multilingual and parallel corpora and encoded it in a unified format which is compatible with a broad range of NLP tools and corpus linguistic applications. In this paper, we present our corpus collection and describe a data model and the extensions to the popular CoNLL-U format that enable us to encode it.
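For orientation, the base CoNLL-U format mentioned above stores one token per line in ten tab-separated columns, with `#` comment lines and a blank line between sentences. The following minimal reader is a sketch under that assumption only; the extensions proposed in the paper are not described in the abstract and are not modelled here, and the function name `parse_conllu` and the sample sentence are made up for the example.

```python
# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences with comments and token dicts."""
    sentences, comments, tokens = [], [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            values = line.split("\t")
            if len(values) != len(CONLLU_FIELDS):
                raise ValueError("expected 10 tab-separated columns: %r" % line)
            tokens.append(dict(zip(CONLLU_FIELDS, values)))
    if tokens:
        # Handle a final sentence that is not followed by a blank line.
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences


# A made-up example sentence, built with explicit tabs to keep the columns visible.
rows = [
    ["1", "Corpora", "corpus", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"],
    ["2", "vary", "vary", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"],
    ["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
sample = "# text = Corpora vary.\n" + "\n".join("\t".join(r) for r in rows) + "\n"
sentences = parse_conllu(sample)
```

Values are kept as strings so that extended or non-integer IDs in a real file would not break the reader.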
As the Web ought to be considered as a series of sources rather than as a source in itself, a problem facing corpus construction resides in meta-information and categorization. In addition, we need focused data to shed light on particular subfields of the digital public sphere. Blogs are relevant to that end, especially if the resulting web texts can be extracted along with metadata and made available in coherent and clearly describable collections.
Nearly all of the very large corpora of English are “static”, which allows a wide range of one-time, pre-processed data, such as collocates. The challenge comes with large “dynamic” corpora, which are updated regularly, and where preprocessing is much more difficult. This paper provides an overview of the NOW corpus (News on the Web), which is currently 8.2 billion words in size, and which grows by about 170 million words each month. We discuss the architecture of NOW, and provide many examples that show how data from NOW can (uniquely) be extracted to look at a wide range of ongoing changes in English.
The present research unites two emergent trends in the area of language attitudes: (a) research on perceptions of nonnative speakers by nonnative listeners and (b) the search for general, basic mechanisms underlying the evaluation of nonnative accented speakers. In three experiments featuring an employment situation, German participants listened to a presentation given in English by a German speaker with a strong versus native-like accent (in Studies 1–3) versus a native speaker of English (in Study 1). They evaluated candidates with a strong accent worse than candidates with a native(-like) pronunciation—even to the degree that the quality of arguments was of no relevance (Study 1). Study 2 introduces an effective intervention to reduce these discriminatory tendencies. Across studies, affect and competence emerged as major mediators of hirability evaluations. Study 3 further revealed sequential indirect influences, which advance our understanding of previous inconsistent findings regarding disfluency and warmth perceptions.
We report on a new project building a Natural Language Processing resource for Zulu by making use of resources already available. Combining tagging results with the results of morphological analysis semi-automatically, we expect to reduce the amount of manual work when generating a fine-grained gold standard corpus usable for training a tagger. From the tagged corpus, we plan to extract verb-argument pairs with the aim of compiling a verb valency lexicon for Zulu.
Research on language politics, policy, and planning is of importance to contact linguistics, since political relations between groups of language users, the way in which the use of language(s) is organized, and how language issues are politicized fundamentally shape the political and social conditions under which language varieties are in contact. This chapter first provides a short sketch of how language policy, planning, and politics have so far been conceptualized. Major subfields will be discussed, and then relevant actors and factors in these processes will be introduced. At the end, these aspects will be discussed from a contact linguistic perspective and summarized in a graphic visualization.
Studies on the Linguistic Landscapes (LLs) investigate frequencies, functions, and power relations between languages and their speakers in public space. Research on the LL thereby aims to understand how the production and perception of signs reflect and simultaneously shape realities. In this sense, the LL is one of the most dynamic places where processes of minoritization take place: the (in)visibility of minority languages and the functional and symbolic relationships to majority languages are in direct relationship with negotiations of minorities’ place in society. This chapter looks at minority languages in the LL from two major perspectives. Firstly, it discusses language policies, focussing on which policy categories and which domains of language use are of particular relevance for understanding minority languages in the LL. Then, it turns to issues of conflict, contestation, and exclusion by providing examples from a range of geographically and typologically prototypical case studies, including Israel, Canada, Belgium, the Basque Country, and Friesland.
Resistance and adaptation to newspeakerness in educational institutions: two tales from Estonia
(2019)
The term ‘new speaker’ has recently emerged as an attempt by sociolinguists not only to understand the different types of speaker profiles that can be found in contemporary societies, but also to grasp the underlying processes of becoming a legitimate speaker in a given society. In this article, we combine the results from two studies situated in two educational institutions in Estonia in order to find out about speakers’ language attitudes and experiences in connection to learning and using Estonian. We concentrate on members of the international community who have relatively recently arrived in the country. Our results indicate that these speakers fluctuate between two prototypical discourses, which we broadly dub as ‘resistance’ and ‘adaptation’ to newspeakerness. Our study thereby adds to current debates on ‘new speaker’ and language policy issues by illustrating how tensions around language legitimacy are played out on the ground in a small nation state such as Estonia.
This chapter investigates differences in language regards in Latvia and Estonia. Based on the results of a survey that had about 1000 respondents in each country, it analyses general views on languages and language-learning motivation, as well as specific regards of Estonian, Latvian, Russian, English, German and other languages. The results show that languages and language learning are generally important for the respondents; language-learning motivation is overwhelmingly instrumental. Besides the obvious value of the titular languages of each country, English and Russian are to differing degrees considered of importance for professional and leisure purposes, ahead of German, Finnish (in Estonia) and French, whereas other languages are of little relevance. In more emotionally related categories, differences are more salient. L1-speakers of Russian differ in their views from L1-speakers of Estonian and Latvian, indicating that the linguistic acculturation of society in Estonia tends to be more monodirectional towards Estonian, whereas in Latvia there are more bidirectional tendencies as both Latvian and Russian L1-speakers regard each other’s languages as at least moderately relevant.
This chapter introduces readers to the context and concept of this volume. It starts by providing an historical overview of languages and multilingualism in Lithuania, Estonia and Latvia, highlighting the 100th anniversary of statehood which the three Baltic states are celebrating in 2018. Then, the chapter briefly presents important strands of research on multilingualism in the region throughout the past decades; in particular, questions about language policies and the status of the national languages (Estonian, Latvian and Lithuanian) and Russian. It also touches on debates about languages in education and the roles of other languages such as the regional languages of Latgalian and Võro and the changing roles of international languages such as English and German. The chapter concludes by providing short summaries of the contributions to this book.
This paper analyzes the LL in the city of Bautzen / Budyšin in Germany, a town which is frequently considered the “capital” of the Slavonic minority of the Sorbs. It focuses on the societal role of Sorbian in relation to practices and ideologies of mainstream German society. The vast majority of signs in Bautzen / Budyšin are in German only. Sorbian is essentially restricted to explicitly Sorbian institutions and to local and regional administration. Interviews conducted in shops and on the streets reveal that paternalistic attitudes common to perceptions of language policies and minority languages in Germany dominate; practices maintain the common monolingual habitus in German society. Members of the majority population show little awareness of Sorbian issues, and Sorbian signage is seen as a generous gesture but considered essentially unnecessary. Only very recently has a reaction by the Sorbian community challenged these practices and attitudes.
Automatic division of spoken language transcripts into sentence-like units is a challenging problem, caused by disfluencies, ungrammatical structures and the lack of punctuation. We present experiments on dividing up German spoken dialogues where we investigate the impact of task setup and data representation, encoding of context information as well as different model architectures for this task.
We discuss the impact of data bias on abusive language detection. We show that classification scores on popular datasets reported in previous work are much lower under realistic settings in which this bias is reduced. Such biases are most notably observed on datasets that are created by focused sampling instead of random sampling. Datasets with a higher proportion of implicit abuse are more affected than datasets with a lower proportion.
We examine the new task of detecting derogatory compounds (e.g. curry muncher). Derogatory compounds are much more difficult to detect than derogatory unigrams (e.g. idiot) since they are more sparsely represented in lexical resources previously found effective for this task (e.g. Wiktionary). We propose an unsupervised classification approach that incorporates linguistic properties of compounds. It mostly depends on a simple distributional representation. We compare our approach against previously established methods proposed for extracting derogatory unigrams.
This paper reports on the latest developments of the European Reference Corpus EuReCo and the German Reference Corpus in relation to three of the most important CMLC topics: interoperability, collaboration on corpus infrastructure building, and legal issues. Concerning interoperability, we present new ways to access DeReKo via KorAP on the API and on the plugin level. In addition, we report on advancements in the EuReCo and ICC initiatives with the provision of comparable corpora, and on recent problems with license acquisitions and our solution approaches using an indemnification clause and model licenses that include scientific exploitation.
Distributional models of word use constitute an indispensable tool in corpus-based lexicological research for discovering paradigmatic relations and syntagmatic patterns (Belica et al. 2010). Recently, word embeddings (Mikolov et al. 2013) have revived the field by making it possible to construct and analyze distributional models on very large corpora. This is accomplished by reducing the very high dimensionality of word cooccurrence contexts, the size of the vocabulary, to a few dimensions, such as 100-200. However, word use and meaning can vary widely along dimensions such as domain, register, and time, and word embeddings tend to represent only the most prevalent meaning. In this paper we thus construct domain-specific word embeddings to allow for systematically analyzing variations in word use. Moreover, we also demonstrate how to reconstruct domain-specific co-occurrence contexts from the dense word embeddings.
Since 2013, representatives of several French and German CMC corpus projects have developed three customizations of the TEI-P5 standard for text encoding in order to adapt the encoding schema and models provided by the TEI to the structural peculiarities of CMC discourse. Based on the three schema versions, a fourth version has been created which takes into account the experiences from encoding our corpora and which is specifically designed for the submission of a feature request to the TEI council. On our poster, we present the structure of this schema and its relations (commonalities and differences) to the previous schemas.
This paper presents the prototype of a lexicographic resource for spoken German in interaction, which was conceived within the framework of the LeGeDe-project (LeGeDe=Lexik des gesprochenen Deutsch). First of all, it summarizes the theoretical and methodological approaches that were used for the initial planning of the resource. The headword candidates were selected by analyzing corpus-based data. To this end, the data of two corpora (written and spoken German) were compared using quantitative methods. The information that was gathered on the selected headword candidates can be assigned to two different sections: meanings and functions in interaction.
Additionally, two studies on the expectations of future users towards the resource were carried out. The results of these two studies were also taken into account in the development of the prototype. Focusing on the presentation of the resource’s content, the paper shows both the different lexicographical information in selected dictionary entries, and the information offered by the provided hyperlinks and external texts. As a conclusion, it summarizes the most important innovative aspects that were specifically developed for the implementation of such a resource.
In this paper, we present WebAnno-MM, an extension of the popular web-based annotation tool WebAnno, which is designed for the linguistic annotation of transcribed spoken data with time aligned media files. Several new features have been implemented for our current use case: a novel teaching method based on pair-wise manual annotation of transcribed video data and systematic comparison of agreement between students. To enable the annotation of transcribed spoken language data, apart from technical and data model related challenges, WebAnno-MM offers an additional view to data: a (musical) score view for the inspection of parallel utterances, which is relevant for various methodological research questions regarding the analysis of interactions of spoken content.
In this paper, we describe a data processing pipeline used for annotated spoken corpora of Uralic languages created in the INEL (Indigenous Northern Eurasian Languages) project. With this processing pipeline we convert the data into a lossless standard format (ISO/TEI) for long-term preservation while simultaneously enabling a powerful search in this version of the data. For each corpus, the input we are working with is a set of files in EXMARaLDA XML format, which contain transcriptions, multimedia alignment, morpheme segmentation and other kinds of annotation. The first step of processing is the conversion of the data into a certain subset of TEI following the ISO standard ‘Transcription of spoken language’ with the help of an XSL transformation. The primary purpose of this step is to obtain a representation of our data in a standard format, which will ensure its long-term accessibility. The second step is the conversion of the ISO/TEI files to a JSON format used by the “Tsakorpus” search platform. This step allows us to make the corpora available through a web-based search interface. In addition, the existence of such a converter allows other spoken corpora with ISO/TEI annotation to be made accessible online in the future.
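The second step of the pipeline described above, turning TEI utterances into JSON for a search platform, might look roughly like the sketch below. It is an illustration under strong assumptions: the sample document is a heavily simplified TEI fragment (real ISO/TEI transcriptions carry timelines, annotation blocks and more), the output record layout with `speaker`, `words` and `text` keys is hypothetical and is not the actual Tsakorpus JSON schema, and the function name `tei_to_records` is invented for the example. The first step (EXMARaLDA to TEI via XSLT) is not covered.

```python
import json
import xml.etree.ElementTree as ET

# The TEI namespace; utterances are encoded as <u> elements with a @who attribute.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A heavily simplified, made-up TEI fragment with two utterances.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <u who="#SPK0"><w>hello</w> <w>there</w></u>
    <u who="#SPK1"><w>hi</w></u>
  </body></text>
</TEI>"""


def tei_to_records(tei_xml):
    """Convert each TEI <u> element into a flat, JSON-serialisable record."""
    root = ET.fromstring(tei_xml)
    records = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        # Collect the word tokens that are direct children of the utterance.
        words = [w.text for w in u.iterfind("tei:w", TEI_NS) if w.text]
        records.append({
            "speaker": u.get("who", "").lstrip("#"),  # hypothetical output key
            "words": words,
            "text": " ".join(words),
        })
    return records


records = tei_to_records(SAMPLE_TEI)
search_json = json.dumps(records, ensure_ascii=False, indent=2)
```

Because the converter reads only standard TEI elements, the same function could in principle be pointed at other corpora encoded in this way, which is the reuse argument the abstract makes.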
The Lehnwortportal Deutsch (2012 seqq.) serves as an integrated online information system on German lexical borrowings into other languages, synthesizing an increasing number of lexicographical dictionaries and providing basic cross-resource search options. The paper discusses the far-reaching revision of the system’s conceptual, lexicographical and technological underpinnings currently under way, focussing on their relevance for multilingual loanword lexicography.
A "polyglottal" speech synthesis - modifications for a replica of Kempelen's speaking machine
(2019)
Smooth turn-taking in conversation depends in part on speakers being able to communicate their intention to hold or cede the floor. Both prosodic and gestural cues have been shown to be used in this context. We investigate the interplay of pitch movements and hand gestures at locations at which speaker change becomes relevant, comparing their use in German and Swedish. We find that there are some shared functions of prosody and gesture with regard to turn-taking in the two languages, but that these shared functions appear to be mediated by the different phonological demands on pitch in the two languages.
This paper describes a rule-based approach to detect direct speech without the help of any quotation markers. Fictional and non-fictional texts were used as datasets. Our evaluation shows that the results appear stable throughout different datasets in the fictional domain and are comparable to the results achieved in related work.
This paper presents types and annotation layers of reply relations in computer-mediated communication (CMC). Reply relations hold between post units in CMC interactions and describe references from one given post to a previous post. We classify three types of reply relations in CMC interactions: first, technical replies, i.e. the possibility to reply directly to a previous post by clicking a ‘reply’ button; second, indentations, e.g. in wiki talk pages in which users insert their contributions in the existing talk page by indenting them; and third, interpretative reply relations, i.e. the reply action is not realised formally but signalled by other structural or linguistic means such as address markers ‘@’, greetings, citations and/or Q-A structures. We take a look at existing practices in the description and representation of such relations in corpora and examples of chat, Wikipedia talk pages, Twitter and blogs. We then provide an annotation proposal that combines the different levels of description and representation of reply relations and which adheres to the schemas and practices for encoding CMC corpus documents within the TEI framework as defined by the TEI CMC SIG. It constitutes a prerequisite for correctly identifying higher levels of interactional relations such as dialogue acts or discussion trees.
Classical null hypothesis significance tests are not appropriate in corpus linguistics, because the randomness assumption underlying these testing procedures is not fulfilled. Nevertheless, there are numerous scenarios where it would be beneficial to have some kind of test in order to judge the relevance of a result (e.g. a difference between two corpora) by answering the question whether the attribute of interest is pronounced enough to warrant the conclusion that it is substantial and not due to chance. In this paper, I outline such a test.
This paper discusses German neologisms in the so-called “new media” and presents a German corpus-based online dictionary of neologisms. Several neological morphemes and lexemes, as well as their meanings, will be presented, showing that these new modes of communication are an important source of enrichment of the German lexicon.
In the first volume of Corpus Linguistics and Linguistic Theory, Gries (2005. Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory 1(2). doi:10.1515/cllt.2005.1.2.277. http://www.degruyter.com/view//cllt.2005.1.issue-2/cllt.2005.1.2.277/cllt.2005.1.2.277.xml: 285) asked whether corpus linguists should abandon null-hypothesis significance testing. In this paper, I want to revive this discussion by defending the argument that the assumptions that allow inferences about a given population – in this case about the studied languages – based on results observed in a sample – in this case a collection of naturally occurring language data – are not fulfilled. As a consequence, corpus linguists should indeed abandon null-hypothesis significance testing.
The recognizability of a stretch of conduct as social action depends on details of turn construction as well as the turn’s context. We examine details of turn construction as they enter into actions offering interpretations of prior talk. Such actions either initiate repair or formulate a conclusion from prior talk. We focus on how interpretation markers (das heißt [“that means”] vs. du meinst [“you mean”]) and interpretation formats (phrasal vs. clausal turn completions) each make their invariant contribution to specific interpreting practices. Interpretation marker and turn format go hand in hand, which leads to distinct patterns of interpreting practices: Das heißt+clause is especially apt for formulations, du meinst+phrase for repair. The results suggest that details of turn construction can systematically enter into the constitution of social action. Data are in German with English translation.
Little strokes fell great oaks. Creating CoRoLa, the reference corpus of contemporary Romanian
(2019)
The paper presents the quite long-standing tradition of Romanian corpus acquisition and processing, which reaches its peak with the reference corpus of contemporary Romanian language (CoRoLa). The paper describes decisions behind the kinds of texts collected, as well as processing and annotation steps, highlighting the structure and importance of metadata to the corpus. The reader is also introduced to the three ways in which (s)he can plunge into the rich linguistic data of the corpus, waiting to be discovered. Besides querying the corpus, word embeddings extracted from it are useful to various natural language processing applications and for linguists, when user-friendly interfaces offer them the possibility to exploit the data.
Canadian heritage German across three generations: A diary-based study of language shift in action
(2019)
It is well known that migration has an effect on language use and language choice. If the language of origin is maintained after migration, it tends to change in the new contact setting. Often, migrants shift to the new majority language within a few generations. The current paper examines a diary corpus containing data from three generations of one German-Canadian family, ranging from 1867 to 1909, and covering the second to fourth generation after immigration. The paper analyzes changes that can be observed between the generations, with respect to the language system as well as to the individuals’ decision on language choice. The data not only offer insight into the dynamics of acquiring a written register of a heritage language, and the eventual shift to the majority language. They also allow us to identify different linguistic profiles of heritage speakers within one community. It is discussed how these profiles can be linked to the individuals’ family backgrounds and how the combination of these backgrounds may have contributed to giving up the heritage language in favor of the majority language.
Linguistic relativists have traditionally asked 'how language influences thought', but conversation analysts and anthropological linguists have moved the focus from thought to social action. We argue that 'social action' should in this context not become simply a new dependent variable, because the formulation 'does language influence action' suggests that social action would already be meaningfully constituted prior to its local (verbal and multi-modal) accomplishment. We draw on work by the gestalt psychologist Karl Duncker to show that close attention to action-in-a-situation helps us ground empirical work on cross-cultural diversity in an appreciation of the invariances that make culture-specific elements of practice meaningful.
Theories of lexical decomposition assume that lexical meanings are complex. This complexity is expressed in structured meaning representations that usually consist of predicates, arguments, operators, and other elements of propositional and predicate logic. Lexical decomposition has been used to explain phenomena such as argument linking, selectional restrictions, lexical-semantic relations, scope ambiguities, and the inference behavior of lexical items. The article sketches the early theoretical development from noun-oriented semantic feature theories to verb-oriented complex decompositions. It also deals with a number of theoretical issues, including the controversy between decompositional and atomistic approaches to meaning, the search for semantic primitives, the function of decompositions as definitions, problems concerning the interpretability of decompositions, and the debate about the cognitive status of decompositions.
Colonial studies
(2019)
The present paper examines a variety of ways in which the Corpus of Contemporary Romanian Language (CoRoLa) can be used. A multitude of examples intends to highlight a wide range of interrogation possibilities that CoRoLa opens for different types of users. The querying of CoRoLa displayed here is supported by the KorAP frontend, through the querying language Poliqarp. Interrogations address annotation layers, such as the lexical, morphological and, in the near future, the syntactical layer, as well as the metadata. Other issues discussed are how to build a virtual corpus, how to deal with errors, how to find expressions and how to identify expressions.
Novel formats of construction-based description hold great potential for phenomena that fall through the cracks in traditional kinds of linguistic reference works. Using the example of German verb argument structure constructions with a prepositional object, we demonstrate that a construction-based description of such phenomena is superior to existing lexicographic and grammaticographic treatments, but that it also poses a number of new problems. The most fundamental of these relates to the fact that construction-based analyses can be proposed on different levels of abstraction. We illustrate pertinent problems relating to the precise identification of constructional form and meaning and suggest a multi-layered descriptive format for web-based electronic reference constructica that can accommodate these challenges. Semantically, the proposed solution integrates both lumping and splitting perspectives on constructional grain size and permits users to flexibly zoom in and out on individual elements in the resource. Formally, it can capture variation in the number and marking of realised arguments as found in e.g. passives and transitivity alternations. Aspects of the theoretical controversy between Construction Grammar and Valency Theory are addressed where relevant, but our focus is on questions of description and the practical implementation of construction-based analyses in a suitable type of linguistic reference work.
This article examines a recurrent format that speakers use for defining ordinary expressions or technical terms. Drawing on data from four different languages - Flemish, French, German, and Italian - it focuses on definitions in which a definiendum is first followed by a negative definitional component (‘definiendum is not X’), and then by a positive definitional component (‘definiendum is Y’). The analysis shows that by employing this format, speakers display sensitivity towards a potential meaning of the definiendum that recipients could have taken to be valid. By negating this meaning, speakers discard this possible, yet unintended understanding. The format serves three distinct interactional purposes: (a) it is used for argumentation, e.g. in discussions and political debates, (b) it works as a resource for imparting knowledge, e.g. in expert talk and instructions, and (c) it is employed, in ordinary conversation, for securing the addressee's correct understanding of a possibly problematic expression. The findings contribute to our understanding of how epistemic claims and displays relate to the turn-constructional and sequential organization of talk. They also show that the much quoted ‘problem of meaning’ is, first and foremost, a participant's problem.
In an earlier publication it was claimed that there is no useful relationship between Swahili-English dictionary look-up frequencies and the occurrence frequencies for the same wordforms in Swahili-English corpora, at least not beyond the top few thousand wordforms. This result was challenged using data for German by a different team of researchers using an improved methodology. In the present article the original Swahili-English data is revisited, using ten years’ worth of it rather than just two, and using the improved methodology. We conclude that there is indeed a positive relationship. In addition, we show that online dictionary look-up behaviour is remarkably similar across languages, even when, as in our case, one is dealing with languages from very dissimilar language families. Furthermore, online dictionaries turn out to have minimum look-up success rates, below which they simply cannot go. These minima are language-sensitive and vary depending on the regularity of the searched-for entries, but are otherwise constant no matter the size of randomly sampled dictionaries. Corpus-informed sampling always improves on any random method. Lastly, from the point of view of the graphical user interface, we argue that the average user of an online bilingual dictionary is better served with a single search box, rather than separate search boxes for each dictionary side.