Refine
Year of publication
Document Type
- Part of a Book (28)
- Article (17)
- Conference Proceeding (16)
- Book (1)
Keywords
- Lehnwort (20)
- Computerunterstützte Lexikographie (15)
- Deutsch (14)
- Sprachstatistik (7)
- Wörterbuch (7)
- Linguist (6)
- Biografie (4)
- Korpus <Linguistik> (4)
- Online-Wörterbuch (4)
- Russisch (4)
Publicationstate
- Veröffentlichungsversion (35)
- Postprint (5)
- Zweitveröffentlichung (2)
Reviewstate
- Peer-Review (15)
- (Verlags)-Lektorat (12)
- Verlags-Lektorat (12)
- Peer-review (1)
Publisher
- Niemeyer (6)
- De Gruyter (4)
- IDS-Verlag (3)
- Lexical Computing CZ s.r.o. (3)
- Sagner (3)
- Institut für Deutsche Sprache (2)
- Trojina, Institute for Applied Slovene Studies (2)
- Zenodo (2)
- Accademia della Crusca (1)
- Democritus University of Thrace (1)
One of the fundamental questions about human language is whether all languages are equally complex. Here, we approach this question from an information-theoretic perspective. We present a large scale quantitative cross-linguistic analysis of written language by training a language model on more than 6500 different documents as represented in 41 multilingual text collections consisting of ~ 3.5 billion words or ~ 9.0 billion characters and covering 2069 different languages that are spoken as a native language by more than 90% of the world population. We statistically infer the entropy of each language model as an index of what we call average prediction complexity. We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. In addition, we show that speaker population size predicts entropy. We argue that both results constitute evidence against the equi-complexity hypothesis from an information-theoretic perspective.
The Lehnwortportal Deutsch (2012 seqq.) serves as an integrated online information system on German lexical borrowings into other languages, synthesizing an increasing number of lexicographical dictionaries and providing basic cross-resource search options. The paper discusses the far-reaching revision of the system’s conceptual, lexicographical and technological underpinnings currently under way, focussing on their relevance for multilingual loanword lexicography.
In this paper we present an experimental semantic search function, based on word embeddings, for an integrated online information system on German lexical borrowings into other languages, the Lehnwortportal Deutsch (LWPD). The LWPD synthesizes an increasing number of lexicographical resources and provides basic cross-resource search options. Onomasiological access to the lexical units of the portal is a highly desirable feature for many research questions, such as the likelihood of borrowing lexical units with a given meaning (Haspelmath & Tadmor, 2009; Zeller, 2015). The search technology is based on multilingual pre-trained word embeddings, and individual word senses in the portal are associated with word vectors. Users may select one or more among a very large number of search terms, and the database returns lexical items with word sense vectors similar to these terms. We give a preliminary assessment of the feasibility, usability and efficacy of our approach, in particular in comparison to search options based on semantic domains or fields.
The web portal Lehnwortportal Deutsch (lwp.ids-mannheim.de), developed at the Institute for the German Language (IDS), aims to provide unified access to existing and possibly new dictionaries of German loanwords in other languages. Internally, the lexicographical information is represented as a directed acyclic graph of relations between words. The graph abstracts from the idiosyncrasies of the individual component dictionaries. This paper explores two different strategies to make complex graph-based cross-dictionary queries in such a portal more accessible to users. The first strategy effectively hides the underlying graph structure, but allows users to assign scopes (internally defined in terms of the graph structure) to search criteria. A second type of search strategy directly formulates queries in terms of the relational graph structure. In this case, search results are not entries but n-tuples of words (metalemmata, loanwords, etyma); a query consists of specifying properties of these words and relations between them. A working prototype of an easy-to-use human-readable declarative query language is presented and ways to interactively construct queries are discussed.
This paper presents a dictionary writing system developed at the Institute for the German Language in Mannheim (IDS) for an ongoing international lexicographical project that traces the way of German loanwords in the East Slavic languages Russian, Belarusian and Ukrainian that were possibly borrowed via Polish. The results will be published in the Lehnwortportal Deutsch (LWP, lwp.ids-mannheim.de), a web portal for loanword dictionaries with German as the common donor language. The system described here is currently in use for excerpting data from a large range of historical and contemporary East Slavic monolingual dictionaries. The paper focuses on the tools that help in merging excerpts that are etymologically related to one and the same Polish etymon. The merging process involves eliminating redundancies and inconsistencies and, above all, mapping word senses of excerpted entries onto a common cross-language set of ‘metasenses’. This mapping may involve literally hundreds of excerpted East Slavic word senses, including quotations, for one ‘underlying’ Polish etymon.
This paper reports on an ongoing international project of compiling a freely accessible online Dictionary of German Loans in Polish Dialects. The dictionary will be the first comprehensive lexicographic compendium of its kind, serving as a complement to existing resources on German lexical loans in the literary or standard language. The empirical results obtained in the project will shed new light on the distribution of German loanwords among different dialects, also in comparison to the well-documented situation in written Polish. The dictionary will have a strong focus on the dialectal distribution of Polish dialectal variants for a given German etymon, accessible through interactive cartographic representations and corresponding search options. The editorial process is realized with dedicated collaborative web tools. The new resource will be published as an integrated part of an online information system for German lexical borrowings in other languages, the Lehnwortportal Deutsch, and is therefore highly cross-linked with other loanword dictionaries on Polish as well as Slavic and further European languages.
The representation of semantic relations between word senses of different entries in a dictionary is subject to a number of consistency requirements. This paper discusses the issue of maintaining and accessing consistent information on cross-references between sense-related items in electronic dictionaries from a mainly text-technological point of view. We present a number of consistency criteria for cross-referencing related senses and propose a practical approach to handling sense relations in an online dictionary. Our proposal is currently being tested in a large ongoing online dictionary project for German called elexiko. We focus on three different aspects of the dictionary development and editing process where consistency is an important issue: lexicographic data modelling, implementation of a lexicographic database system for an electronic dictionary, and development of practical tools for the lexicographer’s workbench.
The web portal Lehnwortportal Deutsch <lwp.ids-mannheim.de>, developed at the Institute for the German Language (IDS), aims to provide unified access to a growing number of lexicographical resources on German loanwords in other languages. This paper discusses different possibilities of creating an onomasiological access structure for portal users. We critically examine the meaning list of the “World Loanword Database” project (Haspelmath/Tadmor 2009a) as well as WordNet-based taxonomies and propose a new way of inductively creating a semantic classification scheme that takes both hyperonymic relations and semantic fields into account. We show how such a classification can be integrated into the underlying graph-based data representation of the Lehnwortportal and thus be exploited for advanced onomasiological search options.
Datenmodellierung
(2016)
We start by trying to answer a question that has already been asked by de Schryver et al. (2006): Do dictionary users (frequently) look up words that are frequent in a corpus. Contrary to their results, our results that are based on the analysis of log files from two different online dictionaries indicate that users indeed look up frequent words frequently. When combining frequency information from the Mannheim German Reference Corpus and information about the number of visits in the Digital Dictionary of the German Language as well as the German language edition of Wiktionary, a clear connection between corpus and look-up frequencies can be observed. In a follow-up study, we show that another important factor for the look-up frequency of a word is its temporal social relevance. To make this effect visible, we propose a de-trending method where we control both frequency effects and overall look-up trends.
In this paper, the authors use the 2012 log files of two German online dictionaries (Digital Dictionary of the German Language and the German Version of Wiktionary) and the 100,000 most frequent words in the Mannheim German Reference Corpus from 2009 to answer the question of whether dictionary users really do look up frequent words, first asked by de Schryver et al. (2006). By using an approach to the comparison of log files and corpus data which is completely different from that of the aforementioned authors, we provide empirical evidence that indicates - contrary to the results of de Schryver et al. and Verlinde/Binon (2010) - that the corpus frequency of a word can indeed be an important factor in determining what online dictionary users look up. Finally, we incorporate word class Information readily available in Wiktionary into our analysis to improve our results considerably.
Eduard Rudolf Thurneysen
(2009)
Eduard Rudolf Thurneysen
(1996)
Der vorliegende Beitrag stellt einen neuartigen Typ von mehrsprachiger elektronischer Ressource vor, bei dem verschiedene Lehnwörterbücher zu einem "umgekehrten Lehnwörterbuch" für eine bestimmte Gebersprache zusammengefasst werden. Ein solches Wörterbuch erlaubt es, die zu einem Etymon der Gebersprache gehörigen Lehnwörter in verschiedenen Nehmersprachen zu finden. Die Entwicklung einer solchen Webanwendung, insbesondere der zugrundeliegenden Datenbasis, ist mit zahlreichen konzeptionellen Problemen verbunden, die an der Schnittstelle zwischen lexikographischen und informatischen Themen liegen. Der Beitrag stellt diese Probleme vor dem Hintergrund wünschenswerter Funktionalitäten eines entsprechenden Internetportals dar und diskutiert einen möglichen Lösungsansatz: Die Artikel der Einzelwörterbücher werden als XML-Dokumente vorgehalten und dienen als Grundlage für die gewöhnliche Online-Ansicht dieser Wörterbücher; insbesondere für portalweite Abfragen werden aber grundlegende, standardisierte Informationen zu Lemmata und Etyma aller Portalwörterbücher samt deren Varianten und Wortbildungsprodukten (hier zusammenfassend als "Portalinstanzen" bezeichnet) sowie die verschiedenartigen Relationen zwischen diesen Portalinstanzen zusätzlich in relationalen Datenbanktabelle nabgelegt, die performante und beliebig komplex strukturierte Suchabfragen gestatten.
Gerd Hentschel gehört zu den Pionieren der heutigen Computerlexikografie und der IT-gestützten Korpuserschließung. Eine seiner ersten Zeitschriftenpublikationen, mit dem Titel Einsatz von EDV und Mikrocomputer in einem lexikographischen Forschungsprojekt zum deutschen Lehnwort im Polnischen (Hentschel 1983), befasst sich mit der Frage, wie - unter den damaligen technischen Vorzeichen - Forschungs- und Dokumentationsarbeiten zu polnischen Germanismen sinnvoll durch die Verwendung von Computern unterstützt werden können. Die besagten Arbeiten mündeten später in die Online-Publikation des Wörterbuchs der deutschen Lehnwörter in der polnischen Schrift- und Standardsprache (WDLP). Es ist aus heutiger Sicht bemerkenswert, mit welchen Beschränkungen die Arbeit mit dem Computer noch vor 40 Jahren zu kämpfen hatte. Aus gegebenem Anlass sei es gestattet, diesen Punkt etwas ausführlicher zu illustrieren.
We present the conceptual foundations and basic features of fLexiCoGraph, a generic software package for creating and presenting curated human-oriented lexicographical resources that are roughly modeled according to Měchura’s (2016) idea of graph-augmented trees. The system is currently under development and will be made accessible as open source software. As a sample use case we discuss an existing online database of loanwords borrowed from German into other languages which is based on a growing number of language-specific loanword dictionaries (Lehnwortportal Deutsch). The paper outlines the conceptual foundations of fLexiCoGraph’s hybrid graph/XML data model. To establish a database, XML-based resources may be imported or even input manually. An additional graph database layer is then constructed from these XML source documents in a freely configurable, but automated way; subsequently, the resulting graph can be manipulated and enlarged through a visual user interface in such a way that keeps the relationship to the source document information explicit at all times. We sketch the tooling support for different kinds of graph-level editing processes, including mechanisms for dealing with updated XML source documents and coping with duplicate or inconsistent information, and briefly discuss the browser interface for end users.
This paper reports on an ongoing lexicographical project that investigates Polish loanwords from German that were further borrowed into the East Slavic languages Russian, Ukrainian, and Belorussian. The results will be published as three separate dictionaries in the Lehnwortportal Deutsch, a freely available web portal for loanword dictionaries having German as their common source language. On the database level, the portal models lexicographical data as a cross-resource directed acyclic graph of relations between individual words, including German ‘metalemmata’ as normalized representations of diasystemic variants of German etyma. Amongst other things, this technology makes it possible to use the web portal as an ‘inverted loanword dictionary’ to find loanwords in different languages borrowed from the same German etymon. The different possible pathways of German loanwords that went through Polish into the East Slavic languages can be represented directly as paths in the graph. A dedicated in-house dictionary editing software system assists lexicographers in producing and keeping track of these paths even in complex cases where, e.g, only a derivative of a German loanword in Polish has been borrowed into Russian. The paper concludes with some remarks on the particularities of the dictionary/portal access structure needed for presenting and searching borrowing chains.
Hermann Osthoff
(2009)
Hermann Osthoff
(1996)
A central goal of linguistics is to understand the diverse ways in which human language can be organized (Gibson et al. 2019; Lupyan/Dale 2016). In our contribution, we present results of a large scale cross-linguistic analysis of the statistical structure of written language (Koplenig/Wolfer/Meyer 2023) we approach this question from an information-theoretic perspective. To this end, we conduct a large scale quantitative cross-linguistic analysis of written language by training a language model on more than 6,500 different documents as represented in 41 multilingual text collections, so-called corpora, consisting of ~3.5 billion words or ~9.0 billion characters and covering 2,069 different languages that are spoken as a native language by more than 90% of the world population. We statistically infer the entropy of each language model as an index of un. To this end, we have trained a language model on more than 6,500 different documents as represented in 41 parallel/multilingual corpora consisting of ~3.5 billion words or ~9.0 billion characters and covering 2,069 different languages that are spoken as a native language by more than 90% of the world population or ~46% of all languages that have a standardized written representation. Figure 1 shows that our database covers a large variety of different text types, e.g. religious texts, legalese texts, subtitles for various movies and talks, newspaper texts, web crawls, Wikipedia articles, or translated example sentences from a free collaborative online database. Furthermore, we use word frequency information from the Crúbadán project that aims at creating text corpora for a large number of (especially under-resourced) languages (Scannell 2007). We statistically infer the entropy rate of each language model as an information-theoretic index of (un)predictability/complexity (Schürmann/Grassberger 1996; Takahira/Tanaka-Ishii/Dębowski 2016). Equipped with this database and information-theoretic estimation framework, we first evaluate the so-called ‘equi-complexity hypothesis’, the idea that all languages are equally complex (Sampson 2009). We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. This constitutes evidence against the equi-complexity hypothesis from an information-theoretic perspective. We then present, discuss and evaluate evidence for a complexity-efficiency trade-off that unexpectedly emerged when we analysed our database: high-entropy languages tend to need fewer symbols to encode messages and vice versa. Given that, from an information theoretic point of view, the message length quantifies efficiency – the shorter the encoded message the higher the efficiency (Gibson et al. 2019) – this indicates that human languages trade off efficiency against complexity. More explicitly, a higher average amount of choice/uncertainty per produced/received symbol is compensated by a shorter average message length. Finally, we present results that could point toward the idea that the absolute amount of information in parallel texts is invariant across different languages.
Julius Pokorny
(1996)
Julius Pokorny
(2009)
In this paper, we address issues of inconsistencies of dictionary information and how different corpus methods and computer tools can assist in providing systematic cross-referencing. The question is raised how hyperlinking in an electronic reference work can be approached systematically in order to warrant consistent symmetrical links between synonyms or antonyms. Firstly, it is argued that working with a comprehensive corpus does not account for consistent cross-referencing. It is shown that a top-down corpus-driven linguistic analysis also does not guarantee the lexicographic documentation of binary lexico-semantic relations covered by corpus data, as proposed by Paradis/Willners (2006a, b). Secondly, with the help of dictionary examples taken from elexiko (an online dictionary of contemporary German) we demonstrate how a combination of both corpus-driven and corpus-based procedures enables lexicographers to systematically exploit corpus material in more depth than by using only one of these methods. It is also discussed where and why lexicographers are still prone to inconsistencies in the editing processes, irrespective of their underlying corpus methodologies. Finally, we introduce a cross-reference management tool that has been developed for elexiko and we explain its technological prerequisites and implications. This software supports lexicographers in detecting existing and missing references from and to a specific headword. It also offers options to automatically and comfortably correct discrepancies. Overall, we suggest a method that includes linguistic competence, complementary corpus approaches and additional software in order to ensure that links or references between synonymic and antonymic pairings are given in both directions.
According to a widespread conception, quantitative linguistics will eventually be able to explain empirical quantitative findings (such as Zipf’s Law) by deriving them from highly general stochastic linguistic ‘laws’ that are assumed to be part of a general theory of human language (cf. Best (1999) for a summary of possible theoretical positions). Due to their formal proximity to methods used in the so-called exact sciences, theoretical explanations of this kind are assumed to be superior to the supposedly descriptive-only approaches of linguistic structuralism and its successors. In this paper I shall try to argue that on close inspection such claims turn out to be highly problematic, both on linguistic and on science-theoretical grounds.
This paper presents the Lehnwortportal Deutsch, a new, freely accessible publication platform for resources on German lexical borrowings in other languages, to be launched in the second half of 2022. The system will host digital-native sources as well as existing, digitized paper dictionaries on loanwords, initially for some 15 recipient languages. All resources remain accessible as individual standalone dictionaries; in addition, data on words (etyma, loanwords etc.) together with their senses and relations to each other is represented as a cross-resource network in a graph database, with careful distinction between information present in the original sources and the curated portal network data resulting from matching and merging information on, e. g., lexical units appearing in multiple dictionaries. Special tooling is available for manually creating graphs from dictionary entries during digitization and for editing and augmenting the graph database. The user interface allows users to browse individual dictionaries, navigate through the underlying graph and ‘click together’ complex queries on borrowing constellations in the graph in an intuitive way. The web application will be available as open source.
This paper presents the Lehnwortportal Deutsch, a new, freely accessible publication platform for resources on German lexical borrowings in other languages, to be launched in the second half of 2022. The system will host digital-native sources as well as existing, digitized paper dictionaries on loanwords, initially for some 15 recipient languages. All resources remain accessible as individual standalone dictionaries; in addition, data on words (etyma, loanwords etc.) together with their senses and relations to each other is represented as a cross-resource network in a graph database, with careful distinction between information present in the original sources and the curated portal network data resulting from matching and merging information on, e. g., lexical units appearing in multiple dictionaries. Special tooling is available for manually creating graphs from dictionary entries during digitization and for editing and augmenting the graph database. The user interface allows users to browse individual dictionaries, navigate through the underlying graph and ‘click together’ complex queries on borrowing constellations in the graph in an intuitive way. The web application will be available as open source.
Am 12. Mai 1965 nahmen der Staat Israel und die Bundesrepublik Deutschland offiziell diplomatische Beziehungen auf. Damit kam über 15 Jahre nach der Konstitution der beiden Länder und 20 Jahre nach dem Ende der Shoah ein komplexer Prozess der langsamen politischen Annäherung zu einem keineswegs selbstverständlichen Abschluss. Das fünfzigjährige Jubiläum dieses Ereignisses im Jahr 2015 war weltweit, vor allem aber in Israel und Deutschland, Anlass für zahlreiche Veranstaltungen, über die eine offizielle bilaterale Webseite <www.de50il.org/> (Stand: 6.11.2017) Auskunft gibt. Im Rahmen des Jubiläums wurde am 30. September 2015 in einer feierlichen Abendveranstaltung im Jüdischen Museum Berlin offiziell das „Wörterbuch deutscher Lehnwörter im Hebräischen“ von Uriel Adiv in einer ersten Fassung im „Lehnwortportal Deutsch“ des IDS freigeschaltet. Eine von Koautor Jakob Mendel erheblich überarbeitete und verbesserte zweite Version ging im Mai 2017 online. Der vorliegende Beitrag möchte einige Hintergründe zum deutschen Lehnwortschatz im modernen Hebräischen darstellen sowie die Entstehungsgeschichte des Werks und seinen Platz in der lehnwortlexikografischen Publikationsplattform „Lehnwortportal Deutsch“ <http://lwp.ids-mannheim.de/> (Stand: 6.11.2017) beleuchten.
Graphenbasierte Ansätze spielen in der digitalen Lexikografie eine immer stärkere Rolle. Essentiell für die Erstellung, Verwaltung und Nutzung graphenbasierter lexikografischer Ressourcen ist jedoch eine leistungsfähige und zugleich einfach zu handhabende Zugriffsstruktur, die Suchen nach komplexen Konstellationen in solchen Graphen ermöglicht. Für heutige Graphendatenbanken stehen zahlreiche Abfragesprachen zur Verfügung, deren Verwendung jedoch verhältnismäßig voraussetzungsreich ist.
Das Poster stellt einen webbasierten, frei konfigurierbaren Query Builder vor, der die Formulierung semantisch sehr komplexer Suchabfragen an eine (mit dem Tinkerpop-Standard kompatible) Property-Graphendatenbank ermöglicht. Die Abfrage erfolgt durch einfaches visuell-interaktives Zusammenstellen hierarchisch angeordneter Abfrageelemente und liefert Antworten in Echtzeit. Dabei wird von den Komplexitäten der verwendeten Low-level-Abfragesprache Gremlin abstrahiert. Der Query Builder ist ein zentrales Modul eines derzeit entwickelten Open-Source-Softwaresystems zur Verwaltung und Online-Publikation graph-erweiterter lexikografischer Ressourcen.
Making 1:n explorable: a search interface for the ZAS database of clause-embedding predicates
(2017)
We introduce a recently published corpus-based database of German clause-embedding predicates and present an innovative web application for exploring it. The application displays the predicates and the corpus examples for these predicates in two separate tables that can be browsed and searched in real time. While familiar web interface paradigms make it easy for users to get started, the data presentation and the interactive advanced search components for the two tables are designed to accommodate remarkably complex query needs without the need for resorting to a dedicated query language or a more specialized tool. The 1:n relationship between predicates and their examples is exploited in the two tables in that, e.g. the predicate table also shows, for each predicate and each example attribute, all values that occur in the examples for this predicate. An easy-to-use visual query builder for arbitrary Boolean combinations of search criteria can optionally be displayed to pre-filter the underlying data presented in both tables. Several options for altering quantifier scope can be activated with simple checkboxes and considerably widen the space of searchable constellations.
This contribution outlines a conceptual analysis of the dictionary-internal cross-reference structure in electronic dictionaries along the lines of Wiegand’s actional-theoretical text theory of print dictionaries. The discussion focuses on issues of XML-based data modeling, using the monolingual German online dictionary elexiko as a running example. The first part of the article demonstrates how Wiegand’s formal theory of mediostructure and its intricate nomenclature can be extended in a systematic and lexicographically justified way to cover the structure of the underlying lexicographical database of online dictionaries. The second part of the article applies the concepts developed to a more technical question, examining the extent to which cross-reference information can be stored and processed separately from the dictionary entry documents, e.g., in a relational database. The results are largely negative; in most real world cases, this leads to an unwanted duplication of XML-related structural information. The concluding third part briefly describes the strategy chosen for elexiko: mediostructural information is not externalized at all; cross-reference consistency checks are performed by a dictionary editing tool that takes advantage of a specialized XML database index and can easily be made more efficient and scalable by using a simple caching technique.
The present study examines the dynamics of the kanji combinations that form common (or general) and proper nouns in Japanese. The following three results were obtained. First, the degree of distribution results from two similar processes which are based on a steady-state of birth-and-death processes with different birth and death rates, resulting in a positive negative binomial distribution with the proper nouns and in a positive Waring distribution with common nouns. Second, all rank-frequency distributions follow the negative hypergeometric distribution used very frequently in ranking problems. Third, the building of kanji compounds follows a dissortative strategy. The higher the outdegree of a kanji, the more it prefers kanji with lower indegrees. A linear dependence can be observed with common nouns, whereas the relationship between compounded kanji is rather curvilinear with proper nouns. The actual analytical expression is not yet known.
Physicists look at language
(2006)
Referenz und ihre Gegenstände. Bemerkungen zur Pragmatik eines sprachphilosophischen Begriffs
(1998)
Vor allem in älteren Wörterbüchern mit philologischer Ausrichtung ist die Mikrostruktur der Artikel häufig diskursiv und unsystematisch. Eine automatisierte Digitalisierung solcher Wörterbücher mit dem Ziel, ihre logische Struktur zu kodieren, ist nicht möglich; in vielen Fällen ist schon ein Parser für ein manuell nachzubearbeitendes Rohdigitalisat kein realistisches Ziel, weil die Angabetypen des Wörterbuchs nicht klar voneinander abgrenzbar und in den Einzelartikeln nicht eindeutig identifizierbar sind. In solchen Fällen wirft auch eine nachträgliche manuelle Formalisierung der Mikrostruktur große lexikografische Probleme auf. Für komplexere Anwendungsszenarien wie etwa Abfragen in Webanwendungen kann es dennoch unumgänglich sein, wenigstens sämtliche relevanten in den Artikeln diskutierten Wortformen mit grundsätzlichen diasystematischen und morphologischen Informationen sowie ihren Relationen zueinander in einem maschinell lesbaren Format strukturiert zu repräsentieren, etwa durch datenzentrierte XML-Dokumente. Der Vortrag versucht, die lexikografischen und technischen Möglichkeiten und Grenzen einer solchen teilweisen und manuellen Retrodigitalisierung am Beispiel von Erfahrungen mit einem älteren Wörterbuch deutscher Lehnwörter im Slovenischen (Striedter-Temps 1963) auszuloten. Das Wörterbuch soll in ein Portal von Lehnwörterbüchern mit Deutsch als gemeinsamer Gebersprache eingebunden werden. Die Einzelartikel werden dem Benutzer als Bilddigitalisate zur Verfügung gestellt; die zusätzliche textuelle Retrodigitalisierung ist jedoch für komplexere, insbesondere auch für wörterbuchübergreifende und portalweite, Suchabfragen erforderlich.
Das Lehnwortportal Deutsch (LWPD) ist ein Online-Informationssystem zu Entlehnungen von Wörtern aus dem Deutschen in andere Sprachen. Es beruht auf einer wachsenden Zahl von lexikographischen Ressourcen zu verschiedenen Sprachen und bietet eine einfache ressourcenübergreifende Suchfunktion an. Das Poster präsentiert eine derzeit in Entwicklung befindliche onomasiologische Suchfunktion für das LWPD.
Sprachkritik, dahinsickernd
(2007)
Three popular collections of essays concerning correct language use in German are reviewed from a linguist’s point of view. It is claimed that the overall picture of language that Sick conveys to the layperson is inadequate; in addition, the author fails to reflect explicitly on the purpose and consequences of his prescriptive approach to language use.
Das Kapitel widmet sich den grundsätzlichen technischen Rahmenbedingungen für die heutige Internetlexikografie. Zum einen skizzieren die Autoren, was „hinter“ den auf einem Monitor sichtbaren Benutzeroberflächen geschieht, wenn eine Nutzerin online auf ein Wörterbuch zugreift, und wie diese Prozesse zu Dokumentationszwecken in I.ogdateien protokolliert werden können. Zum anderen diskutieren sie, wie die Identität und dauerhafte Verfügbarkeit von Inhalten angesichts der ständig möglichen Aktualisierbarkeit von Online-Angeboten sichergestellt werden können.
Languages employ different strategies to transmit structural and grammatical information. While, for example, grammatical dependency relationships in sentences are mainly conveyed by the ordering of the words for languages like Mandarin Chinese, or Vietnamese, the word ordering is much less restricted for languages such as Inupiatun or Quechua, as these languages (also) use the internal structure of words (e.g. inflectional morphology) to mark grammatical relationships in a sentence. Based on a quantitative analysis of more than 1,500 unique translations of different books of the Bible in almost 1,200 different languages that are spoken as a native language by approximately 6 billion people (more than 80% of the world population), we present large-scale evidence for a statistical trade-off between the amount of information conveyed by the ordering of words and the amount of information conveyed by internal word structure: languages that rely more strongly on word order information tend to rely less on word structure information and vice versa. Or put differently, if less information is carried within the word, more information has to be spread among words in order to communicate successfully. In addition, we find that–despite differences in the way information is expressed–there is also evidence for a trade-off between different books of the biblical canon that recurs with little variation across languages: the more informative the word order of the book, the less informative its word structure and vice versa. We argue that this might suggest that, on the one hand, languages encode information in very different (but efficient) ways. On the other hand, content-related and stylistic features are statistically encoded in very similar ways.
The wdlpOst dictionary writing system to be presented in this paper has been developed for the specific purposes of a lexicographical project on German loanwords in the East Slavic languages Russian, Belarusian, and Ukrainian. The project’s main objectives are (i) to document those loanwords for which a cognate lexical borrowing from German is known in Polish and (ii) to establish possible borrowing pathways for these lexical items. In the first phase of the project, the collaborative client/server architecture of the wdlpOst system has been used for excerpting detailed lexicographical information from a large range of historical and contemporary East Slavic dictionaries, taking the entries in a large dictionary of German loanwords in Polish as a common frame of reference. For the project’s second phase, the wdlpOst system provides innovative tooling for compiling entries of the East Slavic loanwords. Most importantly, the numerous word sense definitions for a set of cognate loanwords, as excerpted from different lexicographical sources, are mapped onto a system of newly defined cross-language word senses; in a similar vein, the phonemic and graphemic variation in the loanwords and their derivatives is captured through a tool that abstracts from dictionary-specific idiosyncrasies.
vernetziko is an assistive software tool primarily designed for managing cross-references in XML-based electronic dictionaries. In its current form it has been developed as an integral part of the lexicographic editing environment for the German monolingual dictionary elexiko developed and compiled at the Institut für Deutsche Sprache, Mannheim. This paper first briefly outlines how vernetziko fits into the XML-based dictionary editing technology of elexiko. Then vernetziko’s core functionality and some of the auxiliary tools integrated into the program are presented from both a practical and a technological point of view. The concluding sections discuss some software engineering aspects of extending the tool to handle cross-references between multiple resources and point out some of the advantages of vernetziko vis-à-vis corresponding features of proprietary dictionary writing systems. The software can be adapted to interconnect off-the-shelf components (database management systems and editors), thus providing a tailor-made lexicographical workbench for a wide range of XML-based dictionaries without vendor lock-in.
Open peer commentary on the target article “Who Conceives of Society?” by Ernst von Glasersfeld. Excerpt: I will focus on one crucial step in von Glasersfeld’s argumentation, viz. his view that every individual constructs his own private meanings (understood as conceptual structures or elements thereof) for linguistic expressions, so that linguistic interaction and even communication in general is based on a notion of compatibility between different speakers’ private conceptual schemes. The central question here is: “Just what does it mean that different private conceptual schemes (private meanings) are compatible, or what constitutes a viable criterion to this end?” As von Glasersfeld himself stresses twice (§28, §37), the criteria to be looked for can only be “public,” residing in properties of verbal and non-verbal actions of the interacting individuals, properties that can be sensed and processed by the participating system.
This paper deals with the distribution of word length in short native mythological and historical Eskimo narrative texts. To my knowledge, no Eskimo‐Aleut data have been the object of quantitative linguistic investigation so far. Due to the strong linguistic and Stylistic homogeneity of the examined texts it was assumed that these texts can be subsumed under a single law of word length distribution, if word length distribution of a text is considered as a function of certain of its properties, such as author, language, and genre. So far, word length distribution in texts of a wide variety of languages and genres has been demonstrated to follow distributions of the compound Poisson family of discrete probability distributions. In view of the morphological idiosyncrasies of the Eskimo language in general, which are responsible for an unusually high mean word length of about 4.5 to 5.2 syllables per word in the texts, it is interesting to see whether Eskimo texts show a significantly different behaviour with respect to word length. The results demonstrate that the Eskimo data employed in this study can be fitted well by the Hyperpoisson distribution. Two further discrete probability distributions will be deduced from certain morphology‐based assumptions about Eskimo. It turns out that most of the Eskimo data can be fitted by these two distributions. The question to what extent these results point to a more grammar‐oriented theory of word length is also discussed.
In dem Beitrag präsentieren und diskutieren die Autoren zunächst einige Untersuchungen aus der Benutzungsforschung zu elektronischen Wörterbüchern, die sich mit der nutzerseitigen Beurteilung des Mehrwerts multimedialer und benutzeradaptiver Elemente befassen (Kap. 1. In einem zweiten Teil versuchen sie, ausgehend von den Stärken und Schwächen vorhandener Ansätze in diesem Bereich, Antworten auf die Frage zu finden, welche Anforderungen an Visualisierungstechniken und ‑strategien in elektronischen Wörterbüchern gestellt werden müssen, um einen solchen Mehrwert zu erhalten (Kap. 2). Abschließend stellen sie als praktisches Beispiel für eine mögliche Umsetzung solcher Anforderungen den Prototyp einer Software zur interaktiven Erkundung von Wortbildungsangaben im Wörterbuch vor.