OPUS 4 | Search

13 search hits

1 to 10


The MULI Project: Annotation and Analysis of Information Structure in German and English (2004)

Baumann, Stefan ; Brinckmann, Caren ; Hansen-Schirra, Silvia ; Kruijff, Geert-Jan ; Kruijff-Korbayová, Ivana ; Neumann, Stella ; Steiner, Erich ; Teich, Elke ; Uszkoreit, Hans

The goal of the MULI (MUltiLingual Information structure) project is to empirically analyse information structure in German and English newspaper texts. In contrast to other projects in which information structure is annotated and investigated (e.g. in the Prague Dependency Treebank, which mirrors the basic information about the topic-focus articulation of the sentence), we do not annotate theory-biased categories like topic-focus or theme-rheme. Trying to be as theory-independent as possible, we annotate those features which are relevant to information structure and on the basis of which typical patterns, co-occurrences or correlations can be determined. We distinguish between three annotation levels: syntax, discourse and prosody. The data is based on the TIGER Corpus for German and the Penn Treebank for English, since the existing information on part-of-speech and syntactic structure can be re-used for our purposes. The actual annotation of an English example sequence illustrates our choice of categories on each level. Their combination offers the possibility to investigate how information structure is realised and can be interpreted.

Multi-dimensional annotation of linguistic corpora for investigating information structure (2004)

Baumann, Stefan ; Brinckmann, Caren ; Hansen-Schirra, Silvia ; Kruijff, Geert-Jan ; Kruijff-Korbayová, Ivana ; Neumann, Stella ; Teich, Elke

We present the annotation of information structure in the MULI project. To learn more about the information structuring means in prosody, syntax and discourse, theory-independent features were defined for each level. We describe the features and illustrate them on an example sentence. To investigate the interplay of features, the representation has to allow for inspecting all three layers at the same time. This is realised by a stand-off XML mark-up with the word as the basic unit. The theory-neutral XML stand-off annotation allows integrating this resource with other linguistic resources such as the TIGER Treebank for German or the Penn Treebank for English.

Linguistic Variation and Change in 250 Years of English Scientific Writing: A Data-Driven Approach (2020)

Bizzoni, Yuri ; Degaetano-Ortlieb, Stefania ; Fankhauser, Peter ; Teich, Elke

We trace the evolution of Scientific English through the Late Modern period to modern time on the basis of a comprehensive corpus composed of the Transactions and Proceedings of the Royal Society of London, the first and longest-running English scientific journal established in 1665. Specifically, we explore the linguistic imprints of specialization and diversification in the science domain which accumulate in the formation of “scientific language” and field-specific sublanguages/registers (chemistry, biology etc.). We pursue an exploratory, data-driven approach using state-of-the-art computational language models and combine them with selected information-theoretic measures (entropy, relative entropy) for comparing models along relevant dimensions of variation (time, register). Focusing on selected linguistic variables (lexis, grammar), we show how we deploy computational language models for capturing linguistic variation and change and discuss benefits and limitations.

Data Mining with Shallow vs. Linguistic Features to Study Diversification of Scientific Registers (2014)

Degaetano-Ortlieb, Stefania ; Fankhauser, Peter ; Kermes, Hannah ; Lapshinova-Koltunski, Ekaterina ; Ordan, Noam ; Teich, Elke

We present a methodology to analyze the linguistic evolution of scientific registers with data mining techniques, comparing the insights gained from shallow vs. linguistic features. The focus is on selected scientific disciplines at the boundaries to computer science (computational linguistics, bioinformatics, digital construction, microelectronics). The data basis is the English Scientific Text Corpus (SCITEX) which covers a time range of roughly thirty years (1970/80s to early 2000s) (Degaetano-Ortlieb et al., 2013; Teich and Fankhauser, 2010). In particular, we investigate the diversification of scientific registers over time. Our theoretical basis is Systemic Functional Linguistics (SFL) and its specific incarnation of register theory (Halliday and Hasan, 1985). In terms of methods, we combine corpus-based methods of feature extraction and data mining techniques.

How to connect language resources, infrastructures, and communities (2022)

Draxler, Christoph ; Geyken, Alexander ; Hinrichs, Erhard ; Klosa-Kückelhaus, Annette ; Teich, Elke ; Trippel, Thorsten

This chapter will present lessons learned from CLARIN-D, the German CLARIN national consortium. Members of the CLARIN-D communities and of the CLARIN-D consortium have been engaged in innovative, data-driven, and community-based research, using language resources and tools in the humanities and neighbouring disciplines. We will present different use cases and users’ stories that demonstrate the innovative research potential of large digital corpora and lexical resources for the study of language change and variation, for language documentation, for literary studies, and for the social sciences. We will emphasize the added value of making language resources and tools available in the CLARIN distributed research infrastructure and will discuss legal and ethical issues that need to be addressed in the use of such an infrastructure. Innovative technical solutions for accessing digital materials still under copyright and for data mining such materials will be presented. We will outline the need for close interaction with communities of interest in the areas of curriculum development, data management, and training the next generation of digital humanities scholars. The importance of community-supported standards for encoding language resources and the practice of community-based quality control for digital research data will be presented as a crucial step toward the provisioning of high quality research data. The chapter will conclude with a discussion of important directions for innovative research and for supporting infrastructure development over the next decade and beyond.

Combining macro- and microanalysis for exploring the construal of scientific disciplinarity (2014)

Fankhauser, Peter ; Kermes, Hannah ; Teich, Elke

Topical Diversification Over Time In The Royal Society Corpus (2016)

Fankhauser, Peter ; Knappen, Jörg ; Teich, Elke

Exploring and visualizing variation in language resources (2014)

Fankhauser, Peter ; Knappen, Jörg ; Teich, Elke

Language resources are often compiled for the purpose of variational analysis, such as studying differences between genres, registers, and disciplines, regional and diachronic variation, influence of gender, cultural context, etc. Often the sheer number of potentially interesting contrastive pairs can get overwhelming due to the combinatorial explosion of possible combinations. In this paper, we present an approach that combines well-understood techniques for visualization (heatmaps and word clouds) with intuitive paradigms for exploration (drill-down and side-by-side comparison) to facilitate the analysis of language variation in such highly combinatorial situations. Heatmaps assist in analyzing the overall pattern of variation in a corpus, and word clouds allow for inspecting variation at the level of words.

Wohin damit? Storing and reusing my language data: Minute Madness der Datenzentren (2023)

Presented at the workshop "Wohin damit? Storing and reusing my language data" on 22 June 2023 in Mannheim. The presentation was given in the context of the work of the association Nationale Forschungsdateninfrastruktur (NFDI) e.V.

Gute Forschungsdaten, bessere Forschung: wie Forschung durch Forschungsdatenmanagement unterstützt wird [Good research data, better research: how research is supported by research data management] (2018)

Mache, Beata ; Trippel, Thorsten ; Effinger, Maria ; Gradl, Tobias ; Haaf, Susanne ; Hinrichs, Erhard ; Horstmann, Wolfram ; Müller, Lydia ; Schrade, Torsten ; Teich, Elke

This panel addresses how humanities research is advanced by the planned collection, archiving, and publication of research data and by the reuse this makes possible, which both contributes to quality assurance in research and, not least, opens up new research questions. From different perspectives, the panel will examine what added value data management offers for research in the digital humanities, how this added value can be achieved, how the publication of research data can be established as a natural element of disseminating research results, and how, at the same time, the effort this requires of researchers can be estimated.
