OPUS 4 | Search

10108 search hits

501 to 510

Sprachreport Jg. 39 (2023), Heft 2 (2023)

Segmentierungs- und Annotationsverfahren für die Texte Udo Lindenbergs: Apostrophe und andere Herausforderungen (2023)

In computational linguistics, texts are usually processed in a cascade. They are first segmented (tokenized), i.e. tokens and, where applicable, sentence boundaries are identified. This usually yields a list or single-column table, which subsequent processing steps extend with additional columns, that is, positional annotations such as parts of speech and lemmas for the tokens in the first column. During tokenization, all spaces are deleted. Punctuation marks have always been problematic here because they can be highly ambiguous, as have multi-part names, which contain spaces but actually belong together. This article focuses on the apostrophe, which is used in a wide variety of ways in Udo Lindenberg's lyrics, and on multi-part names, which we want to keep as single tokens. To this end, we use the complete Lindenberg archive of the song-korpus.de repository, categorize the phenomena that occur, create a gold standard, and develop a segmentation tool, based partly on rules and partly on machine learning, that recognizes and tokenizes according to our specifications the apostrophes in particular but also, using a lexicon, multi-part names. We then train the RNN-Tagger (Schmid, 2019) and show that training adapted specifically to these texts achieves accuracies ≥ 96%. The result is not only a gold standard of the annotated corpus, which is made available to the Songkorpus repository, but also an adapted version of the RNN-Tagger (available on GitHub) that can be used for similar texts.
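The segmentation idea in this abstract can be illustrated with a minimal, purely rule-based sketch. It is not the authors' tool, which is described as partly based on machine learning. The regular expression, the function names `tokenize` and `merge_names`, and the tiny name lexicon are all assumptions made for illustration.

```python
import re

# Hypothetical lexicon of multi-part names; the article's actual lexicon is not shown.
MULTIWORD_NAMES = [("Udo", "Lindenberg"), ("New", "York")]

# A token is either a word that may start with, contain, or end with an
# apostrophe (e.g. 'ne, geseh'n, hab'), or a single punctuation mark.
# Caveat: this rule cannot tell an elision apostrophe from a single
# quotation mark, which is exactly the kind of ambiguity the abstract mentions.
TOKEN_RE = re.compile(r"['’]?\w+(?:['’]\w+)*['’]?|[^\w\s]")


def merge_names(tokens, names=MULTIWORD_NAMES):
    """Join adjacent tokens that form a known multi-part name."""
    merged = []
    i = 0
    while i < len(tokens):
        for name in names:
            if tuple(tokens[i:i + len(name)]) == name:
                merged.append(" ".join(name))
                i += len(name)
                break
        else:
            # No lexicon entry starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


def tokenize(text):
    """Split text into tokens, keeping apostrophes and multi-part names intact."""
    return merge_names(TOKEN_RE.findall(text))


print(tokenize("Ich hab' Udo Lindenberg geseh'n."))
# ['Ich', "hab'", 'Udo Lindenberg', "geseh'n", '.']
```

Each resulting token would become one row in the first column of the annotation table, to which later steps add part-of-speech and lemma columns.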

To drop or not to drop? Predicting the omission of the infinitival marker in a Swedish future construction (2024)

Berdicevskis, Aleksandrs ; Coussé, Evie ; Koplenig, Alexander ; Adesam, Yvonne

We investigate the optional omission of the infinitival marker in a Swedish future tense construction. During the last two decades the frequency of omission has been rapidly increasing, and this process has received considerable attention in the literature. We test whether the knowledge which has been accumulated can yield accurate predictions of language variation and change. We extracted all occurrences of the construction from a very large collection of corpora. The dataset was automatically annotated with language-internal predictors which have previously been shown or hypothesized to affect the variation. We trained several models in order to make two kinds of predictions: whether the marker will be omitted in a specific utterance and how large the proportion of omissions will be for a given time period. For most of the approaches we tried, we were not able to achieve a better-than-baseline performance. The only exception was predicting the proportion of omissions using autoregressive integrated moving average models for one-step-ahead forecast, and in this case time was the only predictor that mattered. Our data suggest that most of the language-internal predictors do have some effect on the variation, but the effect is not strong enough to yield reliable predictions.

Prologue: language challenges in the 21st century (2020)

Birney, Megan E. ; Roessel, Janin ; Hansen, Karolina ; Rakić, Tamara

As immigration and mobility increase, so do interactions between people from different linguistic backgrounds. Yet while linguistic diversity offers many benefits, it also comes with a number of challenges. In seven empirical articles and one commentary, this Special Issue addresses some of the most significant language challenges facing researchers in the 21st century: the power language has to form and perpetuate stereotypes, the contribution language makes to intersectional identities, and the role of language in shaping intergroup relations. By presenting work that aims to shed light on some of these issues, the goal of this Special Issue is to (a) highlight language as integral to social processes and (b) inspire researchers to address the challenges we face. To keep pace with the world’s constantly evolving linguistic landscape, it is essential that we make progress toward harnessing language’s power in ways that benefit 21st century globalized societies.

Linguistic landscape under strict state language policy: reversing the Soviet legacy in a regional centre in Latvia (2010)

Marten, Heiko F.

This chapter will present results of a linguistic landscape (LL) project in the regional centre of Rēzekne in the region of Latgale in Eastern Latvia. Latvia was de facto a part of the Soviet Union until 1991, and this has given it a highly multilingual society. In the essentially post-colonial situation since 1991, strict language policies have been in place, which aim to reverse the language shift from Russian, the dominant language of Soviet times, back to Latvian. Thus, the main interests of the research were how the complex pattern of multilingualism in Latvia is reflected in the LL; how people relate to current language legislation; and what motivations, attitudes and emotions inform their behaviour.

Exploring sexual harassment and related attitudes in Beninese high schools: a field study (2013)

Waubert de Puiseau, Berenike ; Roessel, Janin

Sexual harassment severely impacts the educational system in the West African country Benin and the progress of women in this society, which is characterized by great gender inequality. Knowledge of the belief systems rooted in the sociocultural context is crucial to the understanding of sexual harassment. However, no study has yet investigated how sexual harassment is related to fundamental beliefs in Benin or West African countries. We conducted a field study with 265 female and male students from several high schools in Benin to investigate the link between sexual harassment and measures of ambivalent sexism, gender identity, and rape myth acceptance. Almost half of the sample reported having experienced sexual harassment personally or among peers. Levels of sexism and rape myth acceptance were very high compared to other studies. These attitudes appeared to converge in a sexist belief system that was linked to personal experiences, to the perceived probability of experiencing sexual harassment, and to fear of it. Results suggest that sexual harassment is a societal problem and that interventions need to address fundamental attitudes held in societies low in gender equality.

Ist das 3-Kreise-Modell für das Englische auf das Deutsche übertragbar? Eine Bestandsaufnahme gesellschaftlicher Funktionen der heutigen deutschen Sprache weltweit (2011)

Marten, Heiko F.

Since the 1980s, a model has become established in the academic discussion of Global English that divides the states in which English is spoken into three ideal-typical circles: the Inner Circle, in which English is the most important language of society and the L1 of a large part of the population; the Outer Circle, where English is an L2 and one important language among several; and the Extended or Expanding Circle, in which English dominates as a foreign language and lingua franca (Kachru, 1985). Drawing on a survey of the societal functions of German worldwide, this article shows that the model can also be applied to German. However, German differs from English in some substantial respects: the Inner Circle comprises the countries of the core German-speaking area, the Outer Circle comprises countries in which German is a recognized minority language, and the Extended (or, in the case of German, rather Crumbling) Circle comprises countries with individual German language islands or a German-speaking diaspora, the latter of which may have emerged only very recently. Finally, the article discusses the position of the Baltic states in this model.

A general lexicographic model for a typological variety of dictionaries in African languages (2014)

Faaß, Gertrud ; Bosch, Sonja E. ; Gouws, Rufus H.

So far, there have been few descriptions of how to create structures capable of storing lexicographic data, ISO 24613:2008 being one of the latest. Another is by Spohr (2012), who designs a multifunctional lexical resource that can store data from different types of dictionaries in a user-oriented way. Technically, his design is based on a hierarchical XML/OWL (eXtensible Markup Language/Web Ontology Language) representation model. This article takes a different route and describes a model based on entities and the relations between them; MySQL, a relational database system queried with SQL (Structured Query Language), describes a database as tables containing data together with definitions of the relations between them. The model was developed in the context of the project "Scientific eLexicography for Africa", and the lexicographic database built from it will be implemented with MySQL. The principles of the ISO model and of Spohr's model are adhered to, with one major difference in the implementation strategy: we place the sense description, not the lemma, at the centre of attention; all other elements, including the lemma, depend on the sense description. This article also describes the lexicographic data sets it contains and how they have been collected from different sources. As our aim is to compile several prototypical internet dictionaries (a monolingual Northern Sotho dictionary, a bilingual learners' Xhosa–English dictionary and a bilingual Zulu–English dictionary), we describe the necessary microstructural elements for each of them and the principles we adhere to when designing different ways of accessing them. We plan to make the model and the (empty) database, with all graphical user interfaces that have been developed, freely available by mid-2015.
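A sense-centred relational design of the kind this abstract describes might look roughly like the sketch below. It is only an illustration under stated assumptions: the table and column names (`sense`, `lemma`, `description`, `form`) are hypothetical, the sample entry is invented for the demo, and SQLite is used through Python's standard library solely so the example runs standalone, whereas the article's database is implemented with MySQL.

```python
import sqlite3

# Hypothetical schema: the sense description is the central entity,
# and every lemma depends on a sense rather than the other way round.
SCHEMA = """
CREATE TABLE sense (
    sense_id    INTEGER PRIMARY KEY,
    description TEXT NOT NULL
);
CREATE TABLE lemma (
    lemma_id INTEGER PRIMARY KEY,
    sense_id INTEGER NOT NULL REFERENCES sense(sense_id),
    language TEXT NOT NULL,
    form     TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with one illustrative sense and two lemmas."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    cur = conn.execute(
        "INSERT INTO sense (description) VALUES (?)",
        ("domesticated canine animal",),
    )
    sense_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO lemma (sense_id, language, form) VALUES (?, ?, ?)",
        [(sense_id, "Zulu", "inja"), (sense_id, "English", "dog")],
    )
    return conn, sense_id


def lemmas_for_sense(conn, sense_id):
    """Return (language, form) pairs attached to one sense, sorted by language."""
    rows = conn.execute(
        "SELECT language, form FROM lemma WHERE sense_id = ? ORDER BY language",
        (sense_id,),
    )
    return rows.fetchall()


conn, sense_id = build_demo_db()
print(lemmas_for_sense(conn, sense_id))
# [('English', 'dog'), ('Zulu', 'inja')]
```

Because lemmas hang off a sense, a monolingual and a bilingual dictionary view can in principle be derived from the same rows by filtering on `language`.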

Warum wir so wenig über die Sprachen in Deutschland wissen. Spracheinstellungen als Erkenntnisbarriere (2021)

Adler, Astrid ; Ribeiro Silveira, Maria

To date, there are no accurate, representative statistics on which languages are spoken in Germany. Various surveys do ask about mother tongues or languages spoken at home; however, owing to several flaws in survey design, the results of the existing surveys do not adequately reflect the linguistic reality of the population living in Germany. Using three surveys, the article shows that the very instruments used to collect language data are shaped by language attitudes and that this severely limits the validity of the results. These shortcomings apply to language statistics for the entire population of Germany, children and adolescents included.

Ausgangspunkte für die Betrachtung von Textqualität im digitalen Zeitalter (2023)

Wolfer, Sascha ; Abel, Andrea ; Glaznieks, Aivars ; Linthe, Maja

This article gives an overview of the methodological starting points of the project MIT. Qualität and presents some key findings on modelling, corpus-linguistic analysis, and acceptability surveys in the language community. We show how existing models of text quality can be extended on the basis of an analysis of relevant writing-guide literature. Two empirical case studies were conducted, both focusing on how textual coherence is established by means of the causal connector weil ('because'). We first present a corpus-contrastive analysis. We then show how various task designs can be used to test different aspects of acceptability in the language community.
