400 Language, Linguistics
Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus
(2021)
Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these large raw corpora are either available only for English or not available to the general public due to copyright issues. Nevertheless, there are some examples of freely available multilingual corpora for training Deep Learning NLP models, such as the OSCAR and Paracrawl corpora. However, they have quality issues, especially for low-resource languages. Moreover, recreating or updating these corpora is very complex. In this work, we try to reproduce and improve the goclassy pipeline used to create the OSCAR corpus. We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data. Also, unlike OSCAR, the metadata information is at the document level. We release our pipeline under an open source license and publish the corpus under a research-only license.
This paper focuses on lexical information systems and the framework guidelines for defining curricula within the educational system of the Autonomous Province of Bolzano/Bozen (Italy). In Italy, the competences to be achieved at different school levels are published in the form of general guidelines. On this basis, each school has to specify the general competency goals and spell them out in a concrete curriculum. In this paper I will examine to what extent lexical information systems are represented in the framework guidelines of the German and the Italian educational systems of the Autonomous Province, which are separate systems. In a second step, I will check the representations of the resources against the “Villa Vigoni Theses on Lexicography”. Finally, I will discuss the results and give an outlook for further research.
Personal pronouns and their reduced and clitic forms occupy distinctive sentence positions that differ fundamentally from the position of the coreferent full nouns. In some languages, however, they appear relative to the verb as enclitics, in others as proclitics. The paper first considers making this enclitic or proclitic position dependent on the basic word-order typology in Greenberg's sense (SVO and SOV/VSO). Although this is correct in principle, clitic pronouns in Scandinavian and clitic placement in non-declaratives force us to look for the first (rightmost) thematic discourse position in the structural sentence schema as the site of the weak pronominal forms. This assumption appears tenable for a number of unrelated languages. The focus is on languages from the three main Greenbergian types: SVO as verb-medial languages, and SOV/VSO as verb-peripheral languages. Depending on how non-declarative sentences are derived from the underlying structures, the pronominal clitic reaches an enclitic or a postclitic position, which is characterized by being the rightmost discourse-functional theme position. This insight makes the structural displacement of weak pronouns systematically derivable and predictable.
The aim of this work is to describe the criteria used for the inclusion and treatment of neologisms in dictionaries of Spanish within the framework of pandemic instability. Our starting point is data obtained by the Antenas Neológicas Network (https://www.upf.edu/web/antenas), whose representation in three different lexicographic tools will be analyzed in order to identify problems in the methodology used to dictionarize neologisms during the COVID-19 pandemic, that is, how and which words were selected for inclusion in dictionaries and how they were represented in their entries (sources and corpora of analysis, selection criteria, types of definition, among other aspects). Two of the tools are monolingual, and COVID-19 lexical units were included as part of their updates: the Antenario, a dictionary of neologisms of Spanish varieties, and the Diccionario de la Lengua Española [DLE], a dictionary of general Spanish published by the Real Academia Española [RAE], the Spanish Royal Academy. The other is a bilingual unidirectional English-Spanish dictionary first published as a glossary, the Diccionario de COVID-19 EN-ES [TREMEDICA], made up entirely of neological and non-neological lexical units related to the virus and the pandemic. Thus, the target lexis was either included in existing works or makes up the whole of a new tool located in a portal together with other lexicographic tools. Unlike other collections of COVID-19 vocabulary that kept cropping up as the pandemic unfolded, all three have been designed and written according to well-established lexicographic practices.
Our working hypothesis is that the need to record and define recently created words affects the criteria for the inclusion and treatment of neologisms in dictionaries of Spanish, producing a certain degree of overlap among features that are traditionally thought to be specific to each type of dictionary.
Language attitudes matter; they influence people’s behaviour and decisions. Therefore, it is crucial to learn more about patterns in the way that languages are evaluated. One means of doing so is using a quantitative approach with data representative of a whole population, so that results mirror dispositions at a societal level. This kind of approach is adopted here, with a focus on the situation in Germany. The article consists of two parts. First, I will present some results of a new representative survey on language attitudes in Germany (the Germany Survey 2017). Second, I will show how language attitudes penetrate even seemingly objective data collection processes by examining the German Microcensus. In 2017, for the first time in eighty years, the German Microcensus included a question on language use ‘at home’. Unfortunately, however, the question was clearly tainted by language attitudes instead of being objective. As a result, the Microcensus significantly misrepresents the linguistic reality of different migrant languages spoken in Germany.
Status und Gebrauch des Niederdeutschen 2016. Erste Ergebnisse einer repräsentativen Erhebung
(2016)
Who understands Low German (Plattdeutsch) today, and who speaks it? Who uses Low German media and cultural offerings? What ideas do people in northern Germany associate with Low German, and how do they feel about their regional language? This brochure addresses these and further questions using representative data obtained through a telephone survey of a total of 1,632 people from eight federal states (Bremen, Hamburg, Mecklenburg-Vorpommern, Niedersachsen, Schleswig-Holstein as well as Brandenburg, Nordrhein-Westfalen and Sachsen-Anhalt).
Public discourses about language, such as those conducted in the media, are typically carried on from a language-critical stance. To what extent this published opinion actually reflects the majority opinion of speakers is very much an open question. In this contribution we report on a recent survey of language attitudes in Germany. We show that the wording of the questions has a strong influence on the results, and we report which linguistic changes the respondents say they have noticed in recent times.