Linguistic studies often distinguish between spoken and written language, or between the communication of immediacy and of distance. Assuming a continuum between these poles makes it possible to locate a wide range of utterance forms, including unconventional text types such as pop songs. We design, implement, and evaluate an automated procedure that uses uncorrelated decision trees to make corresponding predictions at the text level. To identify the poles, we define a feature catalogue of linguistic phenomena that are discussed as markers of immediacy/orality or distance/literacy, and apply it to prototypical immediacy/orality texts and prototypical distance/written texts. Building on the very good classification performance, we then use the trained classifiers to locate a number of further text types. Pop songs emerge as a “middle text type” that combines linguistically motivated features from different levels of the continuum. We further show that our models locate utterances that are communicated orally but written down beforehand or afterwards, such as speeches or interviews, entirely differently from prototypical conversational data, and we uncover classification differences between social media varieties. The aim is not a systematic, binding placement on the continuum but an empirical approach to the question of which features that are comparatively easy to determine automatically (“shallow features”) demonstrably influence the placement.
So far, there have been few descriptions of how to create structures capable of storing lexicographic data, ISO 24613:2008 being one of the latest. Another is by Spohr (2012), who designs a multifunctional lexical resource that can store data from different types of dictionaries in a user-oriented way. Technically, his design is based on a hierarchical XML/OWL (eXtensible Markup Language/Web Ontology Language) representation model. This article follows another route and describes a model based on entities and the relations between them; MySQL, a relational database system queried with SQL (Structured Query Language), stores data in tables and defines relations between them. The model was developed in the context of the project "Scientific eLexicography for Africa", and the lexicographic database to be built from it will be implemented with MySQL. The principles of the ISO model and of Spohr's model are adhered to, with one major difference in the implementation strategy: we place not the lemma but the sense description at the centre of attention, and all other elements, including the lemma, depend on the sense description. This article also describes the lexicographic data sets contained in the database and how they were collected from different sources. As our aim is to compile several prototypical internet dictionaries (a monolingual Northern Sotho dictionary, a bilingual learners' Xhosa–English dictionary and a bilingual Zulu–English dictionary), we describe the microstructural elements each of them requires and the principles we adhere to when designing different ways of accessing them. We plan to make the model and the (empty) database, together with all graphical user interfaces that have been developed, freely available by mid-2015.
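The sense-centred design described in this abstract can be illustrated with a minimal sketch. It is not the authors' schema: the table names (`sense`, `lemma`), the column names, and the helper functions are hypothetical, and the example uses Python's built-in SQLite module rather than MySQL so that it runs without a database server. The only point it shows is that a lemma row cannot exist without the sense description it depends on.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical sense-centred schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        -- The sense description is the central entity.
        CREATE TABLE sense (
            sense_id    INTEGER PRIMARY KEY,
            description TEXT NOT NULL
        );
        -- A lemma depends on a sense, not the other way round.
        CREATE TABLE lemma (
            lemma_id INTEGER PRIMARY KEY,
            sense_id INTEGER NOT NULL REFERENCES sense(sense_id),
            language TEXT NOT NULL,
            form     TEXT NOT NULL
        );
        """
    )
    return conn


def add_sense(conn, description, lemmas):
    """Insert one sense and the (language, form) lemmas that express it."""
    cur = conn.execute(
        "INSERT INTO sense (description) VALUES (?)", (description,)
    )
    sense_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO lemma (sense_id, language, form) VALUES (?, ?, ?)",
        [(sense_id, language, form) for language, form in lemmas],
    )
    return sense_id


def lemmas_for_sense(conn, sense_id):
    """Return the (language, form) pairs attached to a sense, in insertion order."""
    rows = conn.execute(
        "SELECT language, form FROM lemma WHERE sense_id = ? ORDER BY lemma_id",
        (sense_id,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    # Illustrative entry only; not taken from the project's data sets.
    sid = add_sense(conn, "domesticated canine animal", [("zu", "inja"), ("en", "dog")])
    print(lemmas_for_sense(conn, sid))
```

Under this layout, a bilingual dictionary view would be a query that starts from a sense and collects its lemmas per language, whereas a lemma-centred design would start from the headword and attach senses to it.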
In this paper, we describe a schema and models which have been developed for the representation of corpora of computer-mediated communication (CMC corpora) using the representation framework provided by the Text Encoding Initiative (TEI). We characterise CMC discourse as dialogic, sequentially organised interchange between humans and point out that many features of CMC are not adequately handled by current corpus encoding schemas and tools. We formulate desiderata for a representation of CMC in encoding schemes and argue why the TEI is a suitable framework for the encoding of CMC corpora. We propose a model of basic CMC units (utterances, posts, and nonverbal activities) and the macro- and micro-level structures of interactions in CMC environments. Based on these models, we introduce CMC-core, a TEI customisation for the encoding of CMC corpora, which defines CMC-specific encoding features on the four levels of elements, model classes, attribute classes, and modules of the TEI infrastructure. The description of our customisation is illustrated by encoding examples from corpora by researchers of the TEI CMC SIG, representing a variety of CMC genres, i.e. chat, wiki talk, Twitter, blog, and Second Life interactions. The material described, i.e. schemata, encoding examples, and documentation, is available from the TEI CMC SIG wiki and will accompany a feature request to the TEI Council in late 2019.
Editorial
(2020)
In 1959, Lucien Tesnière wrote his main work Éléments de syntaxe structurale. While the impact on theoretical linguistics was not very strong at first, 50 years later there exist a variety of linguistic theories based on Tesnière's work. In computational linguistics, as in theoretical linguistics, dependency grammar was not very influential at first. The last 10–15 years, however, have brought a noticeable change and dependency grammar has found its way into computational linguistics. Syntactically annotated corpora based on dependency representations are available for a variety of languages, as well as statistical parsers which give a syntactic analysis of running text describing the underlying dependency relations between word tokens in the text. This article gives an overview of relevant areas of computational linguistics which have been influenced by dependency grammar. It discusses the pros and cons of different types of syntactic representation used in natural language processing and their suitability as representations of meaning. Finally, an attempt is made to give an outlook on the future impact of dependency grammar on computational linguistics.
Preface
(2008)
Preface
(2010)
GAIS – GesprächsAnalytisches InformationsSystem. A hypermedia learning system for conversation research
(2002)
This article presents the BMBF-funded project GAIS (GesprächsAnalytisches InformationsSystem, a conversation-analytic information system), which is being developed at the Institut für Deutsche Sprache (IDS). GAIS is a hypermedia, didactically structured learning system for conversation analysis, aimed at beginners and experts. Because it offers different levels of difficulty, it is designed for both teachers and learners. GAIS provides a platform that brings together theory, practice, examples, links, applications, and literature on conversation research. Users can take in this information, test their knowledge, and create their own analyses with technical tools.
We present a morphological analyzer for Spanish called SMM. SMM is implemented in the grammar development framework Malaga, which is based on the formalism of Left-Associative Grammar. We briefly present the Malaga framework, describe the implementation decisions for some interesting morphological phenomena of Spanish, and report on the evaluation results from the analysis of corpora. SMM was originally designed only for analyzing word forms; in this article we outline two approaches for using SMM and the facilities provided by Malaga to also generate verbal paradigms. SMM can also be embedded into applications by making use of the Malaga programming interface; we briefly discuss some application scenarios.