Refine
Year of publication
Document Type
- Book (25) (remove)
Language
- English (18)
- German (5)
- Multiple languages (2)
Keywords
- Korpus <Linguistik> (11)
- Computerlinguistik (5)
- Datenmanagement (5)
- Deutsch (4)
- Forschungsdaten (3)
- Soziolinguistik (3)
- Digital Humanities (2)
- EURALEX (20 : 2022 : Mannheim) (2)
- Grammatik (2)
- Information Retrieval (2)
Publicationstate
- Veröffentlichungsversion (25) (remove)
Reviewstate
- Peer-Review (25) (remove)
Publisher
- IDS-Verlag (3)
- European Language Resources Association (ELRA) (2)
- European language resources association (ELRA) (2)
- Leibniz-Institut für Deutsche Sprache (2)
- Universitätsverlag Hildesheim (2)
- Austrian Academy of Sciences (1)
- De Gruyter (1)
- European Language Resources Association (1)
- Frontiers Media SA (1)
- Gesellschaft für Sprachtechnologie und Computerlinguistik (1)
Die Erforschung von Sprache im öffentlichen Raum (Linguistic Landscapes, LL) hat sich in den vergangen 20 Jahren als Teilgebiet der Soziolinguistik, der Semiotik und anderer Disziplinen fest etabliert. Der vorliegende Band gibt einen Überblick zu zentralen Ansätzen der LL-Forschung mit einem Bezug zur deutschen Sprache. Die Beiträge stellen aktuelle Studien aus dem deutschsprachigen Raum, zu Deutsch als Minderheitensprache sowie aus Ländern mit einer ausgeprägten DaF-Tradition vor. Sie thematisieren sprachstrukturelle und soziolinguistische ebenso wie didaktische, methodische und technologische Aspekte. Damit trägt der Band zu einer Systematisierung der deutschsprachigen LL-Forschung bei, gibt Impulse für internationale Diskussionen und benennt wichtige Desiderata.
In order to satisfy the information needs of a wide range of researchers across a number of disciplines, large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation. This daunting set of tasks has coalesced into a number of key themes and questions that are of interest to the contributing research communities: (a) what sampling techniques can we apply? (b) what quality issues should we be aware of? (c) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (d) what affordances do visualisation techniques offer for the exploratory analysis approaches of corpora? (e) what legal paths can be followed in dealing with IPR and data protection issues governing both the data sources and the query results? (f) how to guarantee that corpus data remain available and usable in a sustainable way?
Contents:
1. Christoph Kuras, Thomas Eckart, Uwe Quasthoff and Dirk Goldhahn: Automation, management and improvement of text corpus production, S. 1
2. Thomas Krause, Ulf Leser, Anke Lüdeling and Stephan Druskat: Designing a re-usable and embeddable corpus search library, S. 6
3. Radoslav Rábara, Pavel Rychlý and Ondřej Herman: Distributed corpus search, S. 10
4. Adrien Barbaresi and Antonio Ruiz Tinoco: Using elasticsearch for linguistic analysis of tweets in time and space, S. 14
5. Marc Kupietz, Nils Diewald and Peter Fankhauser: How to Get the Computation Near the Data: Improving data accessibility to, and reusability of analysis functions in corpus query platforms, S. 20
6. Roman Schneider: Example-based querying for specialist corpora, S. 26
7. Paul Rayson: Increasing interoperability for embedding corpus annotation pipelines in Wmatrix and other corpus retrieval tools, S. 33
Grammar and corpora 2016
(2018)
In recent years, the availability of large annotated and searchable corpora, together with a new interest in the empirical foundation and validation of linguistic theory and description, has sparked a surge of novel and interesting work using corpus-based methods to study the grammar of natural languages. However, a look at relevant current research on the grammar of the Germanic, Romance, and Slavic languages reveals a variety of different theoretical approaches and empirical foci, which can be traced back to different philological and linguistic traditions. Still, this current state of affairs should not be seen as an obstacle but as an ideal basis for a fruitful exchange of ideas between different research paradigms.
How can we measure the impact – such as awareness for economic, ecological, and political matters – of information, such as scientific publications, user-generated content, and reports from the public administration, based on text data? This workshop brings together research from different theoretical paradigms and methodologies for the extraction of impact-relevant indicators from natural language text data and related meta-data. The papers in this workshop represent different types of expertise in different methods for analyzing text data; spanning the whole spectrum of qualitative, quantitative, and mixed methods techniques, as well as domain expertise in the field of impact measurement. The program was built to create an interdisciplinary half-day workshop where we discuss possibilities, limitations, and synergistic effects of different approaches.
The authors are pleased to present to the readers of the Zeitschrift für Sprachwissenschaft a Special Issue in honor of Rosemarie Tracy.
Contents:
0. Frontmatter
1. Petra Schulz, Ira Gawlitzek, Angelika Wöllstein: Introduction, S. 1
2. Natascha Müller: Different sources of delay and acceleration in early child bilingualism, S. 7
3. Hubert Haider, Christina Schörghofer-Essl, Karin Seethaler: Quantifying kids prefer intersecting sets - a pilot study, S. 31
4. Petra Schulz, Rabea Schwarze: How strong is the ban on non-finite verbs in V2? Evidence from early second language learners of German with and without SLI, S. 51
5. Monika Rothweiler, Manuela Schönenberger, Franziska Sterner: Subject-verb agreement in German in bilingual children with and without SLI, S. 79
6. Holger Hopp: The processing of English which-questions in adult L2 learners: Effects of L1 transfer and proficiency, S. 107
7. Oksana Laleko, Maria Polinsky: Silence is difficult: On missing elements in bilingual grammars, S. 135
8. Artemis Alexiadou: Building verbs in language mixing varieties, S. 165
Der Sammelband zur typologisch und kontrastiv vergleichenden grammatischen Erforschung und Beschreibung des Satzanfangs des Deutschen und vier seiner Kontrastsprachen ist ein Ergebnis eines Forschungsnetzwerks, bestehend aus dem Institut für Deutsche Sprache (Mannheim) und Forschergruppen verschiedener europäischer Universitäten. Unter Berücksichtigung insbesondere morphosyntaktischer und informationsstruktureller Aspekte werden die satztopologischen Unterschiede der typologisch recht heterogenen Sprachen bzw. Sprachfamilien unter verschiedenen Gesichtspunkten beleuchtet. Die Untersuchungen werden korpusbasiert durchgeführt, wobei sich die Hälfte der Beiträge auf aufbereitete POS-getaggte Wikipedia-Korpora stützt. Die quantitativ ausgerichteten Korpusanalysen ermöglichen einen genauen Einblick in die unterschiedlichen Strukturmerkmale der betreffenden Sprachen sowie in sprachübergreifende Textmerkmale, und die qualitativen Untersuchungen zeigen Ähnlichkeiten und Abweichungen bei bestimmten Verfahren, die sich morphosyntaktisch iederschlagen und besonders am Satzanfang relevant sind. Insgesamt erlauben die Beiträge Hypothesen zu topologisch und informationsstrukturell markierten Satzanfängen und zu Präferenzen in den jeweiligen Sprachen, aber auch zu möglichen Konstanten und Gemeinsamkeiten, was – auf differenziertere Korpora erweitert – für die Bereiche Sprache und Kognition sowie computergestützte Übersetzung ein großer Gewinn sein dürfte.
Contents:
1. Vasile Pais, Maria Mitrofan, Verginica Barbu Mititelu, Elena Irimia, Roxana Micu and Carol Luca Gasan: Challenges in Creating a Representative Corpus of Romanian Micro-Blogging Text. Pp. 1-7
2. Modest von Korff: Exhaustive Indexing of PubMed Records with Medical Subject Headings. Pp. 8-15
3. Luca Brigada Villa: UDeasy: a Tool for Querying Treebanks in CoNLL-U Format. Pp. 16-19
4. Nils Diewald: Matrix and Double-Array Representations for Efficient Finite State Tokenization. Pp. 20-26
5. Peter Fankhauser and Marc Kupietz: Count-Based and Predictive Language Models for Exploring DeReKo. Pp. 27-31
6. Hanno Biber: “The word expired when that world awoke.” New Challenges for Research with Large Text Corpora and Corpus-Based Discourse Studies in Totalitarian Times. Pp. 32-35
Contents:
1. Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary and Benoît Sagot: "Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus", S.1-9.
2. Markus Gärtner, Felicitas Kleinkopf, Melanie Andresen and Sibylle Hermann: "Corpus Reusability and Copyright - Challenges and Opportunities", S.10-19.
3. Nils Diewald, Eliza Margaretha and Marc Kupietz: "Lessons learned in Quality Management for Online Research Software Tools in Linguistics", S.20-26.
Offensive language in social media is a problem currently widely discussed. Researchers in language technology have started to work on solutions to support the classification of offensive posts. We present the pilot edition of the GermEval Shared Task on the Identification of Offensive Language. This shared task deals with the classification of German tweets from Twitter. GermEval 2018 is the fourth workshop in a series of shared tasks on German processing.
The 12th Web as Corpus workshop (WAC-XII) looks at the past, present, and future of web corpora given the fact that large web corpora are nowadays provided mostly by a few major initiatives and companies, and the diversity of the early years appears to have faded slightly. Also, we acknowledge the fact that alternative sources of data (such as data from Twitter and similar platforms) have emerged, some of them only available to large companies and their affiliates, such as linguistic data from social media and other forms of the deep web. At the same time, gathering interesting and relevant web data (web crawling) is becoming an ever more intricate task as the nature of the data offered on the web changes (for example the death of forums in favour of more closed platforms).
Contents:
1. Andreas Dittrich: Intra-connecting a small exemplary literary corpus with semantic web technologies for exploratory literary studies, S. 1
2. John Kirk, Anna Čermáková: From ICE to ICC: The new International Comparable Corpus, S. 7
3. Dawn Knight, Tess Fitzpatrick, Steve Morris, Jeremy Evas, Paul Rayson, Irena Spasic, Mark Stonelake, Enlli Môn Thomas, Steven Neale, Jennifer Needs, Scott Piao, Mair Rees, Gareth Watkins, Laurence Anthony, Thomas Michael Cobb, Margaret Deuchar, Kevin Donnelly, Michael McCarthy, Kevin Scannell: Creating CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh), S. 13
4. Marc Kupietz, Andreas Witt, Piotr Bański, Dan Tufiş, Dan Cristea, Tamás Váradi: EuReCo - Joining Forces for a European Reference Corpus as a sustainable base for cross-linguistic research, S. 15
5. Harald Lüngen, Marc Kupietz: CMC Corpora in DeReKo, S. 20
6. David McClure, Mark Algee-Hewitt, Douris Steele, Erik Fredner, Hannah Walser: Organizing corpora at the Stanford Literary Lab, S. 25
7. Radoslav Rábara, Pavel Rychlý ,Ondřej Herman: Accelerating corpus search using multiple cores, S. 30
8. John Vidler, Stephen Wattam: Keeping Properties with the Data: CL-MetaHeaders – An Open Specification, S. 35
9. Vladimir Benko: Are Web Corpora Inferior? The Case of Czech and Slovak, S. 43
10. Edyta Jurkiewicz-Rohrbacher, Zrinka Kolaković, Björn Hansen: Web Corpora – the best possible solution for tracking phenomena in underresourced languages: clitics in Bosnian, Croatian and Serbian, S. 49
11. Vít Suchomel: Removing Spam from Web Corpora Through Supervised Learning Using FastText, S. 56
The 2014 issue of KONVENS is even more a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies such interaction, cooperation and integrated views can produce. This topic at the crossroads of different research traditions which deal with natural language as a container of knowledge, and with methods to extract and manage knowledge that is linguistically represented is close to the heart of many researchers at the Institut für Informationswissenschaft und Sprachtechnologie of Universität Hildesheim: it has long been one of the institute’s research topics, and it has received even more attention over the last few years. The main conference papers deal with this topic from different points of view, involving flat as well as deep representations, automatic methods targeting annotation and hybrid symbolic and statistical processing, as well as new Machine Learning-based approaches, but also the creation of language resources for both machines and humans, and methods for testing the latter to optimize their human-machine interaction properties. In line with the general topic, KONVENS-2014 focuses on areas of research which involve this cooperation of information science and computational linguistics: for example learning-based approaches, (cross-lingual) Information Retrieval, Sentiment Analysis, paraphrasing or dictionary and corpus creation, management and usability.
Dieser Band fasst die Vorträge des 9. Hildesheimer Evaluierungs- und Retrieval-Workshops (HIER) zusammen, der am 9. und 10. Juli 2015 an der Universität Hildesheim stattfand. Die HIER Workshop-Reihe begann im Jahr 2001 mit dem Ziel, die Forschungsergebnisse der Hildesheimer Informationswissenschaft zu präsentieren und zu diskutieren. Mittlerweile nehmen immer wieder Kooperationspartner von anderen Institutionen teil, was wir sehr begrüßen. HIER schafft auch ein Forum für Systemvorstellungen und praxisorientierte Beiträge.
The Encyclopedia of Terminology for Conversation Analysis and Interactional Linguistics is an online resource for students and scholars of CA/IL, publicly available on the EMCA Wiki page. Encyclopedias and glossaries are widespread across various fields and methods, and serve as immensely valuable resources. Given the extent to which the EMCA/IL community has expanded over the years—both terminologically as well as geographically—we hope that this encyclopedia of terminology will be well received by students and practitioners of CA and IL across the globe.