TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Klosa, Annette A1 - Kupietz, Marc A1 - Lüngen, Harald T1 - Zum Nutzen von Korpusauszeichnungen für die Lexikographie JF - Lexicographica : international annual for lexicography KW - Deutsch KW - OWID KW - elexiko KW - Wörterbuchkorpus KW - Dereko Y1 - 2012 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-8554 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-8554 UR - 10.1515/lexi.2012-0006 SN - 0175-6206 SS - 0175-6206 N1 - Dieser Beitrag ist mit Zustimmung des Rechteinhabers aufgrund einer (DFG geförderten) Allianz- bzw. Nationallizenz frei zugänglich. VL - 28 SP - 71 EP - 97 S1 - 27 PB - de Gruyter CY - Berlin ER - TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Kunze, Claudia A1 - Lemnitzer, Lothar A1 - Lüngen, Harald A1 - Storrer, Angelika T1 - Repräsentation und Verknüpfung allgemeinsprachlicher und terminologischer Wortnetze in OWL JF - Zeitschrift für Sprachwissenschaft : Organ der Deutschen Gesellschaft für Sprachwissenschaft N2 - This paper describes an approach to modelling a general-language wordnet, GermaNet, and a domain-specific wordnet, TermNet, in the web ontology language OWL. While the modelling process for GermaNet adopts relevant recommendations with respect to the English Princeton WordNet, for TermNet an alternative modelling concept is developed that considers the special characteristics of domain-specific terminologies. We present a proposal for linking a general-language wordnet and a terminological wordnet within the framework of OWL and on this basis discuss problems and alternative modelling approaches. 
KW - Deutsch KW - Wortnetz KW - wordnets KW - ontologies KW - terminologies KW - text technology KW - Terminologie KW - Linguistische Datenverarbeitung KW - Semantisches Netz KW - Englisch Y1 - 2007 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-8621 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-8621 UR - http://www.degruyter.com/dg/viewarticle.fullcontentlink:pdfeventlink/$002fj$002fzfsw.2007.26.issue-2$002fzfs.2007.020$002fzfs.2007.020.xml?format=INT&t:ac=j$002fzfsw.2007.26.issue-2$002fzfs.2007.020$002fzfs.2007.020.xml SN - 1613-3706 SS - 1613-3706 N1 - Dieser Beitrag ist mit Zustimmung des Rechteinhabers aufgrund einer (DFG geförderten) Allianz- bzw. Nationallizenz frei zugänglich. VL - 26 IS - 2 SP - 267 EP - 290 S1 - 24 PB - de Gruyter CY - Berlin ER - TY - CHAP U1 - Buchbeitrag A1 - Lobin, Henning A1 - Lüngen, Harald A1 - Hilbert, Mirco A1 - Bärenfänger, Maja ED - Mehler, Alexander ED - Kühnberger, Kai-Uwe ED - Lobin, Henning ED - Lüngen, Harald ED - Storrer, Angelika ED - Witt, Andreas T1 - Processing Text-Technological Resources in Discourse Parsing T2 - Modeling, Learning, and Processing of Text-Technological Data Structures N2 - Discourse parsing of complex text types such as scientific research articles requires the analysis of an input document on linguistic and structural levels that go beyond traditionally employed lexical discourse markers. This chapter describes a text-technological approach to discourse parsing. Discourse parsing with the aim of providing a discourse structure is seen as the addition of a new annotation layer for input documents marked up on several linguistic annotation levels. The discourse parser generates discourse structures according to the Rhetorical Structure Theory. An overview of the knowledge sources and components for parsing scientific journal articles is given. The parser’s core consists of cascaded applications of the GAP, a Generic Annotation Parser. 
Details of the chart parsing algorithm are provided, as well as a short evaluation in terms of comparisons with reference annotations from our corpus and with recently developed systems with a similar task. T3 - Studies in Computational Intelligence - 370 KW - Computerlinguistik KW - Textanalyse KW - Parsing Y1 - 2011 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-48317 SN - 978-3-642-22612-0 (Print) SB - 978-3-642-22612-0 (Print) SN - 978-3-642-22613-7 (Online) SB - 978-3-642-22613-7 (Online) U6 - https://dx.doi.org/10.1007/978-3-642-22613-7_3 DO - https://dx.doi.org/10.1007/978-3-642-22613-7_3 N1 - The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-22613-7_3 SP - 35 EP - 58 PB - Springer CY - Berlin/Heidelberg ER - TY - CHAP U1 - Buchbeitrag A1 - Lüngen, Harald A1 - Beißwenger, Michael A1 - Selzam, Bianca A1 - Storrer, Angelika ED - Mehler, Alexander ED - Kühnberger, Kai-Uwe ED - Lobin, Henning ED - Lüngen, Harald ED - Storrer, Angelika ED - Witt, Andreas T1 - Modelling and Processing Wordnets in OWL T2 - Modelling, Learning, and Processing of Text-Technological Data Structures N2 - In this contribution, we discuss and compare alternative options of modelling the entities and relations of wordnet-like resources in the Web Ontology Language OWL. Based on different modelling options, we developed three models of representing wordnets in OWL, i.e. the instance model, the class model, and the metaclass model. These OWL models mainly differ with respect to the ontological status of lexical units (word senses) and the synsets. While in the instance model lexical units and synsets are represented as individuals, in the class model they are represented as classes; both model types can be encoded in the dialect OWL DL. As a third alternative, we developed a metaclass model in OWL FULL, in which lexical units and synsets are defined as metaclasses, the individuals of which are classes themselves. 
We apply the three OWL models to each of three wordnet-style resources: (1) a subset of the German wordnet GermaNet, (2) the wordnet-style domain ontology TermNet, and (3) GermaTermNet, in which TermNet technical terms and GermaNet synsets are connected by means of a set of “plug-in” relations. We report on the results of several experiments in which we evaluated the performance of querying and processing these different models: (1) A comparison of all three OWL models (class, instance, and metaclass model) of TermNet in the context of automatic text-to-hypertext conversion, (2) an investigation of the potential of the GermaTermNet resource by the example of a wordnet-based semantic relatedness calculation. T3 - Studies in Computational Intelligence - 370 KW - Semantisches Netz KW - Deutsch KW - Computerlinguistik Y1 - 2011 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-48322 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-48322 SN - 978-3-642-22612-0 SB - 978-3-642-22612-0 SP - 347 EP - 376 PB - Springer CY - Berlin/Heidelberg ER - TY - CHAP U1 - Buchbeitrag A1 - Lüngen, Harald A1 - Hebborn, Mariana ED - Klawitter, Jana ED - Lobin, Henning ED - Schmidt, Torben T1 - Linguistische Annotationen für die Analyse von Gliederungsstrukturen wissenschaftlicher Texte T2 - Kulturwissenschaften Digital. 
Neue Forschungsfragen und Methoden KW - Korpus KW - Annotation KW - Ontologie Y1 - 2012 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47959 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47959 SN - 978-3-593-41287-0 SB - 978-3-593-41287-0 SP - 155 EP - 176 PB - Campus CY - Frankfurt am Main ER - TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Bärenfänger, Maja A1 - Binder, Frank A1 - Lobin, Henning A1 - Lüngen, Harald A1 - Stührenberg, Maik T1 - Editorial JF - Journal for Language Technology and Computational Linguistics KW - Computerlinguistik KW - Studium KW - E-Learning Y1 - 2011 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47970 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47970 UR - http://www.jlcl.org/index.php?modus=ausgaben&language= SN - 2190-6858 SS - 2190-6858 VL - 26 IS - 1 SP - V EP - V PB - GSCL CY - Regensburg ER - TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Binder, Frank A1 - Lüngen, Harald A1 - Lobin, Henning T1 - Sprachressourcen in der Lehre – Erfahrungen, Einsatzszenarien, Nutzerwünsche JF - Journal for Language Technology and Computational Linguistics KW - Computerlinguistik KW - Korpus KW - Studium Y1 - 2011 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47980 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47980 UR - http://www.jlcl.org/index.php?modus=ausgaben&language= SN - 2190-6858 SS - 2190-6858 VL - 26 IS - 1 SP - 53 EP - 65 PB - GSCL CY - Regensburg ER - TY - CHAP U1 - Buchbeitrag A1 - Lüngen, Harald A1 - Bärenfänger, Maja A1 - Hilbert, Mirco A1 - Lobin, Henning A1 - Puskás, Csilla ED - Witt, Andreas ED - Dieter, Metzing T1 - Discourse Relations and Document Structure T2 - Linguistic Modeling of Information and Markup Languages. 
Contributions to Language Technology N2 - This chapter addresses the requirements and linguistic foundations of automatic relational discourse analysis of complex text types such as scientific journal articles. It is argued that besides lexical and grammatical discourse markers, which have traditionally been employed in discourse parsing, cues derived from the logical and generical document structure and the thematic structure of a text must be taken into account. An approach to modelling such types of linguistic information in terms of XML-based multi-layer annotations and to a text-technological representation of additional knowledge sources is presented. By means of quantitative and qualitative corpus analyses, cues and constraints for automatic discourse analysis can be derived. Furthermore, the proposed representations are used as the input sources for discourse parsing. A short overview of the projected parsing architecture is given. T3 - Text, Speech and Language Technology - 41 KW - Discourse parsing KW - Discourse relations KW - Document structure KW - Text technology KW - Linguistic annotations KW - XML Y1 - 2010 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-48005 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-48005 SN - 978-90-481-3330-7 SB - 978-90-481-3330-7 N1 - The final publication is available at Springer via https://dx.doi.org/10.1007/978-90-481-3331-4 SP - 97 EP - 123 PB - Springer CY - Dordrecht ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Bayerl, Petra Saskia A1 - Lüngen, Harald A1 - Goecke, Daniela A1 - Witt, Andreas A1 - Naber, Daniel ED - Roisin, Cécile ED - Munson, Ethan ED - Vanoirbeek, Christine T1 - Methods for the semantic analysis of document markup T2 - Proceedings of the ACM Symposium on Document Engineering (DocEng 2003) N2 - We present an approach on how to investigate what kind of semantic information is regularly associated with the structural markup of scientific articles. 
This approach addresses the need for an explicit formal description of the semantics of text-oriented XML-documents. The domain of our investigation is a corpus of scientific articles from psychology and linguistics from both English and German online available journals. For our analyses, we provide XML-markup representing two kinds of semantic levels: the thematic level (i.e. topics in the text world that the article is about) and the functional or rhetorical level. Our hypothesis is that these semantic levels correlate with the articles’ document structure also represented in XML. Articles have been annotated with the appropriate information. Each of the three informational levels is modelled in a separate XML document, since in our domain, the different description levels might conflict so that it is impossible to model them within a single XML document. For comparing and mining the resulting multi-layered XML annotations of one article, a Prolog-based approach is used. It focusses on the comparison of XML markup that is distributed among different documents. Prolog predicates have been defined for inferring relations between levels of information that are modelled in separate XML documents. We demonstrate how the Prolog tool is applied in our corpus analyses. KW - XML KW - Semantic Analysis KW - Prolog KW - Information Extraction Y1 - 2003 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-48014 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-48014 SP - 161 EP - 170 PB - ACM CY - New York ER - TY - CHAP U1 - Buchbeitrag A1 - Lüngen, Harald A1 - Keibel, Holger ED - Steffens, Doris ED - al-Wadi, Doris T1 - Zur Erstellung und Interpretation der Zeitverlaufsgrafiken T2 - Neuer Wortschatz. Neologismen im Deutschen 2001-2010. 
Band 2: kiten – Z KW - Deutsch KW - Neologismus KW - Korpus KW - Methode Y1 - 2013 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47870 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47870 SN - 978-3-937241-43-2 SB - 978-3-937241-43-2 SP - 561 EP - 567 PB - Institut für Deutsche Sprache CY - Mannheim ET - 1. Auflage ER - TY - CHAP U1 - Buchbeitrag A1 - Lüngen, Harald A1 - Keibel, Holger ED - Steffens, Doris ED - al-Wadi, Doris T1 - Zur Erstellung und Interpretation der Zeitverlaufsgrafiken T2 - Neuer Wortschatz. Neologismen im Deutschen 2001-2010. Band 2: kiten – Z KW - Deutsch KW - Neologismus KW - Korpus KW - Methode Y1 - 2014 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47888 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47888 SN - 978-3-937241-43-2 SB - 978-3-937241-43-2 SP - 561 EP - 567 PB - Institut für Deutsche Sprache CY - Mannheim ET - 2., durchgesehene Auflage ER - TY - CHAP U1 - Buchbeitrag A1 - Lüngen, Harald A1 - Keibel, Holger ED - Steffens, Doris ED - al-Wadi, Doris T1 - Zur Erstellung und Interpretation der Zeitverlaufsgrafiken T2 - Neuer Wortschatz. Neologismen im Deutschen 2001-2010. Band 2: kiten – Z KW - Deutsch KW - Neologismus KW - Korpus KW - Methode Y1 - 2015 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47893 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47893 SN - 978-3-937241-43-2 SB - 978-3-937241-43-2 SP - 561 EP - 567 PB - Institut für Deutsche Sprache CY - Mannheim ET - 3., durchgesehene Auflage ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Sperberg-McQueen, Christopher M. A1 - Schonefeld, Oliver A1 - Kupietz, Marc A1 - Lüngen, Harald A1 - Witt, Andreas T1 - Igel: Comparing document grammars using XQuery T2 - Proceedings of Balisage. 
The Markup Conference 2013 N2 - Igel is a small XQuery-based web application for examining a collection of document grammars; in particular, for comparing related document grammars to get a better overview of their differences and similarities. In its initial form, Igel reads only DTDs and provides only simple lists of constructs in them (elements, attributes, notations, parameter entities). Our continuing work is aimed at making Igel provide more sophisticated and useful information about document grammars and building the application into a useful tool for the analysis (and the maintenance!) of families of related document grammars T3 - Balisage Series on Markup Technologies - 10 KW - Korpus KW - XML KW - XQuery Y1 - 2013 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-47919 SN - 1947-2609 SS - 1947-2609 U6 - https://dx.doi.org/10.4242/BalisageVol10.Schonefeld01 DO - https://dx.doi.org/10.4242/BalisageVol10.Schonefeld01 SP - ungezählte Seiten S1 - 6 ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Beißwenger, Michael A1 - Chanier, Thierry A1 - Chiari, Isabella A1 - Erjavec, Tomaž A1 - Fišer, Darja A1 - Herold, Axel A1 - Ljubešić, Nikola A1 - Lüngen, Harald A1 - Poudat, Céline A1 - Stemle, Egon W. A1 - Storrer, Angelika A1 - Wigham, Ciara ED - Borin, Lars T1 - Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects T2 - Proceedings of the 5th CLARIN Annual Conference. Aix-en-Provence, France. 26–28 October, 2016 N2 - The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). 
Even though there are still many open issues related to building and annotating corpora of that type, there already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped like. KW - Computerunterstützte Kommunikation KW - Korpus KW - computer-mediated communication (CMC) KW - social media interaction Y1 - 2016 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-58053 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-58053 UR - https://www.clarin.eu/news/call-papers-clarin-annual-conference-2016 SP - 5 S1 - 5 PB - CLARIN CY - Utrecht ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Beißwenger, Michael A1 - Ehrhardt, Eric A1 - Herold, Axel A1 - Lüngen, Harald A1 - Storrer, Angelika ED - Resch, Claudia ED - Hannesschläger, Vanessa ED - Wissik, Tanja T1 - Converting and Representing Social Media Corpora into TEI: Schema and best practices from CLARIN-D T2 - TEI Conference and Members' Meeting 2016. Book of Abstracts N2 - The paper presents results from a curation project within CLARIN-D, in which an existing 1M word corpus of German chat communication has been integrated into the DEREKO and DWDS corpus infrastructures of the CLARIN-D centres at the Institute for the German Language (IDS, Mannheim) and at the Berlin-Brandenburg Academy of Sciences (BBAW, Berlin). The focus is on the solutions developed for converting and representing the corpus in a TEI format. 
KW - Deutsch KW - Chatten KW - Korpus Y1 - 2016 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-55736 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-55736 UR - http://tei2016.acdh.oeaw.ac.at/sites/default/files/TEIconf2016_BookOfAbstracts.pdf SN - 978-3-200-04689-4 SB - 978-3-200-04689-4 SP - 39 EP - 41 PB - Austrian Centre for Digital Humanities, Austrian Academy of Sciences CY - Wien ER - TY - CHAP U1 - Buchbeitrag A1 - Lüngen, Harald A1 - Beißwenger, Michael A1 - Ehrhardt, Eric A1 - Herold, Axel A1 - Storrer, Angelika ED - Dipper, Stefanie ED - Neubarth, Friedrich ED - Zinsmeister, Heike T1 - Integrating corpora of computer-mediated communication in CLARIN-D: Results from the curation project ChatCorpus2CLARIN T2 - Proceedings of the 13th Conference on Natural Language Processing (KONVENS) N2 - We introduce our pipeline to integrate CMC and SM corpora into the CLARIN-D corpus infrastructure. The pipeline was developed by transforming an existing CMC corpus, the Dortmund Chat Corpus, into a resource conforming to current technical and legal standards. We describe how the resource has been prepared and restructured in terms of TEI encoding, linguistic annotations, and anonymisation. The output is a CLARIN-conformant resource integrated in the CLARIN-D research infrastructure. 
T3 - Bochumer Linguistische Arbeitsberichte - 16 KW - Deutsch KW - Chatten KW - Korpus KW - Text Encoding Initiative (TEI) Y1 - 2016 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-55743 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-55743 UR - https://www.linguistics.ruhr-uni-bochum.de/bla/ SN - 2190-0949 SS - 2190-0949 SP - 156 EP - 164 PB - Sprachwissenschaftliches Institut, Ruhr-Universität Bochum CY - Bochum ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Beißwenger, Michael A1 - Herold, Axel A1 - Lüngen, Harald A1 - Storrer, Angelika T1 - Das Dortmunder Chat-Korpus in CLARIN-D: Modellierung und Mehrwerte T2 - DHd 2016. Modellierung - Vernetzung - Visualisierung. Die Digital Humanities als fächerübergreifendes Forschungsparadigma. Konferenzabstracts KW - Deutsch KW - Chatten KW - Korpus KW - CLARIN-D Y1 - 2016 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-55788 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-55788 UR - http://dhd2016.de/boa.pdf SN - 978-3-941379-05-3 SB - 978-3-941379-05-3 SP - 274 EP - 277 PB - nisaba CY - Duisburg ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Beißwenger, Michael A1 - Ehrhardt, Eric A1 - Herold, Axel A1 - Lüngen, Harald A1 - Storrer, Angelika ED - Fišer, Darja ED - Beißwenger, Michael T1 - (Best) Practices for Annotating and Representing CMC and Social Media Corpora in CLARIN-D T2 - Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities N2 - The paper reports the results of the curation project ChatCorpus2CLARIN. The goal of the project was to develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). 
The paper presents an overview of the resources and practices developed in the project, describes the added value of the resource after its integration and discusses, as an outlook, to what extent these practices can be considered best practices which may be useful for the annotation and representation of other CMC and social media corpora. KW - CMC corpora KW - TEI encoding KW - tagging KW - corpus infrastructures KW - legal issues KW - Korpus KW - Chatten KW - Deutsch KW - Text Encoding Initiative (TEI) Y1 - 2016 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-55810 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-55810 UR - http://nl.ijs.si/janes/wp-content/uploads/2016/09/CMC-conference-proceedings-2016.pdf SN - 978-961-237-859-2 SB - 978-961-237-859-2 SP - 7 EP - 11 PB - Academic Publishing Division of the Faculty of Arts of the University of Ljubljana CY - Ljubljana ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Beißwenger, Michael A1 - Chanier, Thierry A1 - Chiari, Isabella A1 - Erjavec, Tomaž A1 - Fišer, Darja A1 - Herold, Axel A1 - Ljubešić, Nikola A1 - Lüngen, Harald A1 - Poudat, Céline A1 - Stemle, Egon W. A1 - Storrer, Angelika A1 - Wigham, Ciara ED - De Smedt, Koenraad ED - Odijk, Jan ED - Mörth, Karlheinz T1 - Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects T2 - Proceedings CLARIN Annual Conference 2016 N2 - The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). 
Even though there are still many open issues related to building and annotating corpora of that type, there already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped like. KW - Korpus KW - Computerunterstützte Kommunikation Y1 - 2016 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-55836 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-55836 SP - 5 S1 - 5 PB - Clarin CY - Aix-en-Provence ER - TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Witt, Andreas A1 - Goecke, Daniela A1 - Sasaki, Felix A1 - Lüngen, Harald T1 - Unification of XML Documents with Concurrent Markup JF - Literary and Linguistic Computing N2 - An approach to the unification of XML (Extensible Markup Language) documents with identical textual content and concurrent markup in the framework of XML-based multi-layer annotation is introduced. A Prolog program allows the possible relationships between element instances on two annotation layers that share PCDATA to be explored and also the computing of a target node hierarchy for a well-formed, merged XML document. Special attention is paid to identity conflicts between element instances, for which a default solution that takes into account metarelations that hold between element types on the different annotation layers is provided. In addition, rules can be specified by a user to prescribe how identity conflicts should be solved for certain element types. KW - XML (Extensible Markup Language) KW - Information Retrieval Y1 - 2005 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-45269 SN - 1477-4615 SS - 1477-4615 U6 - https://dx.doi.org/10.1093/llc/fqh046 DO - https://dx.doi.org/10.1093/llc/fqh046 N1 - Dieser Beitrag ist mit Zustimmung des Rechteinhabers aufgrund einer (DFG geförderten) Allianz- bzw. 
Nationallizenz frei zugänglich. This publication is with permission of the rights owner freely accessible due to an Alliance licence and a national licence (funded by the DFG, German Research Foundation) respectively. VL - 20 IS - 1 SP - 103 EP - 116 PB - Oxford University Press CY - Oxford ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Goecke, Daniela A1 - Lüngen, Harald A1 - Sasaki, Felix A1 - Witt, Andreas A1 - Farrar, Scott T1 - GOLD and Discourse: Domain- and Community-Specific Extensions T2 - Proceedings of the E-MELD Workshop on Morphosyntactic Annotation and Terminology: Linguistic Ontologies and Data Categories for Language Resources KW - Korpus KW - Interoperabilität KW - Auszeichnungssprache KW - Standardisierung Y1 - 2005 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-45321 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-45321 UR - http://emeld.org/workshop/2005/papers/goecke-paper.html SP - 8 S1 - 8 PB - E-MELD CY - Boston ER - TY - CHAP U1 - Buchbeitrag A1 - Goecke, Daniela A1 - Lüngen, Harald A1 - Metzing, Dieter A1 - Stührenberg, Maik A1 - Witt, Andreas ED - Witt, Andreas ED - Metzing, Dieter T1 - Different Views on Markup BT - Linguistic Modeling of Information and Markup Languages. Contributions to Language Technology N2 - In this chapter, two different ways of grouping information represented in document markup are examined: annotation levels, referring to conceptual levels of description, and annotation layers, referring to the technical realisation of markup using e.g. document grammars. In many current XML annotation projects, multiple levels are integrated into one layer, often leading to the problem of having to deal with overlapping hierarchies. As a solution, we propose a framework for XML-based multiple, independent XML annotation layers for one text, based on an abstract representation of XML documents with logical predicates. 
Two realisations of the abstract representation are presented, a Prolog fact base format together with an application architecture, and a specification for XML native databases. We conclude with a discussion of projects that have currently adopted this framework. T3 - Text, Speech and Language Technology - 41 KW - Computerlinguistik KW - XML KW - Annotation KW - Concurrent markup KW - XML (Extensible Markup Language) KW - Annotations Y1 - 2010 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-44966 SN - 978-90-481-3330-7 SB - 978-90-481-3330-7 U6 - https://dx.doi.org/10.1007/978-90-481-3331-4_1 DO - https://dx.doi.org/10.1007/978-90-481-3331-4_1 SP - 1 EP - 22 PB - Springer CY - Dordrecht ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Witt, Andreas A1 - Lüngen, Harald A1 - Sasaki, Felix A1 - Goecke, Daniela T1 - Unification of XML Documents with Concurrent Markup T2 - ALLCACH2004, Joint Conference of the ALLC and ACH, Göteborg KW - Computerlinguistik KW - Auszeichnungssprache Y1 - 2004 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-45391 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-45391 UR - http://www.w3.org/People/fsasaki/docs/ SP - 5 S1 - 5 PB - University of Göteborg CY - Göteborg ER - TY - CHAP U1 - Buchbeitrag A1 - Belica, Cyril A1 - Kupietz, Marc A1 - Witt, Andreas A1 - Lüngen, Harald ED - Konopka, Marek ED - Kubczak, Jacqueline ED - Mair, Christian ED - Šticha, František ED - Waßner, Ulrich Hermann T1 - The Morphosyntactic Annotation of DeReKo: Interpretation, Opportunities, and Pitfalls T2 - Grammatik und Korpora 2009. Dritte Internationale Konferenz. Mannheim, 22.-24.9.2009 T2 - Grammar & Corpora 2009. Third International Conference. Mannheim, 22.-24.9.2009 N2 - The paper discusses from various angles the morphosyntactic annotation of DeReKo, the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS), Mannheim. 
The paper is divided into two parts. The first part covers the practical and technical aspects of this endeavor. We present results from a recent evaluation of tools for the annotation of German text resources that have been applied to DeReKo. These tools include commercial products, especially Xerox' Finite State Tools and the Machinese products developed by the Finnish company Connexor Oy, as well as software for which academic licenses are available free of charge for academic institutions, e.g. Helmut Schmid's Tree Tagger. The second part focuses on the linguistic interpretability of the corpus annotations and more general methodological considerations concerning scientifically sound empirical linguistic research. The main challenge here is that unlike the texts themselves, the morphosyntactic annotations of DeReKo do not have the status of observed data; instead they constitute a theory and implementation-dependent interpretation. In addition, because of the enormous size of DeReKo, a systematic manual verification of the automatic annotations is not feasible. In consequence, the expected degree of inaccuracy is very high, particularly wherever linguistically challenging phenomena, such as lexical or grammatical variation, are concerned. Given these facts, a researcher using the annotations blindly will run the risk of not actually studying the language but rather the annotation tool or the theory behind it. The paper gives an overview of possible pitfalls and ways to circumvent them and discusses the opportunities offered by using annotations in corpus-based and corpus-driven grammatical research against the background of a scientifically sound methodology. 
T3 - Korpuslinguistik und interdisziplinäre Perspektiven auf Sprache - 1 KW - Korpus KW - Annotation KW - Schriftsprache KW - Deutsches Referenzkorpus (DeReKo) KW - Institut für Deutsche Sprache Y1 - 2009 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-44890 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-44890 SN - 978-3-8233-6648-5 SB - 978-3-8233-6648-5 SP - 451 EP - 469 PB - Narr CY - Tübingen ER - TY - CHAP U1 - Buchbeitrag A1 - Mehler, Alexander A1 - Kühnberger, Kai-Uwe A1 - Lobin, Henning A1 - Lüngen, Harald A1 - Storrer, Angelika A1 - Witt, Andreas ED - Mehler, Alexander ED - Kühnberger, Kai-Uwe ED - Lobin, Henning ED - Lüngen, Harald ED - Storrer, Angelika ED - Witt, Andreas T1 - Introduction: Modeling, Learning and Processing of Text-Technological Data Structures T2 - Modeling, Learning, and Processing of Text-Technological Data Structures N2 - Researchers in many disciplines, sometimes working in close cooperation, have been concerned with modeling textual data in order to account for texts as the prime information unit of written communication. The list of disciplines includes computer science and linguistics as well as more specialized disciplines like computational linguistics and text technology. What many of these efforts have in common is the aim to model textual data by means of abstract data types or data structures that support at least the semi-automatic processing of texts in any area of written communication. 
T3 - Studies in Computational Intelligence - 370 KW - Datenstruktur KW - Dokumentenverarbeitung KW - Textverarbeitung KW - Natürliche Sprache KW - Ontologie KW - Maschinelles Lernen KW - Syntaktische Analyse KW - Semantische Relation Y1 - 2011 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-45067 SN - 978-3-642-22612-0 SB - 978-3-642-22612-0 U6 - https://dx.doi.org/10.1007/978-3-642-22613-7_1 DO - https://dx.doi.org/10.1007/978-3-642-22613-7_1 N1 - The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-22613-7_1 SP - 1 EP - 11 PB - Springer CY - Berlin/Heidelberg ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Lüngen, Harald A1 - Witt, Andreas ED - Opas-Hänninen, Lisa Lena ED - Jokelainen, Mikko ED - Juuso, Ilkka ED - Seppänen, Tapio T1 - Multi-Dimensional Markup: N-way relations as a generalisation over possible relations between annotation layers T2 - Digital Humanities 2008. Book of Abstracts KW - Digital Humanities KW - Auszeichnungssprache KW - Annotation Y1 - 2008 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-45104 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-45104 UR - http://www.ekl.oulu.fi/dh2008/ SN - 978-951-42-8838-8 SB - 978-951-42-8838-8 SP - 254 EP - 255 PB - University of Oulu CY - Oulu ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Witt, Andreas A1 - Lüngen, Harald A1 - Gibbon, Dafydd T1 - Enhancing speech corpus resources with multiple lexical tag layers T2 - Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-2000). Athen, Griechenland N2 - We describe a general two-stage procedure for re-using a custom corpus for spoken language system development involving a transformation from character-based markup to XML, and DSSSL stylesheet-driven XML markup enhancement with multiple lexical tag trees. 
The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types). KW - DSSSL KW - Morphology KW - Speech Corpora KW - Speech Lexica KW - Text Technology KW - XML Y1 - 2000 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-45517 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-45517 UR - http://lrec-conf.org/proceedings/lrec2000/ SP - 5 S1 - 5 PB - European Language Resources Association (ELRA) CY - Paris ER - TY - CHAP U1 - Buchbeitrag A1 - Lüngen, Harald A1 - Puskás, Csilla A1 - Bärenfänger, Maja A1 - Hilbert, Mirco A1 - Lobin, Henning ED - Pahikkala, Tapio ED - Pyysalo, Sampo ED - Ginter, Filip ED - Salakoski, Tapio T1 - Discourse segmentation of German written texts T2 - Advances in natural language processing. 5th International Conference on NLP FinTAL 2006 Turku, Finnland, August 23-25 N2 - Discourse segmentation is the division of a text into minimal discourse segments, which form the leaves in the trees that are used to represent discourse structures. A definition of elementary discourse segments in German is provided by adapting widely used segmentation principles for English minimal units, while considering punctuation, morphology, syntax, and aspects of the logical document structure of a complex text type, namely scientific articles. The algorithm and implementation of a discourse segmenter based on these principles is presented, as well as an evaluation of test runs. 
KW - Computerlinguistik KW - Diskursanalyse KW - Automatische Sprachanalyse KW - Computational linguistics KW - Discourse annotation KW - Tag KW - Annotation KW - Discourse analysis Y1 - 2006 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-23 SN - 978-3-540-37334-6 SB - 978-3-540-37334-6 U6 - https://dx.doi.org/10.1007/11816508_26 DO - https://dx.doi.org/10.1007/11816508_26 N1 - The final publication is available at Springer via http://dx.doi.org/10.1007/11816508_26 SP - 245 EP - 256 S1 - 12 PB - Springer-Verlag CY - Berlin [u.a.] ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Lüngen, Harald A1 - Baerenfaenger, Maja A1 - Hilbert, Mirco A1 - Lobin, Henning A1 - Puskás, Csilla ED - Dobreva, Milena ED - Martens, Bob T1 - Text parsing of a complex genre T2 - ELPUB 2006. Digital Spectrum: Integrating Technology and Culture - Proceedings of the 10th International Conference on Electronic Publishing held in Bansko. ELPUB 2006, Bansko, Bulgaria, June 14-16 N2 - A text parsing component designed to be part of a system that assists students in academic reading and writing is presented. The parser can automatically add a relational discourse structure annotation to a scientific article that a user wants to explore. The discourse structure employed is defined in an XML format and is based on the Rhetorical Structure Theory. The architecture of the parser comprises pre-processing components which provide an input text with XML annotations on different linguistic and structural layers. In the first version these are syntactic tagging, lexical discourse marker tagging, logical document structure, and segmentation into elementary discourse segments. The algorithm is based on the shift-reduce parser by Marcu (2000) and is controlled by reduce operations that are constrained by linguistic conditions derived from an XML-encoded discourse marker lexicon. The constraints are formulated over multiple annotation layers of the same text. 
KW - Textanalyse ; Diskursanalyse ; Computerlinguistik KW - text parsing KW - discourse parsing KW - XML applications KW - rhetorical structure Y1 - 2006 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-84 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-84 UR - http://www.uni-giessen.de/germanistik/ascl/dfg-projekt/pdfs/242_elpub2006.published-version.pdf SN - 978-954-16-0040-5 SB - 978-954-16-0040-5 SP - 247 EP - 256 S1 - 10 PB - Foi-Commerce CY - Sofia ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Langer, Hagen A1 - Lüngen, Harald A1 - Bayerl, Petra Saskia T1 - Text type structure and logical document structure T2 - Proceedings of the ACL-workshop on discourse annotation N2 - Most research on automated categorization of documents has concentrated on the assignment of one or many categories to a whole text. However, new applications, e.g. in the area of the Semantic Web, require a richer and more fine-grained annotation of documents, such as detailed thematic information about the parts of a document. Hence we investigate the automatic categorization of text segments of scientific articles with XML markup into 16 topic types from a text type structure schema. A corpus of 47 linguistic articles was provided with XML markup on different annotation layers representing text type structure, logical document structure, and grammatical categories. Six different feature extraction strategies were applied to this corpus and combined in various parametrizations in different classifiers. The aim was to explore the contribution of each type of information, in particular the logical structure features, to the classification accuracy. The results suggest that some of the topic types of our hierarchy are successfully learnable, while the features from the logical structure layer had no particular impact on the results. 
KW - Computerlinguistik ; Texttypus Y1 - 2004 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-92 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-92 VL - 2004 ER - TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Lüngen, Harald A1 - Sperberg-McQueen, Christopher M. T1 - A TEI P5 Document Grammar for the IDS Text Model JF - Journal of the Text Encoding Initiative N2 - This paper describes work in progress on I5, a TEI-based document grammar for the corpus holdings of the Institut für Deutsche Sprache (IDS) in Mannheim and the text model used by IDS in its work. The paper begins with background information on the nature and purposes of the corpora collected at IDS and the motivation for the I5 project (section 1). It continues with a description of the origin and history of the IDS text model (section 2), and a description (section 3) of the techniques used to automate, as far as possible, the preparation of the ODD file documenting the IDS text model. It ends with some concluding remarks (section 4). A survey of the additional features of the IDS-XCES realization of the IDS text model is given in an appendix. KW - Korpuslinguistik KW - Institut für Deutsche Sprache KW - Deutsch KW - Standardisierung Y1 - 2012 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-2367 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-2367 UR - http://jtei.revues.org/508 IS - 3 SP - 1 EP - 18 S1 - 18 ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Beißwenger, Michael A1 - Lüngen, Harald A1 - Margaretha, Eliza A1 - Pölitz, Christian ED - Faaß, Gertrud ED - Ruppenhofer, Josef T1 - Mining corpora of computer-mediated communication: analysis of linguistic features in Wikipedia talk pages using machine learning methods T2 - Proceedings of the 12th edition of the KONVENS conference Vol. 
1 N2 - Machine learning methods offer a great potential to automatically investigate large amounts of data in the humanities. Our contribution to the workshop reports about ongoing work in the BMBF project KobRA (http://www.kobra.tu-dortmund.de) where we apply machine learning methods to the analysis of big corpora in language-focused research of computer-mediated communication (CMC). At the workshop, we will discuss first results from training a Support Vector Machine (SVM) for the classification of selected linguistic features in talk pages of the German Wikipedia corpus in DeReKo provided by the IDS Mannheim. We will investigate different representations of the data to integrate complex syntactic and semantic information for the SVM. The results shall foster both corpus-based research of CMC and the annotation of linguistic features in CMC corpora. KW - Korpus KW - Textkorpus KW - Computerlinguistik KW - Data Mining KW - internetbasierte Kommunikation KW - Korpuslinguistik KW - KorAP (Korpusanalyseplattform der nächsten Generation) Y1 - 2014 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:hil2-opus-2893 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:hil2-opus-2893 SP - 42 EP - 47 PB - Universität Hildesheim CY - Hildesheim ER - TY - CHAP U1 - Buchbeitrag A1 - Kupietz, Marc A1 - Belica, Cyril A1 - Lüngen, Harald A1 - Perkuhn, Rainer ED - Plewnia, Albrecht ED - Witt, Andreas T1 - Zwischen Empirie und Ästhetik – Ansätze zur korpuslinguistischen Untersuchung und Bewertung von Sprachwandel T2 - Sprachverfall? Dynamik – Wandel – Variation N2 - Der Beitrag beschäftigt sich mit der Frage, wie und inwieweit korpusbasierte Ansätze zur Untersuchung und Bewertung von Sprachwandel beitragen können. 
Die Bewertung von Sprachwandel erscheint in dieser Hinsicht interessant, da sie erstens von größerem öffentlichen Interesse ist, zweitens nicht zu den Kernthemen der Sprachwissenschaft zählt und drittens sowohl die geisteswissenschaftlichen Aspekte der Sprachwissenschaft berührt als auch die empirischen, die eher für die so genannten harten Wissenschaften typisch sind. Letzteres trifft bei der Frage nach Sprachverfall (gutem vs. schlechtem Deutsch diachron) vermutlich unbestrittener zu als bei der Frage nach richtigem vs. falschem Deutsch, da zu ihrer Beantwortung offensichtlich einerseits empirische, messbare Kriterien herangezogen werden müssen, andererseits aber auch weitere Kriterien notwendig sind und es außerdem einer Entscheidung zur Einordnung und Gewichtung der verschiedenartigen Kriterien sowie einer Begründung dieser Entscheidung bedarf. Zur Annäherung an die Fragestellung werden zunächst gängige, leicht operationalisierbare Hypothesen zu Symptomen eines potenziellen Verfalls des Deutschen auf verschiedenen DeReKo-basierten Korpora überprüft und im Hinblick auf ihre Verallgemeinerbarkeit und Tragweite diskutiert. Im zweiten Teil werden weitere empirische Ansätze zur Untersuchung von Wandel, Variation und Dynamik skizziert, die zur Diskussion spezieller Aspekte von Sprachverfall beitragen könnten. Im Schlussteil werden die vorgestellten Ansätze in den Gesamtkontext einer sprachwissenschaftlichen Untersuchung von Sprachverfall gestellt und vor dem Hintergrund seines gesellschaftlichen Diskurses reflektiert. T3 - Jahrbuch / Institut für Deutsche Sprache - 2013. 
KW - Korpus KW - Sprachverfall Y1 - 2014 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-31091 SN - 978-3-11-034291-8 SB - 978-3-11-034291-8 U6 - https://dx.doi.org/10.1515/9783110343007.149 DO - https://dx.doi.org/10.1515/9783110343007.149 SP - 149 EP - 169 PB - de Gruyter CY - Berlin/Boston ET - Zweitveröffentlichung ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Kupietz, Marc A1 - Lüngen, Harald T1 - Recent developments in DeReKo T2 - Proceedings of the ninth conference on international language resources and evaluation (LREC’14) N2 - This paper gives an overview of recent developments in the German Reference Corpus DeReKo in terms of growth, maximising relevant corpus strata, metadata, legal issues, and its current and future research interface. Due to the recent acquisition of new licenses, DeReKo has grown by a factor of four in the first half of 2014, mostly in the area of newspaper text, and presently contains over 24 billion word tokens. Other strata, like fictional texts, web corpora, in particular CMC texts, and spoken but conceptually written texts have also increased significantly. We report on the newly acquired corpora that led to the major increase, on the principles and strategies behind our corpus acquisition activities, and on our solutions for the emerging legal, organisational, and technical challenges. KW - Deutsches Referenzkorpus (DeReKo) KW - Institut für Deutsche Sprache KW - Deutsch KW - Korpus KW - Textkorpus Y1 - 2014 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-31353 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-31353 UR - http://www.lrec-conf.org/proceedings/lrec2014/index.html SP - 2385 S1 - 2378 PB - European Language Resources Association (ELRA) CY - Reykjavik ER - TY - CHAP U1 - Buchbeitrag A1 - Perkuhn, Rainer A1 - Belica, Cyril A1 - Keibel, Holger A1 - Kupietz, Marc A1 - Lüngen, Harald ED - Dominguez Vázquez, Maria José ED - Eichinger, Ludwig M. 
T1 - Valenz und Kookkurrenz T2 - Valenz im Fokus. Grammatische und lexikografische Studien; Festschrift für Jacqueline Kubczak KW - Deutsch KW - Verb KW - Valenz KW - Korpus KW - Distribution KW - Kookkurrenzanalyse Y1 - 2015 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-35442 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-35442 SN - 978-3-937241-49-4 SB - 978-3-937241-49-4 SP - 175 EP - 196 PB - Institut für Deutsche Sprache CY - Mannheim ER - TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Margaretha, Eliza A1 - Lüngen, Harald ED - Beißwenger, Michael ED - Storrer, Angelika ED - Oostdijk, Nelleke ED - van den Heuvel, Henk T1 - Building linguistic corpora from Wikipedia articles and discussions JF - Journal of Language Technology and Computational Linguistics. Special issue on building and annotating corpora of computer-mediated communication. Issues and challenges at the interface between computational and corpus linguistics N2 - Wikipedia is a valuable resource, useful as a linguistic corpus or a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus - DeReKo). Our approach is a two-stage conversion combining parsing using the Sweble parser, and transformation using XSLT stylesheets. The conversion approach is able to successfully generate rich and valid corpora regardless of language. We also introduce a method to segment user contributions in talk pages into postings. 
KW - Wikipedia KW - Korpus KW - Computerlinguistik KW - Deutsches Referenzkorpus (DeReKo) Y1 - 2014 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-33306 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-33306 VL - 29 IS - 2 SP - 59 EP - 82 ER - TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Lüngen, Harald T1 - DEREKO - Das Deutsche Referenzkorpus. Schriftkorpora der deutschen Gegenwartssprache am Institut für Deutsche Sprache in Mannheim JF - Zeitschrift für germanistische Linguistik KW - Deutsch KW - Korpus KW - Deutsches Referenzkorpus (DeReKo) KW - Institut für Deutsche Sprache Y1 - 2017 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-61065 SN - 1613-0626 SS - 1613-0626 U6 - https://dx.doi.org/10.1515/zgl-2017-0008 DO - https://dx.doi.org/10.1515/zgl-2017-0008 N1 - Dieser Beitrag ist mit Zustimmung des Rechteinhabers aufgrund einer (DFG geförderten) Allianz- bzw. Nationallizenz frei zugänglich. This publication is with permission of the rights owner freely accessible due to an Alliance licence and a national licence (funded by the DFG, German Research Foundation) respectively. VL - 45 IS - 1 SP - 161 EP - 170 PB - de Gruyter CY - Berlin/New York ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Lüngen, Harald A1 - Kupietz, Marc ED - Bański, Piotr ED - Kupietz, Marc ED - Lüngen, Harald ED - Rayson, Paul ED - Biber, Hanno ED - Breiteneder, Evelyn ED - Clematide, Simon ED - Mariani, John ED - Stevenson, Mark ED - Sick, Theresa T1 - CMC Corpora in DeReKo T2 - Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section. 
Birmingham, 24 July 2017 N2 - We introduce three types of corpora of computer-mediated communication that have recently been compiled at the Institute for the German Language or curated from an external project and included in DeReKo, the German Reference Corpus, namely Wikipedia (discussion) corpora, the Usenet news corpus, and the Dortmund Chat Corpus. The data and corpora have been converted to I5, the TEI customization to represent texts in DeReKo, and are researchable via the web-based IDS corpus research interfaces and in the case of Wikipedia and chat also downloadable from the IDS repository and download server, respectively. KW - Korpus KW - Deutsch KW - Internet KW - Wikipedia KW - UseNet KW - Deutsches Referenzkorpus (DeReKo) KW - Dortmunder Chat-Korpus KW - Corpus linguistics KW - Computer-mediated communication KW - CMC corpus Y1 - 2017 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-62592 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-62592 SP - 20 EP - 24 S1 - 5 PB - Institut für Deutsche Sprache CY - Mannheim ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Beißwenger, Michael A1 - Chanier, Thierry A1 - Erjavec, Tomaž A1 - Fišer, Darja A1 - Herold, Axel A1 - Ljubešić, Nikola A1 - Lüngen, Harald A1 - Poudat, Céline A1 - Stemle, Egon W. A1 - Storrer, Angelika A1 - Wigham, Ciara ED - Borin, Lars T1 - Closing a Gap in the Language Resources Landscape: Groundwork and Best Practices from Projects on Computer-mediated Communication in four European Countries T2 - Selected papers from the CLARIN Annual Conference 2016, Aix-en-Provence, 26–28 October 2016, CLARIN Common Language Resources and Technology Infrastructure N2 - The paper presents best practices and results from projects dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC) from four different countries. 
Even though there are still many open issues related to building and annotating corpora of this type, there already exists a range of tested solutions which may serve as a starting point for a comprehensive discussion on how future standards for CMC corpora could (and should) be shaped. T3 - Linköping Electronic Conference Proceedings - 136 KW - Computerunterstützte Kommunikation KW - Korpus KW - Texttechnologie KW - Annotation Y1 - 2017 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-62534 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-62534 UR - http://www.ep.liu.se/ecp/contents.asp?issue=136 SN - 1650-3740 SS - 1650-3740 SN - 978-91-7685-499-0 SB - 978-91-7685-499-0 SP - 1 EP - 18 PB - Linköping University CY - Linköping ER - TY - CHAP U1 - Buchbeitrag A1 - Kupietz, Marc A1 - Lüngen, Harald A1 - Bański, Piotr A1 - Belica, Cyril T1 - Maximizing the potential of very large corpora: 50 years of big language data at IDS Mannheim T2 - Proceedings of the LREC-2014-workshop challenges in the management of large corpora (CMLC2) N2 - Very large corpora have been built and used at the IDS since its foundation in 1964. They have been made available on the Internet since the beginning of the 90’s to currently over 30,000 researchers worldwide. The Institute provides the largest archive of written German (Deutsches Referenzkorpus, DeReKo) which has recently been extended to 24 billion words. DeReKo has been managed and analysed by engines known as COSMAS and afterwards COSMAS II, which is currently being replaced by a new, scalable analysis platform called KorAP. KorAP makes it possible to manage and analyse texts that are accompanied by multiple, potentially conflicting, grammatical and structural annotation layers, and is able to handle resources that are distributed across different, and possibly geographically distant, storage systems. 
The majority of texts in DeReKo are not licensed for free redistribution, hence, the COSMAS and KorAP systems offer technical solutions to facilitate research on very large corpora that are not available (and not suitable) for download. For the new KorAP system, it is also planned to provide sandboxed environments to support non-remote-API access “near the data” through which users can run their own analysis programs. KW - Deutsch KW - Korpus KW - Textkorpus KW - very large corpora KW - scalability KW - big data Y1 - 2014 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-31634 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-31634 UR - http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-CMLC2%20Proceedings-rev2.pdf SP - 1 EP - 6 PB - ELRA CY - Reykjavik ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Schröck, Jasmin A1 - Lüngen, Harald ED - Beißwenger, Michael ED - Zesch, Torsten T1 - Building and Annotating a Corpus of German-Language Newsgroups T2 - NLP4CMC 2015. 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media. Proceedings of the Workshop, September 29, 2015 University of Duisburg-Essen, Campus Essen N2 - Usenet is a large online resource containing user-generated messages (news articles) organised in discussion groups (newsgroups) which deal with a wide variety of different topics. We describe the download, conversion, and annotation of a comprehensive German news corpus for integration in DeReKo, the German Reference Corpus hosted at the Institut für Deutsche Sprache in Mannheim. 
KW - Korpus KW - Annotation KW - Deutsches Referenzkorpus (DeReKo) KW - Textkorpus Y1 - 2015 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-43640 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-43640 SP - 17 EP - 22 PB - German Society for Computational Linguistics & Language Technology (GSCL) ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Beißwenger, Michael A1 - Ehrhardt, Eric A1 - Horbach, Andrea A1 - Lüngen, Harald A1 - Steffen, Diana A1 - Storrer, Angelika ED - Beißwenger, Michael ED - Zesch, Torsten T1 - Adding Value to CMC Corpora: CLARINification and Part-of-speech Annotation of the Dortmund Chat Corpus T2 - NLP4CMC 2015. 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media. Proceedings of the Workshop, September 29, 2015 University of Duisburg-Essen, Campus Essen KW - Korpus KW - Computerunterstützte Kommunikation KW - Deutsch KW - Elektronisches Forum KW - Dortmunder Chat-Korpus KW - Annotation Y1 - 2015 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-43654 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-43654 SP - 12 EP - 16 PB - German Society for Computational Linguistics & Language Technology (GSCL) ER - TY - CHAP U1 - Buchbeitrag A1 - Hilbert, Mirco A1 - Lüngen, Harald A1 - Bärenfänger, Maja A1 - Lobin, Henning ED - Storrer, Angelika ED - Geyken, Alexander ED - Siebert, Alexander ED - Würzner, Kay-Michael T1 - Demonstration des SemDok-Textparsers T2 - KONVENS 2008 – Ergänzungsband. Textressourcen und lexikalisches Wissen N2 - Im Teilprojekt CI “SemDok” der DFG-Forschergruppe Texttechnologische Informationsmodellierung wurde ein Textparser für Diskursstrukturen wissenschaftlicher Zeitschriftenartikel nach der Rhetorical Structure Theory entwickelt. Die wesentlichen konzeptuellen und technischen Merkmale des Chart-Parsers und die sich daraus ergebenden Parametrisierungsmöglichkeiten für Parsing-Experimente werden beschrieben. 
Zudem wird HPVtz., ein Tool für die Visualisierung von Parsing-Ergebnissen (RST-Bäume in einer XML-Anwendung) und die Navigation in ihnen, vorgestellt. KW - Automatische Sprachanalyse KW - Computerlinguistik KW - Syntaktische Analyse Y1 - 2008 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-43032 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-43032 SN - 978-3-00-025611-0 SB - 978-3-00-025611-0 SP - 21 EP - 27 PB - BBAW CY - Berlin ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Lüngen, Harald A1 - Gibbon, Dafydd ED - Gippert, Jost ED - Olivier, Peter T1 - Consistent Vocabularies for Spoken Language Machine Translation Systems T2 - GLDV '99. Multilinguale Corpora: Codierung, Strukturierung, Analyse. 11. Jahrestagung der Gesellschaft für Linguistische DatenVerarbeitung. 8.-10.7.1999, Frankfurt a/M KW - Gesprochene Sprache KW - Maschinelle Übersetzung KW - Korpus KW - Wortschatz Y1 - 1999 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-74583 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-74583 UR - http://titus.uni-frankfurt.de/curric/gldv99/paper/luengen/gibblux.pdf SP - 169 EP - 178 PB - enigma corporation CY - Prag ER - TY - CHAP U1 - Buchbeitrag A1 - Kupietz, Marc A1 - Lüngen, Harald A1 - Kamocki, Paweł A1 - Witt, Andreas ED - Calzolari, Nicoletta ED - Choukri, Khalid ED - Cieri, Christopher ED - Declerck, Thierry ED - Goggi, Sara ED - Hasida, Koiti ED - Isahara, Hitoshi ED - Maegaard, Bente ED - Mariani, Joseph ED - Mazo, Hélène ED - Moreno, Asuncion ED - Odijk, Jan ED - Piperidis, Stelios ED - Tokunaga, Takenobu T1 - The German reference corpus DeReKo: new developments – new opportunities T2 - Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018), 7-12 May 2018, Miyazaki, Japan N2 - This paper discusses current trends in DeReKo, the German Reference Corpus, concerning legal issues around the recent German copyright reform with positive 
implications for corpus building and corpus linguistics in general, recent corpus extensions in the genres of popular magazines, journals, historical texts, and web-based football reports. Besides, DeReKo is finally accessible via the new corpus research platform KorAP, offering registered users several new features in comparison with its predecessor COSMAS II. KW - Korpus KW - Deutsch Y1 - 2018 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-74917 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-74917 UR - http://www.lrec-conf.org/proceedings/lrec2018/summaries/737.html SN - 979-10-95546-00-9 SB - 979-10-95546-00-9 SP - 4354 EP - 4360 PB - European language resources association (ELRA) CY - Paris, France ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Lüngen, Harald A1 - Beißwenger, Michael A1 - Herzberg, Laura A1 - Pichler, Cathrin ED - Stemle, Egon W. ED - Wigham, Ciara R. T1 - Anonymisation of the Dortmund Chat Corpus 2.1 T2 - Proceedings of the 5th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora17), 3-4 October 2017, Eurac Research, Italy N2 - As a consequence of a recent curation project, the Dortmund Chat Corpus is available in CLARIN-D research infrastructures for download and querying. A legal opinion had recommended that standard measures of anonymisation be applied to the corpus before its republication. This paper reports on the anonymisation campaign that was conducted for the corpus. Anonymisation has been realised as categorisation, and the taxonomy of anonymisation categories applied is introduced and the method of applying it to the TEI files is demonstrated. The results of the anonymisation campaign as well as issues of quality assessment are discussed. Finally, pseudonymisation as an alternative to categorisation as a method of the anonymisation of CMC data is discussed, as well as possibilities of an automatisation of the process. 
KW - corpora KW - computer-mediated communication KW - anonymisation KW - Korpus KW - Computerunterstützte Kommunikation KW - Anonymisierung KW - Chatten Y1 - 2017 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-66392 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-66392 UR - https://cmc-corpora2017.eurac.edu/proceedings/cmccorpora17-proceedings.pdf SP - 21 EP - 24 CY - Bolzano ET - First edition ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Beißwenger, Michael A1 - Wigham, Ciara A1 - Etienne, Carole A1 - Fišer, Darja A1 - Grumt Suárez, Holger A1 - Herzberg, Laura A1 - Hinrichs, Erhard A1 - Horsmann, Tobias A1 - Karlova-Bourbonus, Natali A1 - Lemnitzer, Lothar A1 - Longhi, Julien A1 - Lüngen, Harald A1 - Ho-Dac, Lydia-Mai A1 - Parisse, Christophe A1 - Poudat, Céline A1 - Schmidt, Thomas A1 - Stemle, Egon W. A1 - Storrer, Angelika A1 - Zesch, Torsten ED - Stemle, Egon W. ED - Wigham, Ciara R. T1 - Connecting resources: Which issues have to be solved to integrate CMC corpora from heterogeneous sources and for different languages? T2 - Proceedings of the 5th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora17). 3-4 October 2017, Eurac Research, Italy N2 - The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability – with respect to the structural representation of CMC genres, linguistic annotations metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to partake in a discussion about standards for CMC corpora and for the creation of a CMC corpus infrastructure across languages and genres. 
In view of the broad range of corpus projects which are currently underway all over Europe, there is a great window of opportunity for the creation of standards in a bottom-up approach. KW - corpora KW - research infrastructures KW - annotation KW - anonymization KW - Korpus KW - Annotation KW - Anonymisierung Y1 - 2017 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-66401 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-66401 UR - https://cmc-corpora2017.eurac.edu/proceedings/cmccorpora17-proceedings.pdf SP - 52 EP - 55 CY - Bolzano ET - First edition ER - TY - CHAP U1 - Buchbeitrag A1 - Beißwenger, Michael A1 - Lüngen, Harald A1 - Schallaböck, Jan A1 - Weitzmann, John H. A1 - Herold, Axel A1 - Kamocki, Paweł A1 - Storrer, Angelika A1 - Wildgans, Julia ED - Beißwenger, Michael T1 - Rechtliche Bedingungen für die Bereitstellung eines Chat-Korpus in CLARIN-D. Ergebnisse eines Rechtsgutachtens T2 - Empirische Erforschung internetbasierter Kommunikation T3 - Empirische Linguistik - 9 KW - Korpus KW - Gesprochene Sprache KW - Internet KW - Rechtsstatus Y1 - 2017 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-66494 SN - 978-3-11-056614-7 SB - 978-3-11-056614-7 U6 - https://dx.doi.org/10.1515/9783110567786-002 DO - https://dx.doi.org/10.1515/9783110567786-002 SP - 7 EP - 46 PB - De Gruyter CY - Berlin [u.a.] ER - TY - CHAP U1 - Buchbeitrag A1 - Klosa, Annette A1 - Lüngen, Harald ED - Čibej, Jaka ED - Gorjanc, Vojko ED - Kosem, Iztok ED - Krek, Simon T1 - New German words: detection and description T2 - Proceedings of the XVIII EURALEX International Congress Lexicography in Global Contexts 17-21 July 2018, Ljubljana N2 - In this paper, we discuss an efficient method of (semi-automatic) neologism detection for German and its application for the production of a dictionary of neologisms, focusing on the lexicographic process. 
By monitoring the language via editorial (print and online) media evaluation and interpreting the findings on the basis of lexicographic competence, many, but not all neologisms can be identified which qualify for inclusion in the Neologismenwörterbuch (2006-today) at the Institute for the German Language in Mannheim (IDS). In addition, an automated corpus linguistic method offers neologism candidates based on a systematic analysis of large amounts of text to lexicographers. We explain the principles of the corpus linguistic compilation of a list of candidates and show how lexicographers work with the results, combining them with their own findings in order to continuously enlarge this specialized online dictionary of new words in German. KW - detection of neologisms KW - description of neologisms KW - corpus linguistics KW - lexicography KW - Korpus KW - Neologismus KW - Lexikographie Y1 - 2018 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-77188 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-77188 UR - http://euralex2018.cjvt.si/wp-content/uploads/sites/6/2018/07/Euralex_2018.pdf SP - 559 EP - 569 PB - Znanstvena založba Filozofske fakultete Univerze v Ljubljani / Ljubljana University Press, Faculty of Arts CY - Ljubljana ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Bärenfänger, Maja A1 - Hilbert, Mirco A1 - Lobin, Henning A1 - Lüngen, Harald T1 - Using OWL ontologies in discourse parsing T2 - Ontologies in Text Technology: Approaches to Extract Semantic Knowledge from Structured Information KW - Diskursanalyse KW - Parser KW - Strukturbaum KW - Textverarbeitung KW - XML (Extensible Markup Language) KW - OWL-Ontology Y1 - 2007 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-77234 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-77234 UR - https://portal.ikw.uni-osnabrueck.de/~ott06/ott06-abstracts/baerenfaenger_etal_abstract.pdf SP - 1 EP - 6 PB - Institut für Kognitionswissenschaft 
Universität Osnabrück CY - Osnabrück ER - TY - RPRT U1 - Arbeitspapier A1 - Witt, Andreas A1 - Lüngen, Harald A1 - Gibbon, Dafydd T1 - Standardisierung orthographischer Transkriptionen: Ein SGML/TEI-basierter Vorschlag für VERBMOBIL (VM-Memo 117) T2 - Verbmobil KW - SGML KW - Text Encoding Initiative KW - Transkription KW - Gesprochene Sprache Y1 - 1997 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-75945 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-75945 SP - 1 EP - 26 PB - Universität Bielefeld CY - Bielefeld ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Kupietz, Marc A1 - Margaretha, Eliza A1 - Diewald, Nils A1 - Lüngen, Harald A1 - Fankhauser, Peter ED - Bański, Piotr ED - Barbaresi, Adrien ED - Biber, Hanno ED - Breiteneder, Evelyn ED - Clematide, Simon ED - Kupietz, Marc ED - Lüngen, Harald ED - Iliadi, Caroline T1 - What's New in EuReCo? Interoperability, Comparable Corpora, Licensing T2 - Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019 N2 - This paper reports on the latest developments of the European Reference Corpus EuReCo and the German Reference Corpus in relation to three of the most important CMLC topics: interoperability, collaboration on corpus infrastructure building, and legal issues. Concerning interoperability, we present new ways to access DeReKo via KorAP on the API and on the plugin level. In addition, we report on advancements in the EuReCo and ICC initiatives with the provision of comparable corpora, and on recent problems with license acquisitions and our solution approaches using an indemnification clause and model licenses that include scientific exploitation. 
KW - corpus linguistics KW - corpus management KW - corpus processing KW - comparable corpora KW - Korpus Y1 - 2019 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-90261 U6 - https://dx.doi.org/10.14618/ids-pub-9026 DO - https://dx.doi.org/10.14618/ids-pub-9026 SP - 33 EP - 39 PB - Leibniz-Institut für Deutsche Sprache CY - Mannheim ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Lüngen, Harald A1 - Lobin, Henning T1 - Extracting domain knowledge from tables of contents T2 - Digital Humanities 2010. Conference Abstracts. King’s College London, London July 7 – 10, 2010 N2 - Knowledge in textual form is always presented as visually and hierarchically structured units of text, which is particularly true in the case of academic texts. One research hypothesis of the ongoing project Knowledge ordering in texts - text structure and structure visualisations as sources of natural ontologies is that the textual structure of academic texts effectively mirrors essential parts of the knowledge structure that is built up in the text. The structuring of a modern dissertation thesis (e.g. in the form of an automatically generated table of contents - ToCs), for example, represents a compromise between requirements of the text type and the methodological and conceptual structure of its subject-matter. The aim of the project is to examine how visual-hierarchical structuring systems are constructed, how knowledge structures are encoded in them, and how they can be exploited to automatically derive ontological knowledge for navigation, archiving, or search tasks. The idea to extract domain concepts and semantic relations mainly from the structural and linguistic information gathered from tables of contents represents a novel approach to ontology learning. 
KW - Wissensrepräsentation KW - Ontologie KW - Semantische Relation KW - Visualisierung Y1 - 2010 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-76096 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-76096 SN - 978-0-9565793-0-0 SB - 978-0-9565793-0-0 SP - 331 PB - Office for Humanities Communication; Centre for Computing in the Humanities (King’s College London) CY - London ER - TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Bärenfänger, Maja A1 - Hilbert, Mirco A1 - Lobin, Henning A1 - Lüngen, Harald T1 - OWL ontologies as a resource for discourse parsing JF - LDV-Forum - GLDV-Journal for Computational 
Linguistics and Language Technology N2 - In the project SemDok (Generic document structures in linearly organised texts) funded by the German Research Foundation DFG, a discourse parser for a complex text type (scientific articles, by way of example) is being developed. Discourse parsing (henceforth DP) according to the Rhetorical Structure Theory (RST) (Mann and Taboada, 2005; Marcu, 2000) deals with automatically assigning a text a tree structure in which discourse segments and rhetorical relations between them, such as Concession, are marked. For identifying the combinable segments, declarative rules are employed, which describe linguistic and structural cues and constraints about possible combinations by referring to different XML annotation layers of the input text, and external knowledge bases such as a discourse marker lexicon, a lexico-semantic ontology (later to be combined with a domain ontology), and an ontology of rhetorical relations. In our text-technological environment, the obvious choice of formalism to represent such ontologies is OWL (Smith et al., 2004). In this paper, we describe two OWL ontologies and how they are consulted from the discourse parser to solve certain tasks within DP. The first ontology is a taxonomy of rhetorical relations which was developed in the project. The second one is an OWL version of GermaNet, the model of which we designed together with our project partners. 
KW - Textstruktur KW - Strukturbaum KW - Parser KW - Ontologie Y1 - 2008 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-76105 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-76105 UR - http://www.jlcl.org/2008_Heft1/LDV_Forum_23_(1).pdf SN - 0175-1336 SS - 0175-1336 VL - 23 IS - 1 SP - 17 EP - 26 PB - Gesellschaft für Linguistische Datenverarbeitung CY - Bonn ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Hilbert, Mirco A1 - Lobin, Henning A1 - Bärenfänger, Maja A1 - Lüngen, Harald A1 - Puskás, Csilla ED - Butt, Miriam T1 - A text-technological approach to automatic discourse analysis of complex texts T2 - Proceedings of KONVENS 2006 (Konferenz zur Verarbeitung natürlicher Sprache) N2 - This paper describes the development of a relational discourse parsing architecture for text documents of a complex text type, namely scientific articles. To achieve this goal, several different linguistic knowledge sources and auxiliary analyses on different linguistic levels are necessary. KW - Parser KW - Textstruktur KW - Wissenschaftssprache KW - Diskursanalyse Y1 - 2006 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:352-opus-20131 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:352-opus-20131 SN - 3-89318-050-8 SB - 3-89318-050-8 SP - 52 EP - 55 PB - Universität Konstanz CY - Konstanz ER - TY - CHAP U1 - Buchbeitrag A1 - Bärenfänger, Maja A1 - Lüngen, Harald A1 - Hilbert, Mirco A1 - Lobin, Henning ED - Kühnlein, Peter ED - Benz, Anton ED - Sidner, Candace L. T1 - The role of generic and logical document structure in relational discourse analysis T2 - Constraints in Discourse 2 N2 - This study examines what kind of cues and constraints for discourse interpretation can be derived from the logical and generic document structure of complex texts by the example of scientific journal articles. 
We performed statistical analysis on a corpus of scientific articles annotated on different annotation layers within the framework of XML-based multi-layer annotation. We introduce different discourse segment types that constrain the textual domains in which to identify rhetorical relation spans, and we show how a canonical sequence of text type structure categories is derived from the corpus annotations. Finally, we demonstrate how and which text type structure categories assigned to complex discourse segments of the type “block” statistically constrain the occurrence of rhetorical relation types. T3 - Pragmatics & beyond : new series - 194 KW - Diskursanalyse KW - Texttechnologie KW - Korpus KW - Wissenschaftssprache KW - Logical Document Structure KW - Generic Document Structure KW - Annotation Y1 - 2010 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-77647 U6 - https://dx.doi.org/10.1075/pbns.194.05bar DO - https://dx.doi.org/10.1075/pbns.194.05bar N1 - This is a postprint of an article that was published in the book "Constraints in Discourse 2". The published article is under copyright of Benjamins. The publisher should be contacted for permission to re-use or reprint the material in any form. SP - 81 EP - 104 PB - Benjamins CY - Amsterdam/Philadelphia ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Lüngen, Harald A1 - Herzberg, Laura ED - Vandekerckhove, Reinhild ED - Fišer, Darja ED - Hilte, Lisa T1 - Reply relations in CMC: types and annotation T2 - Proceedings of the 6th Conference on Computer-Mediated Communication (CMC) and Social Media Corpora (CMC-corpora 2018), 17-18 September 2018, University of Antwerp N2 - This paper analyses reply relations in computer-mediated communication (CMC), which occur between post units in CMC interactions and which describe references between posts. We take a look at existing practices in the description and annotation of such relations in chat, wiki talk, and blog corpora. 
We distinguish technical reply structures, indentation structures, and interpretative reply relations, which include reply relations induced by linguistic markers. We sort out the different levels of description and annotation that are involved and propose a solution for their combined representation within the TEI annotation framework. KW - reply relations KW - computer-mediated communication KW - CMC KW - corpus annotation KW - TEI KW - Korpus KW - Annotation KW - Text Encoding Initiative KW - Computerunterstützte Kommunikation Y1 - 2018 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-80414 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-80414 UR - https://www.uantwerpen.be/en/conferences/cmc-social-media-2018/proceedings/ SN - 9789057285868 SB - 9789057285868 SP - 49 EP - 52 PB - University of Antwerp CY - Antwerpen ER - TY - CHAP U1 - Buchbeitrag A1 - Lüngen, Harald A1 - Herzberg, Laura ED - Vandekerckhove, Reinhild ED - Fišer, Darja ED - Hilte, Lisa T1 - Reply relations in CMC: types and annotation T2 - Proceedings of the 6th Conference on Computer-Mediated Communication (CMC) and Social Media Corpora (CMC-corpora 2018), 17-18 September 2018, University of Antwerp N2 - This paper analyses reply relations in computer-mediated communication (CMC), which occur between post units in CMC interactions and which describe references between posts. We take a look at existing practices in the description and annotation of such relations in chat, wiki talk, and blog corpora. We distinguish technical reply structures, indentation structures, and interpretative reply relations, which include reply relations induced by linguistic markers. We sort out the different levels of description and annotation that are involved and propose a solution for their combined representation within the TEI annotation framework. 
KW - reply relations KW - computer-mediated communication KW - CMC KW - corpus annotation KW - TEI KW - Computerunterstützte Kommunikation KW - Korpus KW - Annotation KW - Text Encoding Initiative KW - Antwort Y1 - 2018 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-81268 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-81268 UR - https://www.uantwerpen.be/en/conferences/cmc-social-media-2018/proceedings/ SN - 9789057285868 SB - 9789057285868 SP - 49 EP - 52 PB - University of Antwerp CY - Antwerpen ER - TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Lüngen, Harald A1 - Herzberg, Laura T1 - Types and annotation of reply relations in computer-mediated communication JF - European Journal of Applied Linguistics (EuJAL) N2 - This paper presents types and annotation layers of reply relations in computer-mediated communication (CMC). Reply relations hold between post units in CMC interactions and describe references from one given post to a previous post. We classify three types of reply relations in CMC interactions: first, technical replies, i.e. the possibility to reply directly to a previous post by clicking a ‘reply’ button; second, indentations, e.g. in wiki talk pages in which users insert their contributions in the existing talk page by indenting them; and third, interpretative reply relations, i.e. the reply action is not realised formally but signalled by other structural or linguistic means such as address markers ‘@’, greetings, citations and/or Q-A structures. We take a look at existing practices in the description and representation of such relations in corpora and examples of chat, Wikipedia talk pages, Twitter and blogs. We then provide an annotation proposal that combines the different levels of description and representation of reply relations and which adheres to the schemas and practices for encoding CMC corpus documents within the TEI framework as defined by the TEI CMC SIG. 
It constitutes a prerequisite for correctly identifying higher levels of interactional relations such as dialogue acts or discussion trees. N2 - Der vorliegende Artikel stellt Typen und Annotationsebenen von Antwortrelationen in der internetbasierten Kommunikation (IBK) vor. Antwortrelationen bestehen zwischen Posts in IBK-Interaktionen und beschreiben Referenzen, die zwischen einem Initialbeitrag und einem Folgebeitrag bestehen. Wir klassifizieren drei Arten von Antwortrelationen in IBK-Interaktionen: erstens, technische Antwortrelationen, welche dadurch gekennzeichnet sind, dass durch das Betätigen einer „Antwort“-Schaltfläche eine Antwort initiiert wird, bspw. in Blogs; zweitens, Einrückungen, z. B. auf Wikipedia-Diskussionsseiten, in denen Benutzer ihre Beiträge in die entsprechende Stelle des Diskussionsverlaufs einfügen, indem sie ihre Beiträge einrücken und drittens, interpretative Antwortrelationen, bei denen die Antwort nicht formal realisiert wird, sondern durch andere strukturelle oder linguistische Mittel signalisiert werden, wie z. B. dem Adressierungsmarker „@“, Begrüßungs- und Verabschiedungsformeln, Zitaten und/oder Frage-Antwort-Strukturen. Wir analysieren die bestehenden Praktiken bei der Beschreibung und Darstellung solcher Relationen in Korpora und geben Beispiele für Chat, Wikipedia-Diskussionsseiten, Twitter und Blogs. Anschließend präsentieren wir einen Annotationsvorschlag, der die verschiedenen Ebenen der Beschreibung und Darstellung von Antwortrelationen kombiniert und sich an die Praktiken zur Kodierung von IBK-Korpusdokumenten innerhalb der Text Encoding Initiative (TEI), wie sie von der TEI CMC SIG definiert wurde, hält. Die Annotation von Antwortrelationen stellt eine Voraussetzung für die korrekte Identifizierung höherer interaktionaler Ebenen, wie z. B. die Klassifizierung von Dialogakten oder Baumstrukturen, dar. 
N2 - Este documento introduce tipos y capas de anotación de las relaciones de respuesta en la comunicación mediada por ordenador (CMC). Las relaciones de respuesta se mantienen entre las unidades de mensaje de las interacciones de CMC y describen referencias de un mensaje dado a un mensaje anterior. Clasificamos tres tipos de relaciones de respuesta en las interacciones de CMC: primero, las respuestas técnicas, es decir, la posibilidad de responder directamente a un mensaje anterior usando el botón “responder”; segundo, hendiduras, por ejemplo, en las páginas de discusión de Wikipedia en las que los usuarios insertan sus contribuciones en la página de conversación existente al indentarlos, y la tercera, relaciones interpretativas de respuesta, es decir, la acción de respuesta no se realiza formalmente, sino que se señala por otros medios estructurales o lingüísticos, como los marcadores de dirección ‘@’, saludos, citas y/o estructuras de pregunta y respuesta. Vamos a mirar a las prácticas existentes en la descripción y representación de tales relaciones en los corpus y ejemplos de chat, páginas de discusión de Wikipedia, Twitter y blogs. A continuación, proporcionamos una propuesta de anotación que combina los diferentes niveles de descripción y representación de las relaciones de respuesta y que se adhiere a los esquemas y prácticas para codificar documentos de corpus CMC dentro del marco TEI, tal como se define en el TEI CMC SIG. Esto forma un prerrequisito para identificar correctamente los niveles más elevados de relaciones interaccionales, como los actos de diálogo o los árboles de discusión. 
KW - reply relations KW - corpus annotation KW - computer-mediated communication (CMC) KW - Text Encoding Initiative (TEI) KW - Antwortrelationen KW - Antwortstrukturen KW - Korpusannotation KW - internetbasierte Kommunikation (IBK) KW - relaciones de respuesta KW - anotación de corpus KW - comunicación mediada por computadora (CMC) KW - Computerunterstützte Kommunikation KW - Korpus KW - Annotation Y1 - 2019 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-92645 SN - 2192-9521 SS - 2192-9521 U6 - https://dx.doi.org/10.1515/eujal-2019-0006 DO - https://dx.doi.org/10.1515/eujal-2019-0006 N1 - Dieser Beitrag ist mit Zustimmung des Rechteinhabers aufgrund einer (DFG geförderten) Allianz- bzw. Nationallizenz frei zugänglich. This publication is with permission of the rights owner freely accessible due to an Alliance licence and a national licence (funded by the DFG, German Research Foundation) respectively. VL - 7 IS - 2 SP - 305 EP - 331 PB - de Gruyter CY - Berlin [u.a.] ER - TY - CHAP U1 - Konferenzveröffentlichung A1 - Beißwenger, Michael A1 - Herzberg, Laura A1 - Lüngen, Harald A1 - Wigham, Ciara R. ED - Longhi, Julien ED - Marinica, Claudia T1 - cmc-core: a basic schema for encoding CMC corpora in TEI T2 - Proceedings of the 7th Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora2019) 9-10 September 2019. Cergy-Pontoise University, France N2 - Since 2013 representatives of several French and German CMC corpus projects have developed three customizations of the TEI-P5 standard for text encoding in order to adapt the encoding schema and models provided by the TEI to the structural peculiarities of CMC discourse. Based on the three schema versions, a 4th version has been created which takes into account the experiences from encoding our corpora and which is specifically designed for the submission of a feature request to the TEI council. 
On our poster we would present the structure of this schema and its relations (commonalities and differences) to the previous schemas. KW - CMC KW - cmc corpora KW - standard KW - TEI KW - Korpus KW - Text Encoding Initiative KW - Deutsch KW - Englisch Y1 - 2019 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-92668 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-92668 UR - https://cmccorpora19.sciencesconf.org/ UR - https://cmccorpora19.sciencesconf.org/data/pages/proceedingsCMC_Corpora2019.pdf SP - 74 EP - 75 PB - Cergy-Pontoise University, France CY - Cergy-Pontoise, France ER - TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Lüngen, Harald A1 - Kupietz, Marc T1 - Das Deutsche Referenzkorpus DEREKO im Jubiläumsjahr 2014 JF - Sprachreport KW - Institut für Deutsche Sprache KW - Korpus KW - Deutsches Referenzkorpus (DeReKo) Y1 - 2014 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-30396 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-30396 SN - 0178-644X SS - 0178-644X VL - 30 IS - 3 SP - 24 EP - 26 PB - Institut für Deutsche Sprache CY - Mannheim ER - TY - JOUR U1 - Zeitschriftenartikel, wissenschaftlich - begutachtet (reviewed) A1 - Lüngen, Harald T1 - DeReKo-Archiv jetzt mit fünf Milliarden Textwörtern BT - Zum größten digitalen Textarchiv für deutsche Texte der Gegenwart JF - Sprachreport : Informationen und Meinungen zur deutschen Sprache KW - Deutsch KW - Korpus KW - Textkorpus KW - Deutsches Referenzkorpus (DeReKo) KW - Institut für Deutsche Sprache Y1 - 2012 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-35328 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-35328 SN - 0178-644X SS - 0178-644X VL - 28 IS - 1 SP - 26 EP - 26 ER - TY - CHAP U1 - Buchbeitrag A1 - Lüngen, Harald A1 - Kupietz, Marc ED - Marx, Konstanze ED - Lobin, Henning ED - Schmidt, Axel T1 - IBK- und Social Media-Korpora am Leibniz-Institut für Deutsche 
Sprache T2 - Deutsch in sozialen Medien: Interaktiv – multimodal – vielfältig N2 - This chapter examines existing solutions and new possibilities for extending the German Reference Corpus (DEREKO) with data from social media and internet-based communication (IBK). DEREKO is a collection of corpora of contemporary written German at the IDS, which is made available to the linguistic community via the corpus interfaces COSMAS II and KorAP. Using definitions and examples, we first discuss the extensions and overlaps of the concepts social media, internet-based communication, and computer-mediated communication. We consider the legal prerequisites for extending a corpus with social media data, which arise from German copyright law, recently reformed in relevant points, and from personality rights such as the European General Data Protection Regulation, and we present the consequences as well as possible and actual implementations. Building social media corpora from large amounts of text also poses corpus-technological challenges that were considered solved for traditional written corpora or never arose in the first place. We report how questions of data preparation, corpus encoding, anonymization, and linguistic annotation of social media corpora have been addressed for DEREKO and which challenges remain. We survey the landscape of available German-language IBK and social media corpora and give an overview of the IBK and social media corpora held in DEREKO and their characteristics (chat, wiki talk, and forum corpora) as well as of ongoing projects in this area. Using corpus-linguistic micro- and macro-analyses of Wikipedia talk pages in comparison with DEREKO as a whole, we identify characteristic linguistic properties of Wikipedia talk pages and assess their status as representatives of IBK corpora. 
T3 - Jahrbuch / Leibniz-Institut für Deutsche Sprache (IDS) - 2019 KW - Deutsch KW - Soziale Medien KW - Leibniz-Institut für Deutsche Sprache (IDS) KW - Korpus KW - Internetkommunikation Y1 - 2020 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-97052 SN - 978-3-11-067886-4 SB - 978-3-11-067886-4 SN - 9783110679885 (Online) SB - 9783110679885 (Online) U6 - https://dx.doi.org/10.1515/9783110679885-016 DO - https://dx.doi.org/10.1515/9783110679885-016 SP - 319 EP - 342 PB - de Gruyter CY - Berlin [u.a.] ER - TY - GEN U1 - Sonstiges A1 - Arnold, Denis A1 - Fankhauser, Peter A1 - Fisseni, Bernhard A1 - Kupietz, Marc A1 - Lüngen, Harald A1 - Schmidt, Thomas A1 - Witt, Andreas T1 - Datenübernahmerichtlinien des Leibniz-Instituts für Deutsche Sprache KW - Datenschutzrichtlinie KW - Datenschutz KW - Forschungsdaten KW - Korpus Y1 - 2019 UN - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:mh39-87919 UR - https://repos.ids-mannheim.de/resources/LZA_IDS_Depositing_Policy.pdf U6 - https://dx.doi.org/10.14618/ids-pub-8791 DO - https://dx.doi.org/10.14618/ids-pub-8791 SP - 10 S1 - 10 PB - Leibniz-Institut für Deutsche Sprache (IDS) CY - Mannheim ET - Version 24.01.2019 ER -