Refine
Year of publication
Document Type
- Working Paper (26)
Language
- English (26)
Keywords
- Corpus <Linguistics> (11)
- German (5)
- Research data (5)
- CLARIN (4)
- Infrastructure (4)
- Spoken language (3)
- Segmentation (3)
- Computational linguistics (2)
- DARIAH (2)
- Frame semantics (2)
Publication state
- Published version (17)
- Preprint (1)
Review state
- (Publisher's) editorial review (7)
- Peer review (3)
Publisher
- Zenodo (4)
- Institut für Deutsche Sprache (3)
- CLARIN Legal and Ethical Issues Committee (CLIC) (2)
- Institut für Phonetik und Sprachliche Kommunikation, Ludwig Maximilians Universität München (2)
- International Computer Science Institute (2)
- AIFB (1)
- DYLAN Project (1)
- Freie Universität Berlin (1)
- GOEDOC, Dokumenten- und Publikationsserver der Georg-August-Universität (1)
- Heinrich-Heine-Universität (1)
This introductory tutorial describes a strictly corpus-driven approach for uncovering indications of aspects of use of lexical items. These aspects include ‘(lexical) meaning’ in a very broad sense and involve different dimensions; they are established in and emerge from the respective discourses. Using data-driven mathematical-statistical methods with minimal (linguistic) premises, a word’s usage spectrum is summarized as a collocation profile. Self-organizing methods are applied to visualize the complex similarity structure spanned by these profiles. These visualizations point to the typical aspects of a word’s use, and to the common and distinctive aspects of any two words.
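As a rough illustration of the general idea only (not the authors' actual method, corpora, or association measure), the following Python sketch builds raw collocation profiles from a token stream and compares two profiles by cosine similarity; the window size and function names are illustrative assumptions:

```python
from collections import Counter, defaultdict
from math import sqrt

def collocation_profiles(tokens, window=2):
    """Count co-occurrences of each token with its neighbours within a
    symmetric window; returns {word: Counter(collocate -> count)}."""
    profiles = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                profiles[w][tokens[j]] += 1
    return profiles

def cosine(p, q):
    """Cosine similarity between two collocation profiles (Counters)."""
    num = sum(p[w] * q[w] for w in set(p) & set(q))
    den = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return num / den if den else 0.0
```

In a realistic setting the raw counts would be replaced by an association score (e.g. log-likelihood) before comparison; the similarity matrix over all profiles is what a self-organizing visualization would then arrange.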
In this paper, a method for measuring synchronic corpus (dis-)similarity put forward by Kilgarriff (2001) is adapted and extended to identify trends and correlated changes in diachronic text data, using the Corpus of Historical American English (Davies 2010a) and the Google Ngram Corpora (Michel et al. 2010a). This paper shows that this fully data-driven method, which extracts word types that have undergone the most pronounced change in frequency in a given period of time, is computationally very cheap and that it allows interpretations of diachronic trends that are both intuitively plausible and motivated from the perspective of information theory. Furthermore, it demonstrates that the method is able to identify correlated linguistic changes and diachronic shifts that can be linked to historical events. Finally, it can help to improve diachronic POS tagging and complement existing NLP approaches. This indicates that the approach can facilitate an improved understanding of diachronic processes in language change.
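The exact measure is defined in Kilgarriff (2001) and the paper itself; as a hedged sketch of the general technique, the following Python snippet scores word types by a log-likelihood (G2) measure of frequency change between two period-specific frequency tables, so that the most changed types rank first:

```python
from math import log

def change_scores(freq_a, freq_b):
    """Rank word types by a simple log-likelihood (G2) measure of
    frequency change between two frequency dicts (period A vs. period B).
    Returns (word, score) pairs sorted by descending score."""
    n_a, n_b = sum(freq_a.values()), sum(freq_b.values())
    scores = {}
    for w in set(freq_a) | set(freq_b):
        a, b = freq_a.get(w, 0), freq_b.get(w, 0)
        # expected counts under the null hypothesis of no change
        e_a = n_a * (a + b) / (n_a + n_b)
        e_b = n_b * (a + b) / (n_a + n_b)
        g2 = 0.0
        if a:
            g2 += 2 * a * log(a / e_a)
        if b:
            g2 += 2 * b * log(b / e_b)
        scores[w] = g2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A stable high-frequency word scores near zero, while a word whose frequency rises or falls sharply between the two periods scores high, which is the kind of fully data-driven ranking the paper describes.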
CLARIAH-DE cross-service search - prospects and benefits of merging subject-specific services
(2021)
CLARIAH-DE combines services and offerings of CLARIN-D and DARIAH-DE. This includes various search applications which are made directly available to researchers. These search applications are presented in this working paper based on their main characteristics and compared with a focus on possible harmonizations. Opportunities and risks of different forms of technical integration are highlighted. Identified challenges can be explained in particular against the background of different organizational and technical frameworks as well as highly specific and discipline-dependent requirements. The integration work that has already been carried out and the experiences gained with regard to future work and the possible integration of further applications are also discussed. The experience gained in CLARIAH-DE may be of particular interest to other projects in the field of digital research infrastructures.
Collaborative work in NFDI
(2023)
The non-profit association National Research Data Infrastructure (NFDI) promotes science and research through a national research data infrastructure. Its aim is to develop and establish overarching research data management (RDM) for Germany and to increase the efficiency of the entire German science system. After a two-and-a-half-year build-up phase, the process of adding new consortia, each representing a different data domain, ended in March 2023. NFDI now has 26 disciplinary consortia (and one additional basic-services collaboration). The full extent of cross-consortial interaction is now beginning to show.
The paper deals with the process of computer-aided transcription of Arabic-German data material for interaction-based studies. First, it sheds light on some major methodological challenges posed by conversation-analytic approaches: with current corpus technology, the reciprocity, linearity, and simultaneity of linguistic activities cannot be reconstructed in an analytically proper way when Arabic characters are used in multilingual and bidirectional transcripts. The difficulty of transcribing Arabic encounters is compounded by the fact that Spoken Arabic, as well as its varieties and phenomena, has not been standardised sufficiently (for conversation-analytic purposes). The second part of this paper is therefore dedicated to preliminary, self-developed solutions, namely a systematic method for transcribing Spoken Arabic.
A topic in the field of knowledge acquisition is the reuse of components that are described at the knowledge level. Problems concern the description, indexing, and retrieval of components. In our case there is the additional feature of integrating so-called automated building blocks into a knowledge-level description. This paper describes what knowledge-level descriptions of components for reuse should look like, and proposes a way to describe assumptions and requirements that are to be made explicit. The paper extends the “normal” knowledge acquisition setting in the direction of machine learning components.
The landscape of digital lexical resources is often characterized by dedicated local portals and proprietary interfaces as primary access points for scholars and the interested public. In addition, legal and technical restrictions are potential issues that can make it difficult to efficiently query and use these valuable resources. As part of the research data consortium Text+, solutions for the storage and provision of digital language resources are being developed and provided in the context of the unified cross-domain German research data infrastructure NFDI. The specific challenge of accessing lexical resources in a diverse and heterogeneous landscape, with a variety of participating institutions and established technical solutions, is addressed by the development of the federated search and query framework LexFCS. The LexFCS extends the established CLARIN Federated Content Search, which already allows accessing spatially distributed text corpora using a common specification of technical interfaces, data formats, and query languages. This paper describes the current state of development of the LexFCS, gives an insight into its technical details, and provides an outlook on its future development.
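The actual LexFCS interfaces are defined by its specification; purely as a generic sketch of the federated fan-out pattern such a framework rests on, the snippet below sends one query to several endpoints in parallel and merges the per-endpoint result lists. The endpoint callables here are stand-ins, not the LexFCS API:

```python
from concurrent.futures import ThreadPoolExecutor

def federated_search(query, endpoints):
    """Fan a query out to several endpoint callables in parallel.
    endpoints: {name: callable(query) -> list of results}.
    Returns {name: results}, keeping per-endpoint provenance."""
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in endpoints.items()}
        return {name: f.result() for name, f in futures.items()}

# Stub endpoints standing in for distributed lexical resources (hypothetical):
endpoints = {
    "dict_a": lambda q: [f"{q}: entry from A"],
    "dict_b": lambda q: [f"{q}: entry from B"],
}
results = federated_search("Haus", endpoints)
```

Keeping results grouped by endpoint mirrors a design concern of federated search: the aggregator must not blur which participating institution supplied which entry.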
The possibilities of re-use and archiving of spoken and written corpora are affected by personality rights (depending on legal tradition also called: the right of publicity), copyright law and data protection / privacy laws. These recommendations include information about legal aspects which should be considered while creating corpora to ensure the greatest archivability and re-usability possible in compliance with current laws.
The information compiled here shall serve as a guideline for researchers who plan to create corpora or who are involved in the evaluation of such measures. This information is not exhaustive and is not to be considered legal advice. Researchers should consult institutional legal departments and management before making legally relevant decisions. That said, further legal expertise should be sought, if possible as early as the project planning phase.
KoralQuery 0.3
(2015)
KoralQuery is a general corpus query protocol (i.e. independent of research tasks and corpus formats), serialized in JSON-LD [1]. KoralQuery focuses on simplicity of implementation rather than human readability and writability. Support for a growing number of query languages is provided by the Koral serialization processor.
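The normative vocabulary is defined by the KoralQuery specification itself; purely as an illustration of what a JSON-LD-serialized query object of this kind can look like, the Python snippet below builds and serializes one. All field names, values, and the context URL are illustrative assumptions, not the actual specification:

```python
import json

# Illustrative only: mimics the shape of a JSON-LD query serialization.
# Field names and the context URL are hypothetical, not normative KoralQuery.
query = {
    "@context": "http://example.org/koral/context.jsonld",  # placeholder
    "query": {
        "@type": "koral:token",
        "wrap": {
            "@type": "koral:term",
            "layer": "lemma",      # assumed annotation layer
            "key": "Baum",
            "match": "match:eq",
        },
    },
}

serialized = json.dumps(query, indent=2)
print(serialized)
```

The point of such a serialization is that front ends for different query languages can all compile into the same machine-oriented JSON-LD structure, which back ends then evaluate uniformly.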
The General Data Protection Regulation (hereinafter: GDPR), EU Regulation 2016/679 of 27 April 2016, will become applicable on 25 May 2018 and repeal the Personal Data Directive of 24 October 1995.
Unlike a directive, which requires transposition into national laws (while leaving the choice of “forms and methods” to the Member States), a regulation is binding and directly applicable in all Member States. This means that when the GDPR becomes applicable, all the EU countries will have the same rules regarding the protection of personal data — at least in principle, since some details (including in the area of research — see below) are expressly left to the discretion of the Member States.
The GDPR is a particularly ambitious piece of legislation (consisting of 99 articles and 173 recitals) whose intended territorial scope extends beyond the borders of the European Union. Its main concepts and principles are essentially similar to those of the Personal Data Directive, but enriched with interpretation developed through the case law of the CJEU and the opinions of the Article 29 Data Protection Working Party (hereinafter: WP29).
This White Paper will discuss the main principles of data protection and their impact on language resources, as well as special rules regarding research under the GDPR and the standardisation mechanisms recognized by the Regulation.
Psych-verbs, exhibiting a wide range of construction types, challenge the assumption that verbs sharing the same theta-grid and event structure select the same case patterns. Consequently, previous research has tried to trace back the constructional variants of these verbs to differences in deep syntactic or semantic (event- or causal) structure. However, in German a construction type may exhibit semantically distinct subclasses, and a semantic subclass may have different syntactic realizations. Thus, it is difficult to find general linking patterns of psych-verbs. On the other hand, it is characteristic of psych-verbs to have non-psych-readings as well. We argue that these readings determine case selection. In the centre of our argumentation is the principle of Lexical Economy according to which entries in the mental lexicon are as simple as possible. The case frame must be compatible with all readings of the verb. Thus, the reading with the strongest thematic distinctiveness of each argument is crucial for case selection, typically a reading with a volitional agent and a strongly affected patient. Such readings impose stronger restrictions on case selection than psych-readings. The linking theory of our approach is based on Dowty’s theory of proto-roles (1991) and its modification and implementation in Optimality Theory in Primus (2002).
Mechanism-based thinking on policy diffusion. A review of current approaches in political science
(2011)
Despite theoretical and methodological progress in what is now coined the third generation of diffusion studies, explicitly dealing with the causal mechanisms underlying diffusion processes and comparatively analyzing them is only of recent date. As a matter of fact, diffusion research has ended up in a diverse and often unconnected array of theoretical assumptions relying on both rational and constructivist reasoning, a circumstance calling for more theoretical coherence and consistency. Against this backdrop, this paper reviews and streamlines the diffusion literature in political science. Diffusion mechanisms largely cluster around two causal arguments determining the desires and preferences of actors for choosing alternative policies. First, existing accounts of diffusion mechanisms can be grouped according to the rationality of policy adoption; that is, government behavior is based either on the instrumental considerations of actors or on constructivist arguments such as norms and rule-driven action. Second, diffusion mechanisms can either directly impact the beliefs of actors or influence the structural conditions for decision-making. Following this logic, four basic diffusion mechanisms can be identified in mechanism-based thinking on policy diffusion: emulation, socialization, learning, and externalities.
New KARL (Knowledge Acquisition and Representation Language) allows the specification of all parts of a problem-solving method (PSM). It is a formal language with well-defined semantics and thus allows PSMs to be represented precisely and unambiguously while abstracting from implementation detail. In this paper it is shown how the language KARL has been modified and extended to New KARL to better meet the needs of representing PSMs. Based on a conceptual structure of PSMs, new language primitives are introduced into KARL to specify such a conceptual structure and to support the configuration of methods. An important goal of this extension was to preserve three important properties of KARL: being (i) a conceptual, (ii) a formal, and (iii) an executable language.
Psych-verbs have been a touchstone for linking theories, which assume that case selection is determined by thematic roles. Though psych-verbs share the same thematic grid (experiencer and stimulus), they show different case frames. Different syntactic (structural) and semantic (event- or causal-structure) approaches exist, but at least for German we will show that none of the several approaches to psych-verbs covers all differences and similarities of the several formal classes of psych-verbs in that language. In this paper we argue that the case selection of psych-verbs does not depend on their psychical reading at all. While Functional Expressivity requires that different thematic roles be expressed by different forms, Lexical Economy states that lexical entries should be minimal, i.e. verbs should provide only one case frame. Thus, the case frame of a verb must be compatible with the thematic requirements of all readings of this verb. Researchers have paid little attention to the fact that polysemy is characteristic of psych-verbs. Psych-verbs have (or have had) other, more specific readings as well, and occasional psychical readings are possible for most verbs. According to the proto-role approach of Dowty (1991) and its modifications by Primus (1999b, 2002a, 2002b, 2002c), case selection is determined by the degree of agentivity or patientivity of arguments. Concrete readings have stronger agents and patients and therefore impose stronger restrictions on case selection, and the psychical reading of a verb is always compatible with this reading. Thus, the case selection of psych-verbs is not affected by their psychical reading.
This technology watch report discusses digital repository solutions in the context of the research infrastructure projects CLARIAH-DE, CLARIN, and DARIAH. It provides an overview of different repository systems, comparing them and discussing their respective applicability from the perspectives of the project partners at the time of writing.
Our research task consists in studying the way in which multilingual resources are mobilized in teamwork within collaborative activities, and how they are exploited specifically both to enhance collaboration and to respect the specificities of the team members’ linguistic competences and practices. Central to our analytical work, which is inspired by ethnomethodological conversation analysis, is the relationship between multilingual resources and the situated organization of linguistic uses and social practices. These two aspects are reflexively articulated: multilingual resources are shaped by the very contexts of their use, and activities are constrained and thus structured by the available resources.
Who understands Low German today and who can speak it? Who makes use of media and cultural events in Low German? What images do people in northern Germany associate with Low German and what is their view of their regional language?
These and further questions are answered in this brochure with the help of representative data collected in a telephone survey of a total of 1,632 people from eight federal states (Bremen, Hamburg, Lower Saxony, Mecklenburg-West Pomerania and Schleswig-Holstein as well as Brandenburg, North Rhine-Westphalia and Saxony-Anhalt).
The Partitur Format at BAS
(1997)
Most spoken language resources are produced and disseminated together with symbolic information relating to the speech signal, for instance orthographic transcripts as well as labeling and segmentation on the phonological, phonetic, prosodic, and phrasal levels. Most of the known formats for these symbolic data are defined in a ‘closed form’ that is not flexible enough to allow simple, platform-independent processing and easy extensions.
At the Bavarian Archive for Speech Signals (BAS), a new format has been developed and used over the last few years that shows some significant advantages over other existing formats. This paper describes the basic principles behind this format, briefly discusses its advantages, and gives detailed definitions of the description levels used so far.
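The paper itself defines the actual description levels; purely to illustrate the general idea of a line-based, tier-labeled annotation format that is easy to parse and extend, the sketch below reads lines of the hypothetical shape `TIER: index label` into per-tier lists. The syntax here is an assumption for illustration, not the actual BAS Partitur specification:

```python
def parse_tiers(lines):
    """Parse a hypothetical tier-based annotation file where each line
    has the shape 'TIER: index label'; unknown or malformed lines are
    skipped. Returns {tier_name: [(index, label), ...]}."""
    tiers = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue
        tier, rest = line.split(":", 1)
        fields = rest.split()
        if len(fields) >= 2 and fields[0].isdigit():
            tiers.setdefault(tier, []).append((int(fields[0]), " ".join(fields[1:])))
    return tiers
```

Because each tier lives on its own self-describing lines, a consumer can ignore tiers it does not know, which is one way an ‘open’ format stays extensible where a fixed ‘closed-form’ schema would break.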
This White Paper sets out commonly agreed definitions of activities of consortia within NFDI. It aims to provide a common basis for reporting and reference regarding selected questions of cross-consortial relevance in DFG’s template for the Interim Reports. The questions were prioritised by an NFDI Task Force on Evaluation and Reporting (formerly Task Force Monitoring) as a result of discussing possible answers to the DFG template. In this process, the need to agree on a generalizable meaning of terms commonly used in the context of NFDI, and of reporting in particular, was identified from cross-consortial perspectives. Questions that showed the greatest need for clarification are discussed in this White Paper. As NFDI evolves, the Task Force will likely propose further joint approaches for reporting in information infrastructures.
While each is of broad relevance, the questions addressed relate to substantially different aspects of the consortia’s work. They are thus also structured slightly differently.