A constructicon, i.e., a structured inventory of constructions, essentially aims at documenting the functions of lexical and grammatical constructions. Among other parameters, so-called constructional collo-profiles, as introduced by Herbst (2018, 2020), are conclusive for determining constructional meanings. They provide information on how relevant individual words are for construction slots, hint at usage preferences of constructions, and serve as a helpful indicator of semantic peculiarities of constructions. However, even though collo-profiles constitute an indispensable component of constructicon entries, they pose major challenges for constructicographers: for a constructicographic enterprise, it is not feasible to conduct collostructional analyses for hundreds or even thousands of constructions. In this article, we introduce a procedure based on the large language model BERT that makes it possible to predict collo-profiles without having to extensively annotate instances of constructions in a given corpus. Specifically, by discussing the constructions X macht Y ADJP (‘X makes Y ADJ’, e.g., he drives him crazy) and N1 PREP N1 (e.g., bumper to bumper, constructions over constructions), we show how the developed automated system generates collo-profiles based on a limited number of annotated instances. Finally, we place collo-profiles alongside other dimensions of constructional meanings included in the German Constructicon.
Automatic summarization systems are usually trained and evaluated in a particular domain with fixed data sets. When such a system is to be applied to slightly different input, labor- and cost-intensive annotations have to be created to retrain the system. We address this problem by providing users with a GUI that allows them to correct imperfect, automatically produced summaries. Each corrected summary is in turn added to the pool of training data. The performance of the system is expected to improve as it adapts to the new domain.
In this paper we present work on developing a computerized grammar for the Latin language. The work demonstrates the principles and challenges of developing a grammar for a natural language in a modern grammar formalism. The grammar presented here provides a useful resource for natural language processing applications in different fields. It can easily be adopted for language learning and for use in language technology for cultural heritage, such as translation applications or support for the post-correction of digitized documents.
We present a supervised machine learning system for author name disambiguation (AND) which tackles semantic similarity between publication titles by means of word embeddings. Word embeddings are integrated as external components, which keeps the model small and efficient while allowing for easy extensibility and domain adaptation. Initial experiments show that word embeddings can improve the recall and F-score of the binary classification sub-task of AND. Results for the clustering sub-task are less clear but also promising, and overall they show the feasibility of the approach.
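As a rough illustration of how word embeddings can yield a title-similarity signal, the following minimal sketch averages word vectors per title and compares the results by cosine similarity. The tiny `EMBEDDINGS` table, the function names, and the averaging strategy are assumptions made for this example only; the abstract does not specify how the system combines embeddings, and in a supervised setting such a score would be just one feature passed to a classifier.

```python
import math
from typing import Dict, List, Optional

# Hypothetical toy embeddings (3 dimensions); a real system would load
# pretrained vectors as an external component.
EMBEDDINGS: Dict[str, List[float]] = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "deep": [0.85, 0.15, 0.05],
    "learning": [0.7, 0.3, 0.1],
    "medieval": [0.0, 0.9, 0.4],
    "poetry": [0.1, 0.8, 0.5],
}


def title_vector(title: str, embeddings: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of all known words in a title (None if no word is known)."""
    vectors = [embeddings[w] for w in title.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def title_similarity(t1: str, t2: str, embeddings: Dict[str, List[float]]) -> float:
    """Similarity feature for a pair of publication titles."""
    v1, v2 = title_vector(t1, embeddings), title_vector(t2, embeddings)
    if v1 is None or v2 is None:
        return 0.0  # no embedding coverage: fall back to a neutral score
    return cosine(v1, v2)
```

With these toy vectors, `title_similarity("Deep learning", "Neural network", EMBEDDINGS)` comes out higher than the score for "Deep learning" versus "Medieval poetry", which is the kind of signal a binary same-author classifier could exploit.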
In this contribution we present some work of the European R&D project “LIRICS” and of the ISO/TC 37/SC 4 committee related to the interoperability and re-use of language resources. We introduce some basic mechanisms of the standardization work in ISO and describe in more detail the general approach to the annotation of language data within ISO.
We present web services implementing a workflow for transcripts of spoken language following TEI guidelines, in particular ISO 24624:2016 "Language resource management - Transcription of spoken language". The web services are available on our website and will be available via the CLARIN infrastructure, including the Virtual Language Observatory and WebLicht.
While good results have been achieved for named entity recognition (NER) in supervised settings, it remains a problem that little or no labelled data is available for low-resource languages and less studied domains. As NER is a crucial preprocessing step for many natural language processing tasks, finding a way to overcome this lack of data remains of great interest. We propose a distant supervision approach to NER that is both language- and domain-independent, in which we automatically generate labelled training data using gazetteers that we previously extracted from Wikipedia. We test our approach on English, German and Estonian data sets and contribute further by introducing several successful methods to reduce the noise in the generated training data. The tested models beat baseline systems, and our results show that distant supervision can be a promising approach for NER when no labelled data is available. For the English model, we also show that the distant supervision model is better at generalizing within the same domain of news texts by comparing it against a supervised model on a different test set.
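The core idea of generating labelled data from gazetteers can be sketched as follows. This is a minimal illustration only, assuming a greedy longest-match strategy and BIO tags; the toy `GAZETTEER`, the function name `distant_labels`, and the matching strategy are hypothetical and not taken from the paper, whose gazetteers come from Wikipedia and whose noise-reduction methods are not shown here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer mapping token sequences to entity types.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Angela", "Merkel"): "PER",
    ("Tallinn",): "LOC",
    ("European", "Union"): "ORG",
}


def distant_labels(tokens: List[str], gazetteer: Dict[Tuple[str, ...], str]) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup in the gazetteer."""
    max_len = max((len(entry) for entry in gazetteer), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in gazetteer:
                entity_type = gazetteer[span]
                labels[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + entity_type
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels
```

Sentences labelled this way can serve as training data for a standard sequence tagger. Exact-match lookup of this kind both misses unlisted entities and mislabels ambiguous strings, which is why noise reduction matters in such a setup.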
Opinion holder extraction is one of the most important tasks in sentiment analysis. We briefly outline the importance of predicates for this task and categorize them according to part of speech and according to the semantic role they select for the opinion holder. For many languages, there are no semantic resources from which such predicates can be easily extracted. Therefore, we present alternative corpus-based methods to acquire such predicates automatically, including the use of prototypical opinion holders, i.e., common nouns denoting, for example, experts or analysts, which describe particular groups of people whose profession or occupation is to form and express opinions towards specific items.