Lecture Notes in Computer Science
10450
We present a supervised machine learning system for author name disambiguation (AND) which tackles semantic similarity between publication titles by means of word embeddings. Word embeddings are integrated as external components, which keeps the model small and efficient while allowing for easy extensibility and domain adaptation. Initial experiments show that word embeddings can improve the recall and F-score of the binary classification sub-task of AND. Results for the clustering sub-task are less clear, but they are also promising and, overall, show the feasibility of the approach.
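One common way to compare title semantics with word embeddings is to average the vectors of the title tokens and take the cosine of the two averages. The following is a minimal sketch of that idea, not the system described above: the three-dimensional vectors in `TOY_EMBEDDINGS` and all function names are invented for illustration, and a real setup would load pretrained embeddings as an external resource.

```python
import math

# Toy 3-dimensional embeddings, invented purely for illustration.
TOY_EMBEDDINGS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "deep": [0.85, 0.15, 0.05],
    "learning": [0.7, 0.3, 0.1],
    "medieval": [0.0, 0.1, 0.9],
    "poetry": [0.1, 0.0, 0.95],
}


def title_vector(title, embeddings):
    """Average the vectors of all in-vocabulary tokens; None if none are known."""
    vectors = [embeddings[t] for t in title.lower().split() if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def title_similarity(title_a, title_b, embeddings=TOY_EMBEDDINGS):
    """Similarity of two titles; 0.0 if either has no known token."""
    va = title_vector(title_a, embeddings)
    vb = title_vector(title_b, embeddings)
    if va is None or vb is None:
        return 0.0
    return cosine(va, vb)
```

A score of this kind could serve as one feature among others (for example, shared co-authors) in a pairwise classifier that decides whether two publications belong to the same author.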
11799
The demo presents a minimalist, off-the-shelf AND tool which provides a fundamental AND operation, the comparison of two publications with ambiguous authors, through an easily accessible HTTP interface. The tool implements this operation using standard AND functionality, but places particular emphasis on advanced methods from natural language processing (NLP) for comparing the semantics of publication titles.
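To make the idea of a pairwise comparison behind an HTTP interface concrete, here is a small sketch using only the Python standard library. It is not the tool's actual API: the JSON fields (`a`, `b`, `title`, `coauthors`), the feature names, and the use of token overlap in place of embedding-based title similarity are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler


def jaccard(a, b):
    """Jaccard overlap of two token collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def compare(pub_a, pub_b):
    """Return simple pairwise features for two publication records.

    The record layout is hypothetical: a dict with a "title" string and an
    optional "coauthors" list.
    """
    return {
        "title_overlap": jaccard(
            pub_a["title"].lower().split(), pub_b["title"].lower().split()
        ),
        "shared_coauthors": sorted(
            set(pub_a.get("coauthors", [])) & set(pub_b.get("coauthors", []))
        ),
    }


class CompareHandler(BaseHTTPRequestHandler):
    """Answers POST requests whose JSON body holds two records, "a" and "b"."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(compare(payload["a"], payload["b"])).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Such a handler could be served locally with `http.server.HTTPServer(("localhost", 8000), CompareHandler).serve_forever()`; the host and port are arbitrary choices.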
7295
The ISOcat registry reloaded
(2012)
The linguistics community is building a metadata-based infrastructure for the description of its research data and tools. At its core is the ISOcat registry, a collaborative platform to hold a (to be standardized) set of data categories (i.e., field descriptors). Descriptors have definitions in natural language and few explicit interrelations. With the registry growing to many hundreds of entries, authored by many contributors, it is becoming increasingly apparent that the rather informal definitions and their glossary-like design make it hard for users to grasp, exploit, and manage the registry’s content. In this paper, we take a large subset of the ISOcat term set and reconstruct from it a tree structure, following in the footsteps of schema.org. Our ontological re-engineering yields a representation that gives users a hierarchical view of linguistic, metadata-related terminology. The new representation adds to the precision of all definitions by making explicit information that is only implicitly given in the ISOcat registry. It also helps to uncover and address potential inconsistencies in term definitions as well as gaps and redundancies in the overall ISOcat term set. The new representation can serve as a complement to the existing ISOcat model, providing additional support for authors and users in browsing, (re-)using, maintaining, and further extending the community’s terminological metadata repertoire.
11456
German subjectively veridical sicher sein ‘be certain’ can embed ob-clauses in negative contexts, while subjectively veridical glauben ‘believe’ and nonveridical möglich sein ‘be possible’ cannot. The Logical Form (LF) of ‘F isn’t certain if M is in Rome’ is regarded as the negated disjunction of two sentences, ¬(cf σ ∨ cf ¬σ), or equivalently ¬cf σ ∧ ¬cf ¬σ. Be certain can have this LF because ¬cf σ and ¬cf ¬σ are compatible and nonveridical. Believe excludes this LF because ¬bf σ and ¬bf ¬σ are incompatible in a question-under-discussion context. It follows from this incompatibility and from the incompatibility of bf σ and bf ¬σ that bf ¬σ and ¬bf σ are equivalent. Therefore believe cannot be nonveridical. Be possible doesn’t allow the LF either. As with believe, ¬pf σ and ¬pf ¬σ are incompatible. But unlike with believe, pf σ and pf ¬σ are compatible.
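The two propositional steps in this argument can be spelled out as follows. This is a reconstruction under one assumption: "incompatible" is read as "cannot both be true" in the relevant context, so each incompatibility is rendered as a negated conjunction.

```latex
\begin{align*}
% LF of the negated 'be certain' sentence (De Morgan):
\neg(\mathrm{cf}\,\sigma \lor \mathrm{cf}\,\neg\sigma)
  &\;\equiv\; \neg\mathrm{cf}\,\sigma \land \neg\mathrm{cf}\,\neg\sigma \\[4pt]
% Premise 1: \neg bf \sigma and \neg bf \neg\sigma are incompatible:
\neg(\neg\mathrm{bf}\,\sigma \land \neg\mathrm{bf}\,\neg\sigma)
  &\;\equiv\; \mathrm{bf}\,\sigma \lor \mathrm{bf}\,\neg\sigma \\
% Premise 2: bf \sigma and bf \neg\sigma are incompatible:
\neg(\mathrm{bf}\,\sigma \land \mathrm{bf}\,\neg\sigma)
  &\;\equiv\; \neg\mathrm{bf}\,\sigma \lor \neg\mathrm{bf}\,\neg\sigma \\[4pt]
% From premise 2: bf \neg\sigma implies \neg bf \sigma.
% From premise 1: \neg bf \sigma implies bf \neg\sigma.
\text{hence}\quad \mathrm{bf}\,\neg\sigma &\;\leftrightarrow\; \neg\mathrm{bf}\,\sigma
\end{align*}
```

For be possible only the first premise has a counterpart (¬pf σ and ¬pf ¬σ are incompatible), so the corresponding equivalence does not follow.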
10415
We present a method to identify and document a phenomenon on which there is very little empirical data: German phrasal compounds occurring as a single token (without punctuation between their components). Because it relies on linguistic criteria, our approach requires an operational notion of compounds that can be applied systematically, as well as (web) corpora that are large and diverse enough to contain rarely seen phenomena. The method is based on word segmentation and morphological analysis, and it takes advantage of a data-driven learning process. Our results show that coarse-grained identification of phrasal compounds is best performed with empirical data, whereas fine-grained detection could be improved with a combination of rule-based and frequency-based word lists. Along with the characteristics of web texts, the orthographic realizations seem to be linked to the degree of expressivity.
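As a rough illustration of the word segmentation step, the sketch below splits a single token into known words with a small dynamic program and flags it as a candidate phrasal compound. It is a simplified stand-in, not the method described above: the toy lexicon, the list of function words, and the flagging heuristic (three or more parts, with a function word before the head) are all assumptions, and no morphological analysis or learning is involved.

```python
def segment(token, lexicon):
    """Split token into the fewest lexicon words; return None if impossible."""
    token = token.lower()
    n = len(token)
    # best[i] holds the shortest known segmentation of token[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in lexicon:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


def looks_like_phrasal_compound(token, lexicon, function_words):
    """Hypothetical heuristic: at least three parts, one non-final function word."""
    parts = segment(token, lexicon)
    return (
        parts is not None
        and len(parts) >= 3
        and any(p in function_words for p in parts[:-1])
    )


# Toy resources, invented for illustration only.
TOY_LEXICON = {"kopf", "in", "den", "sand", "politik", "haus", "tür"}
TOY_FUNCTION_WORDS = {"in", "den"}
```

With these toy resources, `Kopfindensandpolitik` is split into five parts and flagged, while an ordinary compound such as `Haustür` is split into two parts and not flagged.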