We present a supervised machine learning system for author name disambiguation (AND) that tackles semantic similarity between publication titles by means of word embeddings. Word embeddings are integrated as external components, which keeps the model small and efficient while allowing for easy extensibility and domain adaptation. Initial experiments show that word embeddings can improve the recall and F-score of the binary classification sub-task of AND. Results for the clustering sub-task are less clear, but also promising, and overall show the feasibility of the approach.
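The abstract does not say how title similarity is computed from the embeddings. As a rough illustration only, one common approach is to average the word vectors of each title and compare the averages by cosine similarity. The sketch below assumes that approach; the tiny `TOY_EMBEDDINGS` table and the function names are invented for this example and are not taken from the paper.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
# A real system would load pretrained vectors as an external resource.
TOY_EMBEDDINGS = {
    "neural": [0.9, 0.1, 0.0],
    "deep": [0.8, 0.2, 0.1],
    "networks": [0.7, 0.3, 0.0],
    "learning": [0.6, 0.4, 0.1],
    "algebraic": [0.0, 0.2, 0.9],
    "topology": [0.1, 0.1, 0.8],
}


def title_vector(title, embeddings):
    """Average the vectors of all title words found in the embedding table."""
    vectors = [embeddings[w] for w in title.lower().split() if w in embeddings]
    if not vectors:
        return None  # no known word in this title
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def title_similarity(title_a, title_b, embeddings=TOY_EMBEDDINGS):
    """Similarity feature for a pair of publication titles."""
    u = title_vector(title_a, embeddings)
    v = title_vector(title_b, embeddings)
    if u is None or v is None:
        return 0.0  # assumption: unknown titles contribute no evidence
    return cosine(u, v)


if __name__ == "__main__":
    print(title_similarity("Deep learning", "Neural networks"))
    print(title_similarity("Deep learning", "Algebraic topology"))
```

In a pairwise AND classifier, a score like this could serve as one input feature alongside name-based and co-author features. Because the embedding table is passed in as an argument, it can be swapped for domain-specific vectors without changing the rest of the code.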
Data sets of publication metadata with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far and compare their respective advantages and shortcomings. From the results of this review, we derive a set of general requirements for future AND data sets. These include both trivial requirements, like absence of errors and preservation of author order, and more substantial ones, like full disambiguation and adequate representation of publications with a small number of authors and highly variable author names. On the basis of these requirements, we create and make publicly available a new AND data set, SCAD-zbMATH. Both the quantitative analysis of this data set and the results of our initial AND experiments with a naive baseline algorithm show the SCAD-zbMATH data set to be considerably different from existing ones. We consider it a useful new resource that will challenge the state of the art in AND and benefit the AND research community.