Measuring the relatedness between scientific publications using controlled vocabularies
Emil Dolmer Alnor

TL;DR
This paper compares three methods for measuring relatedness between scientific publications using controlled vocabularies and finds that soft cosine is more accurate than traditional cosine similarity.
Contribution
Introduces two alternative methods, soft cosine and maximum term similarities, that account for the semantic similarity between non-matching controlled-vocabulary terms.
Findings
Soft cosine is the most accurate method tested.
Traditional cosine similarity is less accurate than the new methods.
Results have implications for bibliometric analyses using controlled vocabularies.
Abstract
Measuring the relatedness between scientific publications is essential in many areas of bibliometrics and science policy. Controlled vocabularies provide a promising basis for measuring relatedness and are widely used in combination with Salton's cosine similarity. The latter is problematic because it only considers exact matches between terms. This article introduces two alternative methods - soft cosine and maximum term similarities - that account for the semantic similarity between non-matching terms. The article compares the accuracy of all three methods using the assignment of publications to topics in the TREC 2006 Genomics Track and the assumption that accurate relatedness measures should assign high relatedness scores to publication pairs within the same topic and low scores to pairs from separate topics. Results show that soft cosine is the most accurate method, while the most…
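To make the contrast between exact-match cosine and soft cosine concrete, the following is a minimal sketch using the standard soft cosine formula, in which a term-to-term similarity function weights every pair of terms. It is not the article's implementation: the term names, the weights, the 0.8 similarity value, and the helper names (`soft_cosine`, `exact_match`, `toy_sim`) are all hypothetical and chosen only for illustration.

```python
import math

def soft_cosine(a, b, sim):
    """Soft cosine between two publications.

    a, b: dicts mapping a vocabulary term to its weight.
    sim:  function (term, term) -> similarity in [0, 1], with sim(t, t) == 1.
    """
    def form(x, y):
        # Sum over all term pairs, weighted by their similarity.
        return sum(wx * wy * sim(t, u)
                   for t, wx in x.items()
                   for u, wy in y.items())

    denom = math.sqrt(form(a, a) * form(b, b))
    return form(a, b) / denom if denom else 0.0

def exact_match(t, u):
    # With this similarity, soft cosine reduces to ordinary cosine.
    return 1.0 if t == u else 0.0

# Hypothetical similarity between two related terms (assumed value).
TOY_SIM = {frozenset(("Neoplasms", "Carcinoma")): 0.8}

def toy_sim(t, u):
    if t == u:
        return 1.0
    return TOY_SIM.get(frozenset((t, u)), 0.0)

# Two toy publications sharing one term and having one related term each.
pub_a = {"Neoplasms": 1.0, "Mice": 1.0}
pub_b = {"Carcinoma": 1.0, "Mice": 1.0}

cosine_score = soft_cosine(pub_a, pub_b, exact_match)
soft_score = soft_cosine(pub_a, pub_b, toy_sim)
print(cosine_score, soft_score)
```

In this toy case the exact-match score only credits the shared term, while the soft variant also credits the related but non-matching pair, so it scores the two publications as more closely related.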
Taxonomy
Topics: Scientometrics and bibliometrics research · Biomedical Text Mining and Ontologies · Computational and Text Analysis Methods
