(Semi)automated disambiguation of scholarly repositories
Miriam Baglioni, Andrea Mannocci, Gina Pavone, Michele De Bonis, and Paolo Manghi

TL;DR
This paper presents a semi-automated method combining claim set analysis and clustering algorithms, validated manually, to de-duplicate scholarly repositories across multiple registries, improving data accuracy and reducing redundancy.
Contribution
It introduces a novel approach that integrates claim set merging with automated clustering, enhanced by manual validation, to improve repository deduplication accuracy.
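One way to picture claim set merging is as finding connected components: each equivalence claim links two repository profiles, and every chain of linked profiles collapses into one claim set. The sketch below is only an illustration under that assumption and is not the authors' implementation. The function name `merge_claims` and the identifiers (`regA:1`, `regB:7`, and so on) are hypothetical. Pairs suggested by clustering and confirmed by manual validation could be fed in as additional claims in the same way.

```python
from collections import defaultdict


def merge_claims(claims):
    """Group identifiers into claim sets (connected components).

    `claims` is an iterable of (id_a, id_b) pairs, each asserting that
    two repository profiles describe the same repository. This is a
    hypothetical helper, not part of the paper's published tooling.
    """
    parent = {}

    def find(x):
        # Return the representative of x's set, registering x if unseen.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in claims:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two sets

    groups = defaultdict(set)
    for identifier in list(parent):
        groups[find(identifier)].add(identifier)

    # Sort for a deterministic, readable result.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical cross-registry claims, plus one validated clustering match.
registry_claims = [("regA:1", "regB:7"), ("regB:7", "regC:3")]
validated_cluster_pairs = [("regA:2", "regC:9")]
claim_sets = merge_claims(registry_claims + validated_cluster_pairs)
print(claim_sets)
```

Because the merge is transitive, `regA:1` and `regC:3` end up in the same claim set even though no registry claims that pair directly. The same transitivity means a single wrong claim can join two unrelated sets, which is one reason a manual validation step is useful.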
Findings
Created a highly accurate de-duplicated repository dataset
Extended claim sets with clustering results for better deduplication
Reduced information fragmentation and redundancy in scholarly repositories
Abstract
The full exploitation of scholarly repositories is pivotal in modern Open Science, and scholarly repository registries are kingpins in enabling researchers and research infrastructures to list and search for suitable repositories. However, since multiple registries exist, repository managers are keen on registering multiple times the repositories they manage to maximise their traction and visibility across different research communities, disciplines, and applications. These multiple registrations ultimately lead to information fragmentation and redundancy on the one hand and, on the other, force registries' users to juggle multiple registries, profiles and identifiers describing the same repository. Such problems are known to registries, which claim equivalence between repository profiles whenever possible by cross-referencing their identifiers across different registries. However, as…
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Semantic Web and Ontologies
