Detecting Media Clones in Cultural Repositories Using a Positive Unlabeled Learning Approach
V. Sevetlidis, V. Arampatzakis, M. Karta, I. Mourthos, D. Tsiafaki, G. Pavlidis

TL;DR
This paper introduces a positive-unlabeled (PU) learning method for detecting duplicate media in cultural repositories. It needs no explicit negative examples, exposes an interpretable decision threshold, and outperforms the best baseline (SVDD) in F1.
Contribution
It formulates curator-in-the-loop duplicate detection as a PU learning problem, evaluates it on the CIFAR-10 and AtticPOT datasets, and reports a +7.70-point F1 gain over the best baseline (SVDD).
Findings
Achieved F1=96.37 (AUROC=97.97) on CIFAR-10 and F1=90.79 (AUROC=98.99) on AtticPOT
Improved F1 on AtticPOT by +7.70 points over the best baseline (SVDD) under the same lightweight backbone
Provides an interpretable threshold and avoids explicit negatives
Abstract
We formulate curator-in-the-loop duplicate discovery in the AtticPOT repository as a Positive-Unlabeled (PU) learning problem. Given a single anchor per artefact, we train a lightweight per-query Clone Encoder on augmented views of the anchor and score the unlabeled repository with an interpretable threshold on the latent l_2 norm. The system proposes candidates for curator verification, uncovering cross-record duplicates that were not verified a priori. On CIFAR-10 we obtain F1=96.37 (AUROC=97.97); on AtticPOT we reach F1=90.79 (AUROC=98.99), improving F1 by +7.70 points over the best baseline (SVDD) under the same lightweight backbone. Qualitative "find-similar" panels show stable neighbourhoods across viewpoint and condition. The method avoids explicit negatives, offers a transparent operating point, and fits de-duplication, record linkage, and curator-in-the-loop workflows.
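The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the Clone Encoder has already mapped images to latent vectors, that clones of the anchor land close to the origin (so a small l_2 norm means "likely duplicate"), and that the threshold is taken as a quantile of the norms of the anchor's augmented views. None of these details are stated in the abstract, and the function names and the quantile parameter are hypothetical.

```python
import numpy as np


def latent_norms(latents):
    """Return the l_2 norm of each row (one latent vector per item)."""
    return np.linalg.norm(np.asarray(latents, dtype=float), axis=1)


def pick_threshold(anchor_view_latents, quantile=0.95):
    """ASSUMPTION: the threshold is a quantile of the norms of the
    anchor's augmented views; the paper may choose it differently."""
    return float(np.quantile(latent_norms(anchor_view_latents), quantile))


def propose_candidates(repo_latents, tau):
    """Return repository indices whose latent norm is <= tau,
    ordered from smallest norm (most clone-like, by assumption)."""
    norms = latent_norms(repo_latents)
    keep = np.flatnonzero(norms <= tau)
    return keep[np.argsort(norms[keep])]


# Toy latents standing in for encoder outputs (made-up values).
anchor_views = np.array([[0.1, 0.0], [0.0, 0.2], [0.3, 0.0]])
repository = np.array([[3.0, 4.0], [0.2, 0.0], [0.0, 0.1], [1.0, 1.0]])

tau = pick_threshold(anchor_views, quantile=1.0)
candidates = propose_candidates(repository, tau)
print(f"threshold={tau:.3f}, candidates for curator review: {candidates.tolist()}")
```

In a curator-in-the-loop workflow, the returned indices would be shown to a curator for verification rather than merged automatically. The threshold stays interpretable because it is a single distance in latent space.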
