Explainable Coarse-to-Fine Ancient Manuscript Duplicates Discovery
Chongsheng Zhang, Shuwen Wu, Yingqi Chen, Yi Men, Gaojuan Fan, Matthias Aßenmacher, Christian Heumann, João Gama

TL;DR
This paper presents a progressive framework combining image keypoints and text content analysis to identify ancient manuscript duplicates, achieving high accuracy and efficiency, and discovering new duplicates missed by experts.
Contribution
The work introduces a novel multi-level duplicate discovery method that integrates unsupervised image matching with semantic text analysis, improving detection of ancient manuscript duplicates.
Findings
Recall comparable to state-of-the-art methods
Highest simplified mean reciprocal rank scores among the compared methods
Discovered over 60 new Oracle Bone (OB) duplicate pairs
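As a point of reference for the ranking metric named above, here is a minimal sketch of the standard mean reciprocal rank (MRR). This summary does not define the paper's "simplified" variant, so the code shows only the textbook form; the function name and the input layout (`rankings`, `relevant`) are assumptions made for illustration.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]],  # query id -> candidates, best first (assumed layout)
    relevant: Dict[str, Set[str]],   # query id -> ids of its true duplicates (assumed layout)
) -> float:
    """Standard MRR: average of 1/rank of the first true duplicate per query."""
    if not rankings:
        return 0.0
    total = 0.0
    for query, ranked in rankings.items():
        truths = relevant.get(query, set())
        for position, candidate in enumerate(ranked, start=1):
            if candidate in truths:
                total += 1.0 / position
                break  # only the first hit counts; a query with no hit adds 0
    return total / len(rankings)
```

A ranking that places the true duplicate first for one query and third for another would score (1 + 1/3) / 2 under this definition.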
Abstract
Ancient manuscripts are the primary source of ancient linguistic corpora. However, many ancient manuscripts exhibit duplications due to unintentional repeated publication or deliberate forgery. The Dead Sea Scrolls, for example, include counterfeit fragments, whereas Oracle Bones (OB) contain both republished materials and fabricated specimens. Identifying ancient manuscript duplicates is of great significance for both archaeological curation and ancient history study. In this work, we design a progressive OB duplicate discovery framework that combines unsupervised low-level keypoints matching with high-level text-centric content-based matching to refine and rank the candidate OB duplicates with semantic awareness and interpretability. We compare our model with state-of-the-art content-based image retrieval and image matching methods, showing that our model yields comparable recall…
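The coarse-to-fine idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes keypoint match counts and transcriptions are already available, and it swaps in a simple character-set Jaccard similarity as a stand-in for the paper's text-centric matching. All names (`coarse_to_fine`, `keypoint_matches`, `transcriptions`, `top_k`) are hypothetical.

```python
from typing import Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the character sets of two transcriptions."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def coarse_to_fine(
    query_text: str,
    keypoint_matches: Dict[str, int],  # candidate id -> matched keypoint count (assumed precomputed)
    transcriptions: Dict[str, str],    # candidate id -> transcribed text (assumed available)
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    # Coarse stage: shortlist the candidates with the most keypoint matches.
    shortlist = sorted(keypoint_matches, key=keypoint_matches.get, reverse=True)[:top_k]
    # Fine stage: re-rank the shortlist by text similarity to the query.
    scored = [(cid, jaccard(query_text, transcriptions.get(cid, ""))) for cid in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch the coarse stage only prunes: a candidate with few keypoint matches never reaches the text stage, however similar its transcription. The returned text scores give a human-readable reason for each ranking, which is the kind of interpretability the abstract points to.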
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling
