Do Transformers Understand Ancient Roman Coin Motifs Better than CNNs?
David Reid, Ognjen Arandjelovic

TL;DR
This paper compares Vision Transformers and CNNs for analyzing ancient Roman coin motifs, finding that ViT models outperform CNNs in accuracy for identifying semantic elements on coins.
Contribution
It is the first study to apply Vision Transformer models to ancient coin analysis, demonstrating their superior performance over CNNs in this domain.
Findings
ViT models outperform CNNs in accuracy.
ViT models effectively learn from multi-modal data.
The study advances automated ancient coin analysis methods.
Abstract
Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large collections of coins and to help collectors understand what they are buying or selling. Recent research in this area has shown promise in focusing on identification of semantic elements as they are commonly depicted on ancient coins, by using convolutional neural networks (CNNs). This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to the task of identification of semantic elements on coins, using fully automatic learning from multi-modal data (images and unstructured text). This article summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coins analysis and provides an evaluation of their performance. The ViT models were found to outperform the newly…
Taxonomy
Topics: Currency Recognition and Detection · Cold Fusion and Nuclear Reactions · Big Data and Digital Economy
