TL;DR
This paper introduces the TAB-VLM benchmark to evaluate how well vision-language models understand historical artifacts across different time periods, revealing significant limitations in current models' temporal reasoning.
Contribution
The work presents TAB-VLM, a dataset and evaluation benchmark of 600 questions across six categories, designed to measure temporal reasoning in vision-language models on 1,600 Indian cultural artifacts.
Findings
All ten state-of-the-art models evaluated show significant deficiencies on temporal reasoning tasks.
Even the best model (GPT-5.2) achieves only 58.7% overall accuracy on the benchmark.
Performance gaps are consistent across different architectures and sizes.
Abstract
Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across…
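As a rough illustration of how a per-category and overall accuracy figure like the one quoted above could be computed, here is a minimal sketch. The record layout, the category names, and the exact-match scoring rule are assumptions made for illustration; the abstract does not describe the benchmark's actual data format or scoring procedure.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute per-category and overall accuracy.

    `records` is an iterable of (category, predicted, gold) tuples.
    This layout is hypothetical, not the benchmark's real format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        if predicted == gold:  # assumed exact-match scoring
            correct[category] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return per_category, overall


# Toy, made-up records; the category names are placeholders.
records = [
    ("category_a", "B", "B"),
    ("category_a", "A", "C"),
    ("category_b", "D", "D"),
    ("category_b", "D", "D"),
]
per_category, overall = accuracy_by_category(records)
print(per_category, overall)
```

Note that overall accuracy here is pooled over all questions rather than averaged over categories; the two differ when categories have unequal sizes, and which one the paper reports is not stated in the abstract.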
