Understanding Museum Exhibits using Vision-Language Reasoning

Ada-Astrid Balauca; Sanjana Garai; Stefan Balauca; Rasesh Udayakumar Shetty; Naitik Agrawal; Dhwanil Subhashbhai Shah; Yuqian Fu; Xi Wang; Kristina Toutanova; Danda Pani Paudel; Luc Van Gool

arXiv:2412.01370 · cs.CV · September 10, 2025

Open Access · 1 dataset

TL;DR

This paper introduces a large-scale dataset of museum exhibit images and questions, trains vision-language models on it, and benchmarks their ability to answer complex, context-rich museum-related queries, highlighting the importance of domain-specific fine-tuning.

Contribution

The paper contributes a comprehensive museum exhibit dataset and evaluates vision-language models on real-world museum inquiry tasks, emphasizing the need for domain-specific fine-tuning.

Findings

01 Large vision-language models outperform smaller models in historical reasoning.

02 Fine-tuned models significantly surpass SOTA in attribute-based questions.

03 Expert-labeled dataset ensures high-quality, practical annotations.

Abstract

Museums serve as repositories of cultural heritage and historical artifacts from diverse epochs, civilizations, and regions, preserving well-documented collections that encapsulate vast knowledge, which, when systematically structured into large-scale datasets, can train specialized models. Visitors engage with exhibits through curiosity and questions, making expert domain-specific models essential for interactive query resolution and gaining historical insights. Understanding exhibits from images requires analyzing visual features and linking them to historical knowledge to derive meaningful correlations. We facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs for exhibits from all around the world; (b) training large vision-language models (VLMs) on the collected dataset; (c) benchmarking their ability on five…

Code & Models

Datasets

Asttrid/Museum-65-v1.0
dataset · 38 downloads

Taxonomy

Topics: Museums and Cultural Heritage · Language, Metaphor, and Cognition · Organizational Strategy and Culture

Methods: BLIP (Bootstrapping Language-Image Pre-training)