Understanding Museum Exhibits using Vision-Language Reasoning
Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool

TL;DR
This paper introduces a large-scale dataset of museum exhibit images and questions, trains vision-language models on it, and benchmarks their ability to answer complex, context-rich museum-related queries, highlighting the importance of domain-specific fine-tuning.
Contribution
The paper contributes a comprehensive museum exhibit dataset and an evaluation of vision-language models on real-world museum inquiry tasks, emphasizing the need for domain-specific fine-tuning.
Findings
Large vision-language models outperform smaller models in historical reasoning.
Fine-tuned models significantly surpass state-of-the-art (SOTA) models on attribute-based questions.
The expert-labeled dataset ensures high-quality, practical annotations.
Abstract
Museums serve as repositories of cultural heritage and historical artifacts from diverse epochs, civilizations, and regions, preserving well-documented collections that encapsulate vast knowledge, which, when systematically structured into large-scale datasets, can train specialized models. Visitors engage with exhibits through curiosity and questions, making expert domain-specific models essential for interactive query resolution and gaining historical insights. Understanding exhibits from images requires analyzing visual features and linking them to historical knowledge to derive meaningful correlations. We facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs for exhibits from all around the world; (b) training large vision-language models (VLMs) on the collected dataset; (c) benchmarking their ability on five…
Taxonomy
Topics: Museums and Cultural Heritage · Language, Metaphor, and Cognition · Organizational Strategy and Culture
Methods: BLIP (Bootstrapping Language-Image Pre-training)
