Explainable Search and Discovery of Visual Cultural Heritage Collections with Multimodal Large Language Models
Taylor Arnold, Lauren Tilton

TL;DR
This paper presents a novel approach using multimodal large language models to create explainable, flexible, and privacy-aware search and discovery interfaces for large visual cultural heritage collections, overcoming limitations of traditional visual embedding methods.
Contribution
It introduces a new multimodal LLM-based method for visual collection exploration that provides textual explanations and improved clustering, recommendation, and privacy features.
Findings
Effective clustering and recommendation demonstrated on documentary photographs
Generated concrete textual explanations for recommendations
Enhanced privacy and ethical considerations in search interfaces
Abstract
Many cultural institutions have made large digitized visual collections available online, often under permissible re-use licences. Creating interfaces for exploring and searching these collections is difficult, particularly in the absence of granular metadata. In this paper, we introduce a method for using state-of-the-art multimodal large language models (LLMs) to enable an open-ended, explainable search and discovery interface for visual collections. We show how our approach can create novel clustering and recommendation systems that avoid common pitfalls of methods based directly on visual embeddings. Of particular interest is the ability to offer concrete textual explanations of each recommendation without the need to preselect the features of interest. Together, these features can create a digital interface that is more open-ended and flexible while also being better suited to…
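As a rough illustration of the idea in the abstract (recommendations that come with a textual explanation rather than an opaque visual-embedding score), here is a minimal sketch. It is not the authors' implementation. It assumes each image already has a short caption produced by a multimodal LLM. The captions, the stopword list, and the function names `tokenize`, `vectorize`, `cosine`, and `recommend` are all hypothetical, and plain term counts with cosine similarity stand in for whatever text representation the paper actually uses.

```python
import math
from collections import Counter

# Hypothetical captions standing in for text a multimodal LLM might produce
# for each photograph. None of these come from the paper.
CAPTIONS = {
    "photo_a": "farmer standing beside wooden barn in rural field",
    "photo_b": "children playing outside wooden schoolhouse in rural town",
    "photo_c": "crowded city street with cars and storefront signs",
}

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"in", "and", "with", "the", "a", "of", "beside", "outside"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def vectorize(captions):
    """Map each item id to a term-count vector built from its caption."""
    return {key: Counter(tokenize(text)) for key, text in captions.items()}


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(query, captions):
    """Rank all other items by caption similarity to the query item.

    Each result is (score, item_id, shared_terms); the shared terms act as
    a simple human-readable explanation of why the item was recommended.
    """
    vectors = vectorize(captions)
    q = vectors[query]
    results = []
    for key, vec in vectors.items():
        if key == query:
            continue
        shared = sorted(set(q) & set(vec))
        results.append((cosine(q, vec), key, shared))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


if __name__ == "__main__":
    for score, key, shared in recommend("photo_a", CAPTIONS):
        reason = ", ".join(shared) if shared else "no shared terms"
        print(f"{key}: similarity={score:.2f} (shared: {reason})")
```

Because similarity is computed over text, the reason for a match can be read directly from the overlapping terms, which is the property the abstract contrasts with methods based directly on visual embeddings.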
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques
