Explainable Search and Discovery of Visual Cultural Heritage Collections with Multimodal Large Language Models

Taylor Arnold; Lauren Tilton

arXiv:2411.04663 · cs.CV · November 8, 2024 · 2 cites

Open Access

TL;DR

This paper presents a novel approach using multimodal large language models to create explainable, flexible, and privacy-aware search and discovery interfaces for large visual cultural heritage collections, overcoming limitations of traditional visual embedding methods.

Contribution

It introduces a new multimodal LLM-based method for visual collection exploration that provides textual explanations and improved clustering, recommendation, and privacy features.

Findings

01. Effective clustering and recommendation demonstrated on documentary photographs

02. Generated concrete textual explanations for recommendations

03. Enhanced privacy and ethical considerations in search interfaces

Abstract

Many cultural institutions have made large digitized visual collections available online, often under permissible re-use licences. Creating interfaces for exploring and searching these collections is difficult, particularly in the absence of granular metadata. In this paper, we introduce a method for using state-of-the-art multimodal large language models (LLMs) to enable an open-ended, explainable search and discovery interface for visual collections. We show how our approach can create novel clustering and recommendation systems that avoid common pitfalls of methods based directly on visual embeddings. Of particular interest is the ability to offer concrete textual explanations of each recommendation without the need to preselect the features of interest. Together, these features can create a digital interface that is more open-ended and flexible while also being better suited to…
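
The idea of recommending images through text, with an explanation attached, can be illustrated with a small sketch. This is not the authors' implementation: it assumes each image already has a caption produced by a multimodal LLM, and it swaps in plain TF-IDF with cosine similarity as a stand-in for whatever text comparison the paper uses. The captions, identifiers, stopword list, and function names below are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical captions; in practice a multimodal LLM would generate these.
CAPTIONS = {
    "photo_a": "a farmer stands beside a wooden barn with two horses",
    "photo_b": "two horses graze in a field near a wooden barn",
    "photo_c": "children play on a city street beside parked cars",
}

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "in", "on", "with", "near", "beside", "of", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def tfidf_vectors(captions):
    """Return a sparse TF-IDF vector (dict of term -> weight) per caption."""
    docs = {key: Counter(tokenize(text)) for key, text in captions.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for counts in docs.values() for term in counts)
    # Smoothed inverse document frequency, so every weight stays positive.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    return {
        key: {t: c * idf[t] for t, c in counts.items()}
        for key, counts in docs.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(query_id, captions):
    """Return the most similar other image and the terms the captions share."""
    vecs = tfidf_vectors(captions)
    query = vecs[query_id]
    best_id = max(
        (key for key in vecs if key != query_id),
        key=lambda key: cosine(query, vecs[key]),
    )
    shared_terms = sorted(set(query) & set(vecs[best_id]))
    return best_id, shared_terms


if __name__ == "__main__":
    best, why = recommend("photo_a", CAPTIONS)
    print(f"Recommended: {best}; shared terms: {', '.join(why)}")
```

Because the comparison runs over words rather than opaque visual embeddings, the shared terms act as a human-readable reason for the recommendation. That is the property the abstract highlights, although an LLM-written explanation would be far richer than a list of overlapping words.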

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques