WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata

Prasanna Sridhar; Horace Lee; David M. S. Pinto; Andrew Zisserman; Abhishek Dutta

arXiv:2602.12819 · cs.IR · February 16, 2026

Open Access

TL;DR

WISE is an open-source multimodal search engine enabling users to perform complex, cross-modal queries on audiovisual data and metadata, supporting scalable retrieval and easy integration of new models.

Contribution

The paper introduces WISE, a versatile, scalable, and user-friendly multimodal search engine that integrates multiple retrieval modalities into a single open-source platform.

Findings

01. Supports natural-language and reverse-image queries across scenes and objects

02. Enables face, audio, speech, and metadata search with high scalability

03. Modular architecture allows easy integration of new models

Abstract

In this paper, we present WISE, an open-source audiovisual search engine which integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. WISE supports natural-language and reverse-image queries at both the scene level (e.g. empty street) and object level (e.g. horse) across images and videos; face-based search for specific individuals; audio retrieval of acoustic events using text (e.g. wood creak) or an audio file; search over automatically transcribed speech; and filtering by user-provided metadata. Rich insights can be obtained by combining queries across modalities -- for example, retrieving German trains from a historical archive by applying the object query "train" and the metadata query "Germany", or searching for a face in a place. By employing vector search techniques, WISE can scale to support…
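The cross-modal example in the abstract (object query "train" combined with metadata query "Germany") can be pictured as a metadata filter followed by a vector similarity ranking. The sketch below is only an illustration of that general idea, not WISE's actual implementation or API: the `ARCHIVE` records, the 3-dimensional embeddings, the `country` field, and the `search` function are all hypothetical, and a real system would obtain embeddings from a learned model and use an approximate nearest-neighbour index rather than a full scan.

```python
import math

# Hypothetical toy archive. The embeddings are hand-made 3-d vectors standing
# in for what a vision-language model would produce; nothing here is WISE data.
ARCHIVE = [
    {"id": "img_001", "embedding": [0.9, 0.1, 0.0], "metadata": {"country": "Germany"}},
    {"id": "img_002", "embedding": [0.8, 0.2, 0.1], "metadata": {"country": "France"}},
    {"id": "img_003", "embedding": [0.1, 0.9, 0.2], "metadata": {"country": "Germany"}},
    {"id": "img_004", "embedding": [0.7, 0.0, 0.3], "metadata": {"country": "Germany"}},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_embedding, metadata_filter, items, top_k=2):
    """Keep items whose metadata matches every filter key, then rank the
    survivors by similarity to the query embedding (exhaustive scan)."""
    candidates = [
        item
        for item in items
        if all(item["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:top_k]]


# Assumed embedding of the text query "train" (made up for this sketch).
TRAIN_QUERY = [1.0, 0.0, 0.0]

results = search(TRAIN_QUERY, {"country": "Germany"}, ARCHIVE)
print(results)
```

Filtering before ranking keeps the similarity computation restricted to records that already satisfy the metadata constraint; applying the filter after retrieving the top matches is the other common ordering, and which one a given system uses is a design choice this sketch does not settle.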

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques