MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality
Mengzhao Wang, Xiangyu Ke, Xiaoliang Xu, Lu Chen, Yunjun Gao, Pinpin Huang, Runkai Zhu

TL;DR
MUST is a scalable framework for multimodal search that fuses multiple modalities using learned weights and a fused proximity graph, achieving over 10x faster search and an average of 93% higher accuracy than baseline methods.
Contribution
The paper introduces MUST, a novel multimodal search framework that employs hybrid fusion, vector weight learning, and a fused proximity graph for improved accuracy and efficiency.
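To make the idea of weighting modalities concrete, here is a minimal sketch of one way weighted fusion can work: each modality's vector is scaled by a weight and the results are concatenated, so a single inner product over the fused vectors equals a weighted sum of per-modality inner products. The function name `fuse`, the fixed weights, and the toy vectors are assumptions for illustration; MUST learns its weights, and this sketch does not reproduce that learning step or the paper's exact formulation.

```python
import numpy as np

def fuse(vectors, weights):
    """Scale each modality's vector by its weight, then concatenate."""
    return np.concatenate([w * v for w, v in zip(weights, vectors)])

# Hypothetical two-modality example (e.g. an image vector and a text vector).
weights = [0.8, 0.6]  # assumed fixed here; MUST learns its weights
query = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
obj = [np.array([0.6, 0.8]), np.array([0.0, 1.0])]

# One inner product over the fused vectors...
joint = float(fuse(query, weights) @ fuse(obj, weights))

# ...equals the sum of per-modality inner products scaled by squared weights.
per_modality = sum(w**2 * float(q @ o) for w, q, o in zip(weights, query, obj))

assert np.isclose(joint, per_modality)
```

Because the weights enter only as a per-modality scaling, a fused vector of this kind can be indexed by any single-vector search structure.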
Findings
Achieves over 10x faster search times compared to baselines.
Attains an average of 93% higher accuracy in multimodal retrieval.
Scales effectively to datasets with over 10 million elements.
Abstract
We investigate the problem of multimodal search of target modality, where the task involves enhancing a query in a specific target modality by integrating information from auxiliary modalities. The goal is to retrieve relevant objects whose contents in the target modality match the specified multimodal query. The paper first introduces two baseline approaches that integrate techniques from the Database, Information Retrieval, and Computer Vision communities. These baselines either merge the results of separate vector searches for each modality or perform a single-channel vector search by fusing all modalities. However, both baselines have limitations in terms of efficiency and accuracy as they fail to adequately consider the varying importance of fusing information across modalities. To overcome these limitations, the paper proposes a novel framework, called MUST. Our framework employs…
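The two baseline styles described in the abstract can be sketched with brute-force search over toy data. This is only an illustration under stated assumptions: the helper names (`top_k`, `merge_baseline`, `fused_baseline`), the intersection merge rule, and the unweighted concatenation are hypothetical choices, not the paper's implementations, and a real system would use an approximate vector index rather than an exhaustive scan.

```python
import numpy as np

def top_k(query, database, k):
    """Indices of the k rows of `database` with the largest inner product to `query`."""
    scores = database @ query
    return np.argsort(-scores)[:k]

def merge_baseline(queries, databases, k):
    """Search each modality separately, then keep objects present in every result list.
    Intersection is an assumed merge rule chosen for simplicity."""
    result_sets = [set(top_k(q, db, k).tolist()) for q, db in zip(queries, databases)]
    return sorted(set.intersection(*result_sets))

def fused_baseline(queries, databases, k):
    """Concatenate all modalities (unweighted) and run a single vector search."""
    q = np.concatenate(queries)
    db = np.hstack(databases)
    return top_k(q, db, k).tolist()

# Toy data: 4 objects, each with a target-modality and an auxiliary-modality vector.
target_db = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.4, 0.6]])
aux_db = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.5, 0.5]])
queries = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]

merged = merge_baseline(queries, [target_db, aux_db], k=2)
fused = fused_baseline(queries, [target_db, aux_db], k=2)
print(merged, fused)
```

In this toy case the merge approach returns a single object even though two were requested, since only one object appears in both per-modality lists. The fused approach treats both modalities as equally important, which is the kind of limitation the abstract attributes to the baselines.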
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Remote-Sensing Image Classification
