MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching
Wanqing Cui, Rui Cheng, Jiafeng Guo, Xueqi Cheng

TL;DR
This paper introduces MVAM, a multi-view attention approach that improves image-text matching by capturing diverse, fine-grained details through multiple attention heads with a diversity objective, leading to better retrieval accuracy.
Contribution
The paper proposes a novel multi-view attention method with diversity constraints to enhance two-stream models for more comprehensive image-text representations.
Findings
Improved performance on MSCOCO and Flickr30K datasets.
Attention heads focus on distinct content aspects.
Enhanced fine-grained matching accuracy.
Abstract
Existing two-stream models such as CLIP encode images and text as independent representations. Because they perform well while keeping retrieval fast, they have attracted attention from both industry and academia. However, a single representation often struggles to capture complex content fully, so such models may ignore fine-grained information during matching, resulting in suboptimal retrieval results. To overcome this limitation and enhance the performance of two-stream models, we propose a Multi-view Attention Method (MVAM) for image-text matching. This approach leverages diverse attention heads with unique view codes to learn multiple representations for images and text, which are then concatenated for matching. We also incorporate a diversity objective to explicitly encourage attention heads to focus on distinct aspects of the input data, capturing complementary fine-grained details.…
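The mechanism described in the abstract (one attention head per view code, concatenated outputs, and a diversity term) can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: view codes are treated as query vectors over token features, attention is scaled dot-product with softmax, and the diversity term is the mean pairwise overlap between the heads' attention distributions. The function names (`multi_view_pool`, `diversity_penalty`, `cosine`) are hypothetical and not taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_view_pool(tokens, view_codes):
    """Pool token features once per view code and concatenate the results.

    Assumption: each view code acts as a query vector (not confirmed by the source).
    Returns (concatenated representation, list of per-view attention weights).
    """
    d = len(tokens[0])
    concat, attn = [], []
    for code in view_codes:
        weights = softmax([dot(tok, code) / math.sqrt(d) for tok in tokens])
        pooled = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]
        concat.extend(pooled)
        attn.append(weights)
    return concat, attn

def diversity_penalty(attn):
    """Mean pairwise dot product between attention distributions of different views.

    Hypothetical stand-in for the paper's diversity objective: lower values
    mean the heads attend to less overlapping tokens.
    """
    k = len(attn)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    if not pairs:
        return 0.0
    return sum(dot(attn[i], attn[j]) for i, j in pairs) / len(pairs)

def cosine(u, v):
    # Matching score between two concatenated multi-view representations.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy inputs: three 2-dimensional token features and two view codes.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.9, 0.1], [0.2, 0.8]]
view_codes = [[4.0, 0.0], [0.0, 4.0]]

image_repr, image_attn = multi_view_pool(image_tokens, view_codes)
text_repr, text_attn = multi_view_pool(text_tokens, view_codes)
match_score = cosine(image_repr, text_repr)
penalty = diversity_penalty(image_attn)
```

In a training setup along these lines, the penalty would be added to the matching loss so that heads are pushed toward different tokens. Because the two streams are encoded independently, the concatenated vectors can be precomputed per image and per text, which is what keeps retrieval fast in two-stream models.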
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Image Processing and 3D Reconstruction
Methods: Focus · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
