Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching

Wenjing Chen

arXiv:2412.19184·cs.CV·December 30, 2024

Open Access

TL;DR

This paper introduces MH-CVSE, a visual-semantic embedding model built on multi-head self-attention. It aims to improve image-text matching by capturing complex image-text relationships and stabilizing training, and it reports results on Flickr30k that surpass previous methods.

Contribution

The study extends the consensus-aware visual-semantic embedding model (CVSE) with a multi-head self-attention mechanism, a parameterized feature fusion strategy, and dynamic loss weighting to improve image-text matching.

Findings

1. Outperforms previous methods on the Flickr30k dataset
2. Enhances understanding of complex image-text relationships
3. Achieves more stable convergence during training

Abstract

With the rapid development of multimodal learning, image-text matching, a task that bridges vision and language, has become increasingly important. Building on existing research, this study proposes a visual-semantic embedding model, Multi-Headed Consensus-Aware Visual-Semantic Embedding (MH-CVSE). The model adds a multi-head self-attention mechanism to the consensus-aware visual-semantic embedding model (CVSE) to capture information in multiple subspaces in parallel, significantly enhancing its ability to understand and represent the complex relationships between images and texts. In addition, we adopt a parameterized feature fusion strategy to flexibly integrate feature information at different levels, further improving the model's expressive power. In terms of loss function design, the MH-CVSE model adopts a dynamic weight adjustment strategy…
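The two mechanisms the abstract names, attention over several subspaces in parallel and a parameterized fusion of feature levels, can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation. The function names, the random stand-in weights, and in particular the choice of a single scalar gate `alpha` for fusion are assumptions made for illustration; the abstract does not specify how the fusion is parameterized.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(d_model, rng):
    """Hypothetical parameter set: query, key, value, and output projections."""
    return {name: random_matrix(d_model, d_model, rng) for name in ("wq", "wk", "wv", "wo")}


def multi_head_self_attention(x, params, num_heads):
    """Scaled dot-product self-attention, computed independently per head."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    scale = math.sqrt(d_head)
    q, k, v = (matmul(x, params[name]) for name in ("wq", "wk", "wv"))
    concat = [[] for _ in x]
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        # Each head sees only its own slice (subspace) of the projections.
        qh = [row[lo:hi] for row in q]
        kh = [row[lo:hi] for row in k]
        vh = [row[lo:hi] for row in v]
        weights = [
            softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in kh])
            for qi in qh
        ]
        for i, out_row in enumerate(matmul(weights, vh)):
            concat[i].extend(out_row)
    # Mix the concatenated head outputs with the output projection.
    return matmul(concat, params["wo"])


def fuse(original, attended, alpha):
    """Assumed fusion: a sigmoid gate on scalar alpha blends the two feature levels."""
    gate = 1.0 / (1.0 + math.exp(-alpha))
    return [
        [gate * a + (1.0 - gate) * o for o, a in zip(o_row, a_row)]
        for o_row, a_row in zip(original, attended)
    ]


rng = random.Random(0)
features = random_matrix(4, 8, rng)  # e.g. 4 region or word features of size 8
params = init_params(8, rng)
attended = multi_head_self_attention(features, params, num_heads=2)
fused = fuse(features, attended, alpha=0.0)  # alpha = 0 weights both levels equally
```

In a trained model the projection matrices and `alpha` would be learned jointly with the rest of the network; here they are fixed so that the data flow is easy to follow.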

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: ADaptive gradient method with the OPTimal convergence rate (ADOPT) · Cosine Annealing