Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching
Wenjing Chen

TL;DR
This paper introduces MH-CVSE, a visual-semantic embedding model based on multi-head self-attention for image-text matching. The model aims to capture complex image-text relationships and to stabilize training, and it reports results on Flickr30k that surpass previous methods.
Contribution
The study extends the consensus-aware visual-semantic embedding model (CVSE) with a multi-head self-attention mechanism, a parameterized feature fusion strategy, and dynamic loss weighting to improve image-text matching.
Findings
Outperforms previous methods on the Flickr30k dataset
Enhances understanding of complex image-text relationships
Achieves more stable convergence during training
Abstract
With the rapid development of multimodal learning, image-text matching, a bridge between vision and language, has become increasingly important. Building on existing research, this study proposes a visual-semantic embedding model, Multi-Headed Consensus-Aware Visual-Semantic Embedding (MH-CVSE). The model adds a multi-head self-attention mechanism to the consensus-aware visual-semantic embedding model (CVSE) to capture information in multiple subspaces in parallel, significantly enhancing its ability to understand and represent the complex relationships between images and texts. In addition, we adopt a parameterized feature fusion strategy to flexibly integrate feature information at different levels, further improving the model's expressive power. For the loss function, MH-CVSE adopts a dynamic weight adjustment strategy…
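To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch of multi-head self-attention over a set of feature vectors, followed by a simple weighted fusion of the original and attended features. It is not the authors' implementation. The helper names (`matmul`, `softmax`, `random_matrix`, `multi_head_self_attention`, `fuse`), the random weights standing in for learned parameters, and the single scalar fusion weight `alpha` are all assumptions made for illustration, since the abstract does not specify the fusion formula.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multi_head_self_attention(x, num_heads, rng):
    """Scaled dot-product self-attention computed in num_heads subspaces.

    x is a list of n feature vectors of length d; d must divide by num_heads.
    """
    d = len(x[0])
    if d % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    d_head = d // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head projects the input into its own lower-dimensional subspace.
        q = matmul(x, random_matrix(d, d_head, rng))
        k = matmul(x, random_matrix(d, d_head, rng))
        v = matmul(x, random_matrix(d, d_head, rng))
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)  # scores[i][j] = q_i . k_j
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        heads.append(matmul(weights, v))
    # Concatenate the head outputs per position, then mix them with an output projection.
    concat = [sum((h[i] for h in heads), []) for i in range(len(x))]
    return matmul(concat, random_matrix(d, d, rng))


def fuse(original, attended, alpha):
    """Hypothetical fusion: alpha * attended + (1 - alpha) * original."""
    return [[alpha * a + (1 - alpha) * o for o, a in zip(row_o, row_a)]
            for row_o, row_a in zip(original, attended)]


rng = random.Random(0)
features = random_matrix(4, 8, rng)  # e.g. 4 image regions or words, 8 dimensions each
attended = multi_head_self_attention(features, num_heads=2, rng=rng)
fused = fuse(features, attended, alpha=0.5)
```

In a trained model the projection matrices and the fusion weight would be learned rather than sampled or fixed. With `alpha` set to 0 the fusion returns the original features unchanged, which is one way to see how such a parameter could control the contribution of the attention branch.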
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: ADaptive gradient method with the OPTimal convergence rate · Cosine Annealing
