Remodeling Semantic Relationships in Vision-Language Fine-Tuning
Xiangyang Wu, Liu Liu, Baosheng Yu, Jiayan Qiu, Zhenwei Shi

TL;DR
This paper introduces a novel vision-language fine-tuning method that enhances multimodal alignment by explicitly modeling semantic relationships and visual cues, leading to improved performance on VQA and captioning tasks.
Contribution
It proposes a new approach that extracts multilevel semantic features, groups related semantics, and uses inheritable cross-attention to better align visual and textual information.
Findings
Outperforms existing methods on eight foundation models.
Improves accuracy in visual question answering.
Enhances image captioning performance.
Abstract
Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, thus leading to suboptimal performance. Toward solving this problem, we propose a method that can improve multimodal alignment and fusion based on both semantics and relationships.Specifically, we first extract multilevel semantic features from different vision encoder to capture more visual cues of the relationships. Then, we learn to project the vision features to group related semantics, among which are more likely to have relationships. Finally, we fuse the visual features with the textual by using inheritable cross-attention, where we globally remove the redundant visual relationships by…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
