TL;DR
This paper introduces DSRAN (Dual Semantic Relations Attention Network), a graph attention-based network that strengthens semantic relations both among image regions and between regions and global concepts, with the goal of improving image-text matching accuracy.
Contribution
It proposes a dual semantic relations attention network with separate and joint modules for multi-level relation learning, advancing cross-modal representation alignment.
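The separate-then-joint idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain dot-product self-attention with no learned weights, a fully connected graph, and mean pooling, and the names `self_attention`, `mean_pool`, and `image_embedding` are hypothetical. It only shows how relations might first be refined within each level and then across levels.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(nodes):
    """Scaled dot-product self-attention over a fully connected graph.

    Assumption: no learned projections; each node attends to every node
    (including itself) and is replaced by the weighted average of all nodes.
    """
    dim = len(nodes[0])
    updated = []
    for query in nodes:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in nodes
        ]
        weights = softmax(scores)
        updated.append(
            [sum(w * node[d] for w, node in zip(weights, nodes)) for d in range(dim)]
        )
    return updated


def mean_pool(nodes):
    """Average a list of feature vectors into one vector."""
    dim = len(nodes[0])
    return [sum(node[d] for node in nodes) / len(nodes) for d in range(dim)]


def image_embedding(region_feats, global_feats):
    """Hypothetical two-stage pipeline: separate relations, then joint relations."""
    # Separate stage: relations within each level only.
    regions = self_attention(region_feats)
    global_nodes = self_attention(global_feats)
    # Joint stage: region and global nodes share one graph, so regional
    # features can pick up global context.
    joint = self_attention(regions + global_nodes)
    return mean_pool(joint)


# Toy input: two 2-d region features and one 2-d global feature.
embedding = image_embedding([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]])
```

A real model would add learned projections, multiple attention heads, and a matching text encoder; the sketch leaves these out to keep the two-stage structure visible.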
Findings
Outperforms previous methods on the MS-COCO and Flickr30K datasets, with a significant improvement in matching accuracy.
Effectively learns hierarchical semantic relations for better image-text matching.
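Matching accuracy on benchmarks such as MS-COCO and Flickr30K is commonly reported as Recall@K, the fraction of queries whose correct match appears among the top K retrieved candidates. The summary above does not state which metric the paper uses, so the following is a generic sketch under that assumption. The function name `recall_at_k` and the one-correct-candidate-per-query setup are illustrative simplifications.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    ground_truth[i] is the index of the single correct candidate for
    query i (an assumption; real benchmarks can have several).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if ground_truth[i] in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example: 3 text queries scored against 3 images.
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
correct = [0, 1, 2]
r1 = recall_at_k(scores, correct, 1)
r3 = recall_at_k(scores, correct, 3)
```

In this toy example the second query ranks its correct image last, so it counts as a miss at K=1 and only becomes a hit once K covers all three candidates.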
Abstract
Image-text matching is a major task in cross-modal information processing. The main challenge is to learn unified visual and textual representations. Previous methods that perform well on this task focus not only on the alignment between region features in images and the corresponding words in sentences, but also on the alignment between relations of regions and relational words. However, without joint learning of regional and global features, the regional features lose contact with the global context, which leads to mismatches with non-object words that carry global meaning in some sentences. In this work, in order to alleviate this issue, it is necessary to enhance the relations between regions and the relations between regional and global concepts to obtain a more accurate visual representation so as to be better correlated to the…
