Improving Visually Grounded Sentence Representations with Self-Attention
Kang Min Yoo, Youhyun Shin, Sang-goo Lee

TL;DR
This paper adds a self-attention mechanism to sentence encoders that are trained jointly with image features, strengthening the visual grounding of the resulting sentence representations. The self-attentive encoders perform better on transfer tasks because they exploit words with strong visual associations.
Contribution
The paper proposes a self-attention-based approach to improve visual grounding in sentence representations trained with image features.
Findings
Self-attentive encoders enhance visual grounding capabilities.
Improved transfer task performance with self-attention.
Better exploitation of visually associated words.
Abstract
Sentence representation models trained only on language could potentially suffer from the grounding problem. Recent work has shown promising results in improving the quality of sentence representations by jointly training them with associated image features. However, the grounding capability is limited by the distant connection between input sentences and image features imposed by the architecture's design. To further close the gap, we propose applying a self-attention mechanism to the sentence encoder to deepen the grounding effect. Our results on transfer tasks show that self-attentive encoders are better for visual grounding, as they exploit specific words with strong visual associations.
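To make the idea of a self-attentive sentence encoder concrete, the following is a minimal sketch of attention-weighted pooling over per-word hidden states. It is not the authors' implementation: the additive scoring form and the names `self_attentive_pool`, `proj`, and `query` are assumptions chosen for illustration, and the toy numbers are invented. The point is only that the sentence vector becomes a weighted sum in which some words (for example, visually salient ones) can receive more weight than others.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attentive_pool(hidden_states, proj, query):
    """Pool word-level states into one sentence vector (hypothetical helper).

    hidden_states: list of T word vectors, each of dimension d
    proj:          k x d projection matrix (assumed learned elsewhere)
    query:         k-dimensional scoring vector (assumed learned elsewhere)
    Returns (sentence_vector, attention_weights).
    """
    scores = []
    for h in hidden_states:
        # Additive attention score: query . tanh(proj @ h)
        projected = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in proj]
        scores.append(sum(q * p for q, p in zip(query, projected)))

    weights = softmax(scores)

    # Sentence vector = attention-weighted sum of the word states.
    dim = len(hidden_states[0])
    sentence = [
        sum(a * h[j] for a, h in zip(weights, hidden_states)) for j in range(dim)
    ]
    return sentence, weights


# Toy input: three "words" with 2-dimensional states (made-up values).
hidden_states = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
proj = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 1.0]

sentence, weights = self_attentive_pool(hidden_states, proj, query)
print(weights)
print(sentence)
```

In a grounded setup of the kind the abstract describes, the pooled sentence vector would then be trained against image features, so the attention weights are free to concentrate on words that help predict the image.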
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
