Improving Visually Grounded Sentence Representations with Self-Attention

Kang Min Yoo; Youhyun Shin; Sang-goo Lee

arXiv:1712.00609 · cs.CL · December 5, 2017 · 6 citations

Open Access

TL;DR

This paper introduces a self-attention mechanism into sentence encoders to enhance visual grounding in multimodal representations, leading to improved performance on transfer tasks by better exploiting words with strong visual associations.

Contribution

The paper proposes a novel self-attention-based approach to improve visual grounding in sentence representations trained with image features.

Findings

01. Self-attentive encoders enhance visual grounding capabilities.

02. Self-attention improves performance on transfer tasks.

03. Self-attentive encoders better exploit words with strong visual associations.

Abstract

Sentence representation models trained only on language could potentially suffer from the grounding problem. Recent work has shown promising results in improving the quality of sentence representations by jointly training them with associated image features. However, the grounding capability is limited by the distant connection between input sentences and image features in the design of the architecture. To further close the gap, we propose applying a self-attention mechanism to the sentence encoder to deepen the grounding effect. Our results on transfer tasks show that self-attentive encoders are better for visual grounding, as they exploit specific words with strong visual associations.
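To make the idea of a self-attentive sentence encoder concrete, here is a minimal sketch of attention-based pooling over per-word hidden states. The abstract does not give the paper's exact formulation, so this assumes a generic additive scoring function (a tanh projection followed by a scoring vector and a softmax). The names `self_attentive_pool`, `w_proj`, and `v_score` are hypothetical, and the random inputs stand in for real encoder outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attentive_pool(hidden_states, w_proj, v_score):
    """Pool per-word hidden states into a single sentence vector.

    hidden_states: list of T vectors of length d (e.g. encoder outputs).
    w_proj: assumed projection matrix, given as a rows of length d.
    v_score: assumed scoring vector of length a.
    Returns (sentence_vector, attention_weights).
    """
    scores = []
    for h in hidden_states:
        # Project each hidden state, squash with tanh, then score it.
        projected = [math.tanh(sum(w * x for w, x in zip(row, h)))
                     for row in w_proj]
        scores.append(sum(v * p for v, p in zip(v_score, projected)))
    # One weight per word; the weights sum to 1.
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Sentence vector is the attention-weighted sum of the hidden states.
    sentence = [sum(wt * h[i] for wt, h in zip(weights, hidden_states))
                for i in range(dim)]
    return sentence, weights


# Toy example: 5 words, hidden size 4, attention size 3 (all illustrative).
rng = random.Random(0)
T, d, a = 5, 4, 3
hidden = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
w_proj = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(a)]
v_score = [rng.uniform(-1, 1) for _ in range(a)]
sentence_vec, attn = self_attentive_pool(hidden, w_proj, v_score)
```

In a grounded setup of the kind the abstract describes, the pooled sentence vector would be the representation trained against the image features, so words that receive larger attention weights contribute more directly to the grounding objective.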

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques