Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks

Gouthaman KV; Athira Nambiar; Kancheti Sai Srinivas; Anurag Mittal

arXiv:2008.08012·cs.CV·August 27, 2021

TL;DR

This paper introduces Linguistically-aware Attention (LAT), an attention mechanism that incorporates linguistic understanding into vision-language models to bridge the semantic gap, improving performance on Counting-VQA, VQA, and image captioning.

Contribution

The paper proposes LAT, an attention mechanism that combines object attributes from generic object detectors with pre-trained language models, so that attention operates over a common, linguistically rich representation of the visual and textual modalities.
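
The idea of attending in a shared linguistic space can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention, uses random unit vectors as stand-ins for pre-trained word embeddings of detected object labels, and the function name `label_space_attention` and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def label_space_attention(visual_feats, label_embs, question_emb):
    """Hypothetical helper: score each detected object by comparing the
    word embedding of its label with the question embedding, then pool
    the visual features with the resulting attention weights."""
    scores = label_embs @ question_emb / np.sqrt(question_emb.shape[0])
    weights = softmax(scores)
    attended = weights @ visual_feats
    return weights, attended

rng = np.random.default_rng(0)
num_objects, visual_dim, word_dim = 4, 8, 5

# Assumption: one visual feature vector per detected object.
visual_feats = rng.normal(size=(num_objects, visual_dim))

# Assumption: random unit vectors stand in for label embeddings that a
# pre-trained language model would provide.
label_embs = rng.normal(size=(num_objects, word_dim))
label_embs /= np.linalg.norm(label_embs, axis=1, keepdims=True)

# Toy question embedding aligned with the label of object 2.
question_emb = 3.0 * label_embs[2]

weights, attended = label_space_attention(visual_feats, label_embs, question_emb)
```

Because both the object labels and the question live in the same embedding space here, the object whose label best matches the question receives the largest weight. In the paper, attributes and richer language-model representations play this role.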

Findings

01. Achieved state-of-the-art results on five Counting-VQA datasets.

02. Consistently improved baseline models for VQA and image captioning.

03. Demonstrated the generic effectiveness of LAT across multiple tasks.

Abstract

Attention models are widely used in Vision-language (V-L) tasks to perform the visual-textual correlation. Humans perform such a correlation with a strong linguistic understanding of the visual world. However, even the best performing attention model in V-L tasks lacks such a high-level linguistic understanding, thus creating a semantic gap between the modalities. In this paper, we propose an attention mechanism - Linguistically-aware Attention (LAT) - that leverages object attributes obtained from generic object detectors along with pre-trained language models to reduce this semantic gap. LAT represents visual and textual modalities in a common linguistically-rich space, thus providing linguistic awareness to the attention process. We apply and demonstrate the effectiveness of LAT in three V-L tasks: Counting-VQA, VQA, and Image captioning. In Counting-VQA, we propose a novel…
