Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks
Gouthaman KV, Athira Nambiar, Kancheti Sai Srinivas, Anurag Mittal

TL;DR
This paper introduces Linguistically-aware Attention (LAT), an attention mechanism that brings linguistic understanding into vision-language models to narrow the semantic gap between the visual and textual modalities, improving performance on Counting-VQA, VQA, and image captioning.
Contribution
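The paper proposes LAT, an attention mechanism that combines object attributes obtained from generic object detectors with pre-trained language models, representing both modalities in a common linguistically-rich space so that the attention process is linguistically aware.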
Findings
Achieved state-of-the-art results on five Counting-VQA datasets.
Consistently improved baseline models for VQA and image captioning.
Demonstrated that LAT is a generic mechanism, effective across three V-L tasks: Counting-VQA, VQA, and image captioning.
Abstract
Attention models are widely used in Vision-language (V-L) tasks to perform the visual-textual correlation. Humans perform such a correlation with a strong linguistic understanding of the visual world. However, even the best performing attention model in V-L tasks lacks such a high-level linguistic understanding, thus creating a semantic gap between the modalities. In this paper, we propose an attention mechanism - Linguistically-aware Attention (LAT) - that leverages object attributes obtained from generic object detectors along with pre-trained language models to reduce this semantic gap. LAT represents visual and textual modalities in a common linguistically-rich space, thus providing linguistic awareness to the attention process. We apply and demonstrate the effectiveness of LAT in three V-L tasks: Counting-VQA, VQA, and Image captioning. In Counting-VQA, we propose a novel…
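The idea of attending in a shared linguistic space can be illustrated with a toy sketch. This is not the authors' implementation: the embedding table, the averaging step, and the dot-product scoring are all assumptions made for illustration, and `TOY_EMBEDDINGS`, `embed`, and `linguistic_attention` are hypothetical names. The sketch only shows the general pattern the abstract describes, in which detector-predicted object labels and attributes are mapped into the same word-embedding space as the question, so that attention weights reflect linguistic similarity.

```python
import math

# Hypothetical 3-d word vectors standing in for a pre-trained language model.
TOY_EMBEDDINGS = {
    "red": [1.0, 0.0, 0.0],
    "apple": [0.0, 1.0, 0.0],
    "dog": [0.0, 0.0, 1.0],
    "brown": [0.7, 0.0, 0.3],
}
EMBED_DIM = 3


def embed(words):
    """Average the toy embeddings of the given words; unknown words are skipped."""
    vecs = [TOY_EMBEDDINGS[w] for w in words if w in TOY_EMBEDDINGS]
    if not vecs:
        return [0.0] * EMBED_DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linguistic_attention(question_words, region_labels):
    """Score each image region by the dot product between the question embedding
    and the embedding of that region's detector labels (class and attributes),
    then normalise the scores into attention weights."""
    q = embed(question_words)
    regions = [embed(labels) for labels in region_labels]
    scores = [sum(a * b for a, b in zip(q, r)) for r in regions]
    return softmax(scores)


# Two regions, described by assumed detector outputs (attribute + class).
weights = linguistic_attention(
    ["red", "apple"],
    [["red", "apple"], ["brown", "dog"]],
)
print(weights)
```

In this toy setup the region labelled "red apple" receives the larger weight for the question words "red apple", because both sides are compared in the same embedding space. A real model would use learned projections and actual pre-trained embeddings rather than a hand-written table.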
