Task-driven Visual Saliency and Attention-based Visual Question Answering

Yuetan Lin; Zhangyang Pang; Donghui Wang; Yueting Zhuang

arXiv:1702.06700 · cs.CV · February 23, 2017 · 22 citations

TL;DR

This paper introduces an attention mechanism for visual question answering (VQA) that incorporates saliency-like pre-selection and a bidirectional LSTM (BiLSTM) to better model the interactions between visual and textual features, improving performance.

Contribution

The paper proposes a new attention method that combines saliency-like pre-selection, performed with a BiLSTM, with element-wise multiplication to improve the fusion of visual and textual features in VQA.

Findings

01. Achieved strong empirical results on the COCO-VQA dataset.

02. Demonstrated more focused attention on VQA tasks.

Abstract

Visual question answering (VQA) has witnessed great progress since May 2015 as a classic problem unifying visual and textual data in a single system. Many enlightening VQA works delve deeply into image and question encodings and fusion methods, of which attention is the most effective and influential mechanism. Current attention-based methods focus on adequately fusing visual and textual features, but pay little attention to where people focus when they ask questions about the image. Traditional attention-based methods attach a single value to the feature at each spatial location, which loses much useful information. To remedy these problems, we propose a general method to perform saliency-like pre-selection on overlapped region features by the interrelation of bidirectional LSTM (BiLSTM), and use a novel element-wise multiplication based attention method to capture more competent correlation…
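The contrast the abstract draws, between attaching a single value to each spatial location and using element-wise multiplication, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the dot-product scoring, the sigmoid gate, and the toy dimensions are all assumptions chosen only to show the difference in shape between the two kinds of weighting.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def scalar_attention(regions, question):
    """Conventional attention: one scalar weight per region.

    regions:  (num_regions, dim) image region features
    question: (dim,) question embedding
    """
    scores = regions @ question      # (num_regions,) one score per region
    weights = softmax(scores)        # weights sum to 1 over regions
    attended = weights @ regions     # (dim,) weighted sum of region features
    return attended, weights

def elementwise_attention(regions, question):
    """Illustrative element-wise variant: one gate per feature dimension.

    The sigmoid gate is a hypothetical choice for this sketch,
    not something taken from the paper.
    """
    interaction = regions * question            # (num_regions, dim), broadcast
    gates = 1.0 / (1.0 + np.exp(-interaction))  # each gate lies in (0, 1)
    attended = (gates * regions).sum(axis=0)    # (dim,)
    return attended, gates

# Toy inputs: 5 regions with 8-dimensional features (arbitrary sizes).
rng = np.random.default_rng(0)
regions = rng.standard_normal((5, 8))
question = rng.standard_normal(8)

_, weights = scalar_attention(regions, question)
_, gates = elementwise_attention(regions, question)
print(weights.shape, gates.shape)
```

In the scalar case each region is scaled as a whole, so every feature dimension of that region receives the same weight. In the element-wise case the weighting has the same shape as the region features, which is one way to read the abstract's point that a single value per location discards information.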

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory