Task-driven Visual Saliency and Attention-based Visual Question Answering
Yuetan Lin, Zhangyang Pang, Donghui Wang, Yueting Zhuang

TL;DR
This paper introduces an attention mechanism for visual question answering that combines saliency-like pre-selection with a bidirectional LSTM (BiLSTM) to better model interactions between visual and textual features, improving performance.
Contribution
It proposes a new attention method combining saliency pre-selection with BiLSTM and element-wise multiplication for enhanced visual-textual feature fusion in VQA.
Findings
Achieved strong empirical results on the COCO-VQA dataset.
Demonstrated improved attention focus in VQA tasks.
Abstract
Visual question answering (VQA) has witnessed great progress since May 2015 as a classic problem unifying visual and textual data into a system. Many enlightening VQA works explore the image and question encodings and fusing methods in depth, of which attention is the most effective and infusive mechanism. Current attention-based methods focus on adequate fusion of visual and textual features, but pay little attention to where people focus when they ask questions about the image. Traditional attention-based methods attach a single value to the feature at each spatial location, which loses much useful information. To remedy these problems, we propose a general method to perform saliency-like pre-selection on overlapped region features by the interrelation of bidirectional LSTM (BiLSTM), and use a novel element-wise multiplication based attention method to capture more competent correlation…
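The contrast the abstract draws, between attaching a single value to each spatial location and interacting with features element-wise, can be illustrated with a small sketch. This is not the authors' model: the function names, the dot-product scoring, and the toy vectors are all assumptions chosen for illustration, and the BiLSTM pre-selection step is omitted.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_attention(regions, question):
    # Hypothetical baseline: one scalar weight per region.
    # Score = dot product with the question vector, then softmax,
    # then a weighted sum of the region features.
    scores = [sum(r * q for r, q in zip(region, question)) for region in regions]
    weights = softmax(scores)
    dim = len(question)
    attended = [sum(w * region[d] for w, region in zip(weights, regions))
                for d in range(dim)]
    return weights, attended

def elementwise_interaction(regions, question):
    # Hypothetical alternative: Hadamard product of each region feature
    # with the question feature, keeping one value per feature dimension
    # instead of collapsing each region to a single scalar.
    return [[r * q for r, q in zip(region, question)] for region in regions]

# Toy inputs (assumed): two region features and one question feature, all 3-d.
regions = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
question = [0.5, 2.0, 1.0]

weights, attended = scalar_attention(regions, question)
fused = elementwise_interaction(regions, question)
```

In the scalar variant each region is summarized by one weight, so dimension-level agreement between region and question is lost; the element-wise variant keeps a full vector per region that a later layer could score or pool.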
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
