Retrieval-Augmented Natural Language Reasoning for Explainable Visual Question Answering
Su Hyeon Lim, Minkuk Kim, Hyeon Bae Kim, Seong Tae Kim

TL;DR
This paper introduces ReRe, a retrieval-augmented model for VQA-NLE that improves answer accuracy and explanation quality by leveraging retrieval information, avoiding complex networks and additional datasets.
Contribution
ReRe is a novel retrieval-augmented encoder-decoder model that enhances VQA-NLE performance without relying on complex architectures or extra data.
Findings
ReRe outperforms previous methods in VQA accuracy.
ReRe improves explanation scores and reliability.
ReRe demonstrates better reasoning capabilities.
Abstract
The Visual Question Answering with Natural Language Explanation (VQA-NLE) task is challenging because it demands reasoning-based inference. Recent VQA-NLE studies focus on enhancing model networks to amplify the model's reasoning capability, but this approach is resource-consuming and unstable. In this work, we introduce a new VQA-NLE model, ReRe (Retrieval-augmented natural language Reasoning), which leverages retrieval information from a memory to aid in generating accurate answers and persuasive explanations without relying on complex networks or extra datasets. ReRe is an encoder-decoder model that uses a pre-trained CLIP vision encoder and a pre-trained GPT-2 language model as the decoder. Cross-attention layers are added to GPT-2 to process retrieval features. ReRe outperforms previous methods in VQA accuracy and explanation score and shows improvement in NLE…
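The retrieval step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the memory is built or which similarity measure is used, so everything here is an assumption for illustration: the memory layout (feature vector, answer, explanation), the use of cosine similarity, and the names `MemoryEntry`, `cosine`, and `retrieve` are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical memory entry: (feature vector, answer, explanation).
# The real memory format used by ReRe is not given in the abstract.
MemoryEntry = Tuple[Sequence[float], str, str]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float], memory: List[MemoryEntry], k: int = 2) -> List[MemoryEntry]:
    """Return the k memory entries most similar to the query, best match first."""
    ranked = sorted(memory, key=lambda entry: cosine(query, entry[0]), reverse=True)
    return ranked[:k]


# Toy 3-dimensional vectors standing in for image-question embeddings
# (assumed here; a real system would use encoder outputs).
memory = [
    ([1.0, 0.0, 0.0], "yes", "the dog is holding a frisbee"),
    ([0.0, 1.0, 0.0], "red", "the bus is painted red"),
    ([0.9, 0.1, 0.0], "yes", "the man is catching a ball"),
]
neighbors = retrieve([1.0, 0.05, 0.0], memory, k=2)
```

In a model of the kind the abstract describes, the retrieved neighbors' features would then be passed to the decoder through its cross-attention layers; that part is omitted here because the abstract gives no further detail.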
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling
Methods: Linear Layer · Multi-Head Attention · Cosine Annealing · Byte Pair Encoding · Softmax · Dropout · Adam · Layer Normalization · Weight Decay
