Retrieval-Augmented Natural Language Reasoning for Explainable Visual Question Answering

Su Hyeon Lim; Minkuk Kim; Hyeon Bae Kim; Seong Tae Kim

arXiv:2408.17006·cs.CV·September 2, 2024

TL;DR

This paper introduces ReRe, a retrieval-augmented model for VQA-NLE that improves answer accuracy and explanation quality by leveraging retrieval information, avoiding complex networks and additional datasets.

Contribution

ReRe is a novel retrieval-augmented encoder-decoder model that enhances VQA-NLE performance without relying on complex architectures or extra data.

Findings

01 ReRe outperforms previous methods in VQA accuracy.
02 ReRe improves explanation scores and reliability.
03 ReRe demonstrates better reasoning capabilities.

Abstract

The Visual Question Answering with Natural Language Explanation (VQA-NLE) task is challenging due to its high demand for reasoning-based inference. Recent VQA-NLE studies focus on enhancing model networks to amplify the model's reasoning capability, but this approach is resource-consuming and unstable. In this work, we introduce a new VQA-NLE model, ReRe (Retrieval-augmented natural language Reasoning), which leverages retrieval information from the memory to aid in generating accurate answers and persuasive explanations without relying on complex networks and extra datasets. ReRe is an encoder-decoder model that uses a pre-trained CLIP vision encoder and a pre-trained GPT-2 language model as a decoder. Cross-attention layers are added to GPT-2 to process retrieval features. ReRe outperforms previous methods in VQA accuracy and explanation score and shows improvement in NLE…
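The retrieve-then-attend idea in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The abstract does not say how memory entries are scored, so cosine similarity with top-k selection is an assumption here. The attention step is single-head and has no learned projections. The 3-dimensional vectors stand in for CLIP features, and the names `retrieve_top_k`, `cross_attention`, and the payload strings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query, memory, k=2):
    """Return the k memory entries most similar to the query.

    memory is a list of (feature_vector, payload) pairs.
    Assumption: similarity is cosine; the abstract does not specify this.
    """
    ranked = sorted(memory, key=lambda item: cosine(query, item[0]), reverse=True)
    return ranked[:k]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, retrieved_feats):
    """Single-head scaled dot-product attention.

    Each decoder state is a query; the retrieved features serve as both
    keys and values. No learned projection matrices are used (a
    simplification of a real cross-attention layer).
    """
    d = len(retrieved_feats[0])
    attended = []
    for q in decoder_states:
        scores = [
            sum(a * b for a, b in zip(q, key)) / math.sqrt(d)
            for key in retrieved_feats
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * val[j] for w, val in zip(weights, retrieved_feats)) for j in range(d)]
        )
    return attended


# Toy memory: placeholder feature vectors with placeholder payloads.
memory = [
    ([1.0, 0.0, 0.0], "sample_a"),
    ([0.0, 1.0, 0.0], "sample_b"),
    ([0.9, 0.1, 0.0], "sample_c"),
]
query = [1.0, 0.0, 0.0]  # stands in for the encoded input image

hits = retrieve_top_k(query, memory, k=2)
retrieved_feats = [feat for feat, _ in hits]

# One decoder hidden state attending over the two retrieved features.
attended = cross_attention([[1.0, 0.0, 0.0]], retrieved_feats)
```

In a full model the attended vectors would feed the decoder's next layers. Here they only show that each decoder state receives a similarity-weighted mixture of the retrieved features.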

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling

Methods: Linear Layer · Multi-Head Attention · Cosine Annealing · Byte Pair Encoding · Softmax · Dropout · Adam · Layer Normalization · Weight Decay