II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering

Jihyung Kil; Farideh Tavazoee; Dongyeop Kang; Joo-Kyung Kim

arXiv:2402.11058 · cs.CV · June 4, 2024 · 1 citation

TL;DR

This paper introduces II-MMR, a method to identify and improve multi-modal multi-hop reasoning in VQA. It finds that most questions in current VQA benchmarks need only single-hop reasoning, and it uses two novel prompts to improve reasoning on complex, multi-hop questions.

Contribution

II-MMR proposes new prompts to analyze and improve multi-hop reasoning in VQA, addressing limitations of traditional Chain-of-Thought prompting.

Findings

1. Most VQA questions require only single-hop reasoning.
2. II-MMR effectively improves multi-hop reasoning performance.
3. Traditional CoT struggles with complex multi-hop questions.

Abstract

Visual Question Answering (VQA) often involves diverse reasoning scenarios across Vision and Language (V&L). Most prior VQA studies, however, have focused only on assessing a model's overall accuracy without evaluating it on different reasoning cases. Furthermore, some recent works observe that conventional Chain-of-Thought (CoT) prompting fails to generate effective reasoning for VQA, especially for complex scenarios requiring multi-hop reasoning. In this paper, we propose II-MMR, a novel idea to identify and improve multi-modal multi-hop reasoning in VQA. Specifically, II-MMR takes a VQA question with an image and finds a reasoning path to reach its answer using one of two novel language prompts: (i) an answer prediction-guided CoT prompt, or (ii) a knowledge triplet-guided prompt. II-MMR then analyzes this path to identify different reasoning cases in current VQA benchmarks by estimating…
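To make the idea of a knowledge triplet-guided reasoning path more concrete, here is a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function names `build_triplet_prompt` and `parse_triplets`, the `(subject, relation, object)` line format, and the example model response are all assumptions made for illustration. The abstract does not specify any of them.

```python
from typing import List, Tuple

# A reasoning step represented as (subject, relation, object).
Triplet = Tuple[str, str, str]


def build_triplet_prompt(question: str) -> str:
    """Build a hypothetical triplet-guided prompt (not the paper's wording)."""
    return (
        f"Question: {question}\n"
        "List the knowledge triplets (subject, relation, object) needed to "
        "answer the question, one per line, then give the answer."
    )


def parse_triplets(response: str) -> List[Triplet]:
    """Extract lines of the assumed form '(a, b, c)' from a model response."""
    triplets: List[Triplet] = []
    for line in response.splitlines():
        line = line.strip()
        if line.startswith("(") and line.endswith(")"):
            parts = [p.strip() for p in line[1:-1].split(",")]
            if len(parts) == 3:  # skip malformed lines
                triplets.append((parts[0], parts[1], parts[2]))
    return triplets


# Made-up model response, used only to exercise the parser.
example_response = (
    "(man, holding, umbrella)\n"
    "(umbrella, color, red)\n"
    "Answer: red"
)
path = parse_triplets(example_response)
prompt = build_triplet_prompt("What color is the object the man is holding?")
```

In this sketch, `path` holds the ordered triplets that form the reasoning path. How II-MMR actually analyzes such a path is described in the paper itself.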

Code & Models

Repositories

heendung/ii-mmr (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Language, Metaphor, and Cognition