From Hindsight to Foresight: Self-Encouraged Hindsight Distillation for Knowledge-based Visual Question Answering
Yu Zhao, Ying Zhang, Xuhui Sui, Baohang Zhou, Li Shen, Dacheng Tao

TL;DR
This paper introduces HinD, a framework that strengthens knowledge reasoning in multimodal large language models (MLLMs) for knowledge-based visual question answering through self-distillation and knowledge encouragement, improving performance without external retrieval.
Contribution
It proposes a novel self-encouraged reasoning framework with a Hindsight Teacher and Foresight Student, enabling explicit multi-step reasoning in MLLMs for KBVQA.
Findings
HinD improves KBVQA accuracy on the OK-VQA and A-OKVQA datasets.
The method enables 7-8B MLLMs to outperform some larger models.
HinD does not rely on external knowledge retrieval or commercial APIs.
Abstract
Knowledge-based Visual Question Answering (KBVQA) requires incorporating external knowledge beyond cross-modal understanding. Existing KBVQA methods use either implicit knowledge in multimodal large language models (MLLMs) via in-context learning or explicit knowledge via retrieval-augmented generation. However, their reasoning processes remain implicit, without explicit multi-step trajectories from MLLMs. To address this gap, we provide a Hindsight Distilled Reasoning (HinD) framework with Knowledge Encouragement Preference Optimization, which aims to self-encourage the knowledge reasoning ability inside the MLLM. First, we construct the Hindsight Teacher by prompting the MLLM to complete the reasoning process given the correct answer, obtaining Hindsight-Zero training data. Then, the Foresight Student, without knowing the answer, learns the golden trajectories from…
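The teacher/student split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `VQAExample` record, the `generate` callable standing in for an MLLM, and the output dictionary layout are all hypothetical, chosen only to show the asymmetry between the two roles. The teacher prompt reveals the answer; the student training sample does not.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class VQAExample:
    """Hypothetical record for one KBVQA training item."""
    image_path: str
    question: str
    answer: str


def hindsight_prompt(ex: VQAExample) -> str:
    # Teacher side: the correct answer is revealed, and the model is asked
    # to fill in the reasoning that leads to it. Wording is an assumption.
    return (
        f"Question: {ex.question}\n"
        f"Known answer: {ex.answer}\n"
        "Explain step by step the knowledge and reasoning that lead to this answer."
    )


def foresight_prompt(ex: VQAExample) -> str:
    # Student side: the answer is withheld. Wording is an assumption.
    return (
        f"Question: {ex.question}\n"
        "Reason step by step, then give the answer."
    )


def build_student_sample(
    ex: VQAExample,
    generate: Callable[[str, str], str],
) -> Dict[str, str]:
    """Pair an answer-free prompt with a trajectory produced in hindsight.

    `generate(image_path, prompt)` is a placeholder for any MLLM call.
    """
    trajectory = generate(ex.image_path, hindsight_prompt(ex))
    return {
        "image": ex.image_path,
        "prompt": foresight_prompt(ex),  # student never sees the answer here
        "target": trajectory,            # trajectory written with the answer known
    }
```

In this sketch the same model could serve as both teacher and student, which is one reading of "self-distillation"; the preference-optimization stage mentioned in the abstract is not covered here.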
