VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework

Chao Wang, Chunbai Zhang, Yongxiao Tian, Yang Zhou, and Yan Peng

arXiv:2502.00711 · cs.CV · September 3, 2025

TL;DR

VIKSER introduces a visual reasoning framework that leverages knowledge distillation, fine-grained visual knowledge, and self-reflection to improve interpretability and achieve state-of-the-art results on visual question answering datasets.

Contribution

The paper presents VIKSER, a novel framework that combines knowledge distillation, visual relationship detection, and self-reflection for enhanced visual reasoning interpretability and performance.

Findings

01. Achieves new state-of-the-art results on visual reasoning datasets.
02. Performs on par with leading proprietary models like ChatGPT-5.
03. Demonstrates improved interpretability through Chain-of-Evidence prompting.

Abstract

Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability and hindered by underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER utilizes fine-grained visual knowledge to paraphrase the question…

Taxonomy

Topics: Semantic Web and Ontologies · Data Visualization and Analytics