Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding
Junhan Chen, Zilu Zhou, Yujun Tong, Dongliang Chang, Yitao Luo, and Zhanyu Ma

TL;DR
This paper introduces KFRA, a knowledge-augmented reasoning agent that enhances open-set fine-grained visual understanding by integrating retrieval, localization, and multimodal reasoning for more accurate and interpretable results.
Contribution
KFRA is a novel unified framework that couples retrieval and reasoning processes, enabling factual and interpretable analysis in diverse fine-grained visual tasks.
Findings
KFRA outperforms existing models, with reasoning accuracy up to 19% higher.
FGExpertBench, constructed as part of this work, provides a new standard for evaluating reasoning depth.
KFRA demonstrates strong generalization across multiple knowledge dimensions.
Abstract
Fine-grained visual understanding is shifting from static classification to knowledge-augmented reasoning, where models must justify as well as recognise. Existing approaches remain limited by closed-set taxonomies and single-label prediction, which leads to significant degradation under open-set or context-dependent conditions. We present the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA), a unified framework that transforms fine-grained perception into evidence-driven reasoning. KFRA operates through a three-stage closed reasoning loop that emulates expert analysis. It first performs open-vocabulary detection and web-scale retrieval to generate category hypotheses. It then localises discriminative regions by aligning textual knowledge with visual evidence through a global-to-local focusing mechanism. Finally, it integrates all multimodal evidence within a large…
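The three-stage loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Evidence`, `TOY_KNOWLEDGE`, `propose_hypotheses`, `localise`, `reason`, and `analyse` are hypothetical, the "image" is reduced to a set of already-detected attribute strings, web-scale retrieval is replaced by a small in-memory dictionary, and the final reasoning stage is replaced by a simple support-ratio score.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    hypothesis: str                 # candidate category name
    facts: list                     # textual cues retrieved for that category
    regions: list = field(default_factory=list)  # cues confirmed in the image

# Hypothetical stand-in for web-scale retrieval: category -> discriminative cues.
TOY_KNOWLEDGE = {
    "indigo bunting": ["blue plumage", "conical bill"],
    "blue grosbeak": ["blue plumage", "chestnut wing bars", "thick bill"],
}

def propose_hypotheses(detected):
    """Stage 1 (assumed form): keep categories sharing any cue with the detections."""
    return [Evidence(name, cues)
            for name, cues in TOY_KNOWLEDGE.items()
            if detected & set(cues)]

def localise(evidence, detected):
    """Stage 2 (assumed form): align textual cues with what is visible."""
    evidence.regions = [cue for cue in evidence.facts if cue in detected]
    return evidence

def reason(candidates):
    """Stage 3 (assumed form): pick the hypothesis with the highest cue support."""
    best = max(candidates, key=lambda e: len(e.regions) / len(e.facts))
    return best.hypothesis, best.regions

def analyse(detected):
    """Run the three stages in sequence; return (label, supporting cues)."""
    candidates = [localise(e, detected) for e in propose_hypotheses(detected)]
    if not candidates:
        return None, []             # open-set case: nothing in the toy knowledge matches
    return reason(candidates)

if __name__ == "__main__":
    print(analyse({"blue plumage", "chestnut wing bars", "thick bill"}))
```

The point of the sketch is the data flow: each stage adds to a shared evidence record, so the final answer comes with the cues that support it rather than a bare label.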
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
