Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
Xisheng Feng

TL;DR
This paper introduces a self-generated knowledge hint framework for vision-language models (VLMs). It improves performance in specialized domains by actively retrieving relevant knowledge already encoded in the model's parameters, which reduces hallucinations while the backbone model stays frozen.
Contribution
The proposed "Look, Recite, Then Answer" framework enhances VLMs with self-generated knowledge hints. It decouples inference into three stages (Look, Recite, Answer) and significantly improves accuracy on domain-specific tasks.
Findings
Achieved a 23.52% improvement in Weed Identification accuracy on AgroBench.
Surpassed GPT-4o performance without external search overhead.
Mitigated hallucinations through active retrieval of parametric knowledge.
Abstract
Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to "Reasoning-Driven Hallucination" where linguistic priors override visual perception. A key bottleneck is the "Modality Gap": visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose "Look, Recite, Then Answer," a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited…
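The three-stage control flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the names `LookResult`, `look_recite_answer`, `look`, `route`, `recite`, and `align` are hypothetical, the frozen VLM and the 1.7B router are replaced by plain Python callables, and because the abstract is truncated before it explains evidence alignment, `token_overlap` is a toy stand-in scoring rule. The weed names and fact strings are made-up example data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LookResult:
    description: str       # objective visual description of the image
    candidates: List[str]  # candidate labels proposed in the Look stage


def token_overlap(a: str, b: str) -> float:
    """Toy alignment score (assumption): Jaccard overlap of lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def look_recite_answer(
    image: object,
    look: Callable[[object], LookResult],   # stands in for the frozen VLM
    route: Callable[[str, str], str],       # stands in for the lightweight router
    recite: Callable[[str], str],           # stands in for parametric recall
    align: Callable[[str, str], float] = token_overlap,
) -> str:
    # Stage 1 (Look): describe the image and propose candidates.
    seen = look(image)

    # Stage 2 (Recite): turn visual cues into one targeted query per
    # candidate, then collect the knowledge each query triggers.
    recited: Dict[str, str] = {}
    for candidate in seen.candidates:
        query = route(seen.description, candidate)
        recited[candidate] = recite(query)

    # Stage 3 (Answer): score each candidate's recited knowledge against
    # the visual description and return the best-aligned candidate.
    scores = {c: align(seen.description, text) for c, text in recited.items()}
    return max(scores, key=lambda c: scores[c])


# Usage with hand-written stubs in place of real models (toy data).
FACTS = {
    "pigweed": "pigweed has broad leaves and reddish purple stems",
    "crabgrass": "crabgrass has narrow blades and spreading stems",
}


def fake_look(_image: object) -> LookResult:
    return LookResult("broad leaves with purple stems", ["pigweed", "crabgrass"])


def fake_route(description: str, candidate: str) -> str:
    return f"What does {candidate} look like?"


def fake_recite(query: str) -> str:
    # The candidate name is the third word of the query built by fake_route.
    return FACTS[query.split()[2]]


prediction = look_recite_answer(None, fake_look, fake_route, fake_recite)
print(prediction)
```

Passing the stages in as callables mirrors the frozen-backbone idea in the abstract: the orchestration logic never modifies the models, so any of the three components could be swapped without touching the others.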
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
