Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints

Xisheng Feng

arXiv:2512.00882·cs.CV·December 4, 2025

Open Access

TL;DR

This paper introduces a framework in which a vision-language model generates its own knowledge hints. By actively eliciting relevant knowledge already encoded in the model's parameters, the framework improves performance in specialized domains and reduces hallucinations while keeping the backbone model frozen.

Contribution

The proposed 'Look, Recite, Then Answer' framework enhances VLMs with self-generated knowledge hints by decoupling inference into three stages (Look, Recite, Answer), improving accuracy on domain-specific tasks such as Weed Identification on AgroBench.

Findings

01. Achieved a 23.52% improvement in Weed Identification accuracy on AgroBench.

02. Surpassed GPT-4o performance without external search overhead.

03. Mitigated hallucinations through active knowledge retrieval.

Abstract

Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to "Reasoning-Driven Hallucination" where linguistic priors override visual perception. A key bottleneck is the "Modality Gap": visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose "Look, Recite, Then Answer," a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited…
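
The three-stage control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names (`look_recite_answer`, `token_overlap`, the `stub_*` helpers), the callable signatures, and the toy strings are all hypothetical. In particular, the abstract is truncated before it specifies how the Answer stage aligns evidence, so the word-overlap score below is only a stand-in assumption. In a real system the `look` and `recite` callables would presumably wrap the frozen VLM and `route` the lightweight 1.7B router.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LookOutput:
    description: str       # objective visual description of the image
    candidates: List[str]  # candidate labels proposed from the image


def look_recite_answer(
    image: object,
    look: Callable[[object], LookOutput],
    route: Callable[[str, str], str],
    recite: Callable[[str], str],
    align: Callable[[str, str], float],
) -> str:
    """Run the three stages and return the best-supported candidate."""
    # Stage 1 (Look): visual description plus candidate set.
    seen = look(image)
    if not seen.candidates:
        raise ValueError("Look stage produced no candidates")

    # Stage 2 (Recite): one targeted query per candidate, answered from
    # the model's own parameters (no external search).
    hints: Dict[str, str] = {}
    for cand in seen.candidates:
        query = route(seen.description, cand)
        hints[cand] = recite(query)

    # Stage 3 (Answer): score each recited hint against the description.
    # ASSUMPTION: the scoring rule is supplied by the caller; the source
    # does not specify it in the text available here.
    scores = {c: align(seen.description, h) for c, h in hints.items()}
    return max(scores, key=scores.get)


def token_overlap(description: str, hint: str) -> float:
    """Toy alignment score (assumption): Jaccard overlap of word sets."""
    a, b = set(description.lower().split()), set(hint.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


# --- Hypothetical stubs so the sketch runs without any model ---
# The strings below are made-up toy data, not botanical facts.
TOY_KNOWLEDGE = {
    "species_a": "broad leaves purple stems",
    "species_b": "narrow blades low growth",
}


def stub_look(image: object) -> LookOutput:
    return LookOutput("broad leaves purple stems", list(TOY_KNOWLEDGE))


def stub_route(description: str, candidate: str) -> str:
    return f"Describe the visual traits of {candidate}."


def stub_recite(query: str) -> str:
    for name, text in TOY_KNOWLEDGE.items():
        if name in query:
            return text
    return ""


if __name__ == "__main__":
    print(look_recite_answer(None, stub_look, stub_route, stub_recite, token_overlap))
```

Passing the stages in as callables keeps the sketch independent of any particular model API; swapping the stubs for real model calls would not change the control flow.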

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications