Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models
Zhenchen Tang, Songlin Yang, Zichuan Wang, Bo Peng, Yang Li, Beibei Dong, Jing Dong

TL;DR
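This paper introduces Endogenous Reprompting, a mechanism by which a unified multimodal model generates self-aligned descriptors to guide its own generation, and SEER, a training framework that teaches the model to evaluate and reprompt itself using only 300 training samples. The authors report gains in accuracy and generation quality over baselines.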
Contribution
It presents the Endogenous Reprompting mechanism and SEER (Self-Evolving Evaluator and Reprompter), a two-stage training framework whose first stage uses Reinforcement Learning with Verifiable Rewards (RLVR) to activate the model's latent self-evaluation ability.
Findings
SEER outperforms state-of-the-art baselines in accuracy.
SEER improves reprompting efficiency and generation quality.
The approach requires only 300 samples for training.
Abstract
Unified Multimodal Models (UMMs) exhibit strong understanding, yet this capability often fails to effectively guide generation. We identify this as a Cognitive Gap: the model lacks the understanding of how to enhance its own generation process. To bridge this gap, we propose Endogenous Reprompting, a mechanism that transforms the model's understanding from a passive encoding process into an explicit generative reasoning step by generating self-aligned descriptors during generation. To achieve this, we introduce SEER (Self-Evolving Evaluator and Reprompter), a training framework that establishes a two-stage endogenous loop using only 300 samples from a compact proxy task, Visual Instruction Elaboration. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model's latent evaluation ability via curriculum learning, producing a high-fidelity endogenous reward signal.…
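To make the idea of reprompting concrete, here is a minimal Python sketch of a single model object that elaborates an instruction, scores the result, and then generates from it. Everything in it is a hypothetical stand-in: `ToyUnifiedModel`, `toy_elaborate`, `toy_score`, `toy_generate`, and `endogenous_reprompt` are illustrative names, the scoring rule is a toy heuristic, and picking the best of several candidates at inference time is an assumption made for illustration. The abstract describes the evaluator as a reward signal for training and does not specify this selection step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyUnifiedModel:
    """Hypothetical stand-in for a UMM: one object plays all three roles."""
    elaborate: Callable[[str], List[str]]  # reprompter: instruction -> candidate descriptors
    score: Callable[[str, str], float]     # evaluator: (instruction, descriptor) -> reward
    generate: Callable[[str], str]         # generator: descriptor -> output (a string here)


def toy_elaborate(instruction: str) -> List[str]:
    # Assumption: the reprompter proposes the raw instruction plus one richer variant.
    return [instruction, f"{instruction}, soft morning light, shallow depth of field"]


def toy_score(instruction: str, descriptor: str) -> float:
    # Toy heuristic: reward detail, but only if the original instruction is preserved.
    if instruction not in descriptor:
        return 0.0
    return float(len(descriptor.split()))


def toy_generate(descriptor: str) -> str:
    # A real model would produce an image; a tagged string keeps the sketch runnable.
    return f"<image: {descriptor}>"


def endogenous_reprompt(model: ToyUnifiedModel, instruction: str) -> Tuple[str, str]:
    """Elaborate the instruction, keep the highest-scoring descriptor, generate from it."""
    candidates = model.elaborate(instruction)
    best = max(candidates, key=lambda d: model.score(instruction, d))
    return best, model.generate(best)


toy = ToyUnifiedModel(elaborate=toy_elaborate, score=toy_score, generate=toy_generate)
descriptor, output = endogenous_reprompt(toy, "a red bicycle")
```

The point of the sketch is only that the elaboration and the score come from the same model that generates, rather than from an external prompt rewriter or judge.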
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
