Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning
Tingyu Li, Zheng Sun, Jingxuan Wei, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan

TL;DR
This paper introduces DoGe, a dual-decoupling framework that enhances vision-language reasoning in data-scarce domains by focusing on context learning and curriculum-based data evolution, improving model stability and generalization.
Contribution
The paper proposes a novel dual-decoupling RL framework and an evolving curriculum learning pipeline to improve self-evolving vision-language models in specialized, data-scarce domains.
Findings
Outperforms baseline methods across multiple benchmarks.
Enhances model stability and reasoning accuracy.
Facilitates scalable self-evolving vision-language models.
Abstract
Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL), which provides a feasible solution for realizing continuous self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, especially challenging in specialized domains like chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to first learn from context rather than problem solving by refocusing on the problem context scenarios overlooked by synthetic data…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Language and cultural evolution
