Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning

Tingyu Li; Zheng Sun; Jingxuan Wei; Siyuan Li; Conghui He; Lijun Wu; Cheng Tan

arXiv:2512.06835·cs.AI·December 9, 2025

Open Access

TL;DR

This paper introduces DoGe, a dual-decoupling framework that enhances vision-language reasoning in data-scarce domains by focusing on context learning and curriculum-based data evolution, improving model stability and generalization.

Contribution

The paper proposes a novel dual-decoupling RL framework and an evolving curriculum learning pipeline to improve self-evolving vision-language models in specialized, data-scarce domains.

Findings

01 Outperforms baseline methods across multiple benchmarks.

02 Enhances model stability and reasoning accuracy.

03 Facilitates scalable self-evolving vision-language models.

Abstract

Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL), which provides a feasible solution for realizing continuous self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, which is especially challenging to obtain in specialized domains like chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to first learn from context rather than problem solving by refocusing on the problem context scenarios overlooked by synthetic data…
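
To make the "collapsing policy entropy" symptom concrete, the sketch below computes the Shannon entropy of two categorical action distributions. It is only an illustration of the quantity the abstract refers to, not part of DoGe: the function name `policy_entropy` and both example distributions are hypothetical and chosen purely for demonstration.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a categorical action distribution."""
    # Terms with zero probability contribute nothing and are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical distributions over four candidate actions (assumed values).
exploring_policy = [0.25, 0.25, 0.25, 0.25]  # probability spread evenly
collapsed_policy = [0.97, 0.01, 0.01, 0.01]  # locked onto one high-reward pattern

h_exploring = policy_entropy(exploring_policy)
h_collapsed = policy_entropy(collapsed_policy)

print(f"exploring policy entropy: {h_exploring:.3f} nats")
print(f"collapsed policy entropy: {h_collapsed:.3f} nats")
```

A policy that has latched onto a single high-reward pattern puts nearly all of its probability on one action, so its entropy falls well below that of a policy that is still exploring. Tracking this value during RL training is one common way to notice the instability the abstract describes.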

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Language and cultural evolution