TL;DR
EVE is a framework for self-evolving multimodal large language models (MLLMs) that uses executable visual transformations to generate verifiable training data, avoiding pseudo-labels while continually increasing the diversity and difficulty of the training tasks.
Contribution
EVE proposes a Challenger-Solver dual-policy architecture that synthesizes executable visual transformations paired with verified ground-truth answers, so self-evolution no longer depends on pseudo-labels or a fixed set of templates.
Findings
EVE outperforms existing self-evolution approaches in experiments.
The framework effectively maintains diversity and difficulty in training tasks.
EVE's approach ensures verifiable supervision without relying on model predictions.
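The idea of verifiable supervision from an executable transformation can be illustrated with a minimal sketch. This is not EVE's implementation: the rotation task, the grid-of-integers "image", and the names `rotate90`, `make_task`, and `verify` are all assumptions chosen for illustration. The point is only that the label comes from the transformation's parameters and can be checked by re-running the code, not from a model prediction.

```python
import random


def rotate90(grid, k):
    """Rotate a 2D grid (list of rows) clockwise by k quarter-turns."""
    for _ in range(k % 4):
        # Reverse the row order, then transpose: one clockwise quarter-turn.
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid


def make_task(image, rng):
    """Hypothetical task generator: apply a random rotation and record it.

    The ground-truth answer is the parameter of the executed transformation,
    so no model output is needed to label the example.
    """
    k = rng.randrange(1, 4)  # 1, 2, or 3 quarter-turns
    return {
        "original": image,
        "transformed": rotate90(image, k),
        "question": "How many clockwise quarter-turns were applied?",
        "answer": k,
    }


def verify(task, predicted_k):
    """Check a prediction deterministically by re-executing the transformation."""
    return rotate90(task["original"], predicted_k) == task["transformed"]


# Toy "image" with no rotational symmetry, so exactly one answer is correct.
image = [[1, 2, 3],
         [4, 5, 6]]
task = make_task(image, random.Random(0))
print(verify(task, task["answer"]))
```

Checking by re-execution, instead of comparing against the stored answer, also accepts any prediction that reproduces the same transformed image, which matters for inputs with rotational symmetry.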
Abstract
Self-evolution of multimodal large language models (MLLMs) remains a critical challenge: pseudo-label-based methods suffer from progressive quality degradation as model predictions drift, while template-based methods are confined to a static set of transformations that cannot adapt in difficulty or diversity. We contend that robust, continuous self-improvement requires not only deterministic external feedback independent of the model's internal certainty, but also a mechanism to perpetually diversify the training distribution. To this end, we introduce EVE (Executable Visual transformation-based self-Evolution), a novel framework that entirely bypasses pseudo-labels by harnessing executable visual transformations continuously enriched in both variety and complexity. EVE adopts a Challenger-Solver dual-policy architecture. The Challenger maintains and progressively expands a queue of…
