CPPO: Contrastive Perception for Vision Language Policy Optimization
Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain, Zhou Weimin, Shunbo Zhou, Yong Zhang, Mohammad Akbari

TL;DR
CPPO is a reinforcement learning method for fine-tuning vision-language models. It adds a Contrastive Perception Loss to the RL objective to improve perception without requiring extra models, which the authors report yields better performance and more scalable training.
Contribution
Introduces CPPO, a perception-focused RL method that detects perception tokens via entropy shifts in the model outputs under perturbed input images and applies a contrastive loss to those tokens, avoiding additional models and improving training efficiency.
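The entropy-shift idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `top_frac` selection rule, and the use of raw per-token logits from two forward passes (clean image vs. perturbed image) are all assumptions made for illustration.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits: array of shape (T, V) = (tokens, vocabulary size).
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def perception_token_mask(logits_clean, logits_perturbed, top_frac=0.25):
    """Flag the positions whose entropy rises most when the image is perturbed.

    ASSUMPTION: selecting the top `top_frac` of positions is a hypothetical
    rule chosen for this sketch; the source only says that perception tokens
    are detected via entropy shifts.
    """
    shift = token_entropy(logits_perturbed) - token_entropy(logits_clean)
    k = max(1, int(round(top_frac * shift.shape[0])))
    mask = np.zeros(shift.shape[0], dtype=bool)
    mask[np.argsort(shift)[-k:]] = True  # indices of the k largest shifts
    return mask, shift

# Toy example: 4 tokens, vocabulary of 5. All tokens are confident on the
# clean image; token 2 becomes uniform (uncertain) once the image is perturbed.
clean = np.zeros((4, 5))
clean[:, 0] = 10.0
perturbed = clean.copy()
perturbed[2] = 0.0
mask, shift = perception_token_mask(clean, perturbed)
```

In this toy case only token 2 is flagged, since it is the only position whose predictive distribution changes when the visual input is perturbed.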
Findings
CPPO outperforms previous perception-rewarding methods.
It avoids the need for extra models, making training more scalable.
CPPO enhances perception accuracy in vision-language models.
Abstract
We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and…
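A token-level contrastive loss of the kind the abstract describes could look roughly like the InfoNCE sketch below. Several pieces are assumptions, not details confirmed by the source: the per-token vectors (`anchor`, `positive`, `negatives`), the cosine similarity, the `temperature` value, and the choice to draw negatives from a hypothetical perturbation that removes visual information. The `mask` argument stands for whichever positions have been flagged as perception tokens.

```python
import numpy as np

def info_nce_per_token(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss at each token position.

    anchor, positive: (T, D) per-token vectors from the original image and
    from an information-preserving perturbation of it.
    negatives: (T, K, D) per-token vectors from K hypothetical perturbations
    that destroy visual information (ASSUMPTION for this sketch).
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = unit(anchor), unit(positive), unit(negatives)
    pos_sim = (a * p).sum(axis=-1, keepdims=True)      # (T, 1) cosine similarity
    neg_sim = np.einsum("td,tkd->tk", a, n)            # (T, K)
    logits = np.concatenate([pos_sim, neg_sim], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob_pos                               # (T,)

def contrastive_perception_loss(anchor, positive, negatives, mask, temperature=0.1):
    """Average the per-token InfoNCE loss over the flagged perception tokens."""
    if not mask.any():
        return 0.0
    per_token = info_nce_per_token(anchor, positive, negatives, temperature)
    return float(per_token[mask].mean())

# Toy check: the loss is small when the positive matches the anchor and the
# negative points the opposite way, and large when the roles are swapped.
rng = np.random.default_rng(0)
anchor = rng.normal(size=(3, 8))
mask = np.array([True, False, True])
good = contrastive_perception_loss(anchor, anchor, -anchor[:, None, :], mask)
bad = contrastive_perception_loss(anchor, -anchor, anchor[:, None, :], mask)
```

Restricting the average to masked positions mirrors the stated goal of applying the loss to perception tokens only, leaving reasoning tokens to the base RL objective.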
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
# Strengths:
1. Entropy-based selection of perception-dependent tokens plus an unsupervised token-level InfoNCE loss is a neat way to target visual grounding without architectural surgery or stepwise supervision. The paper clearly explains the anchor/positive/negative construction and why token-level contrast makes sense for perception.
2. CPPO’s CPL does not require additional CoT supervision or proprietary models, improving simplicity and scalability.
3. On Qwen2.5-VL-3B/7B, CPPO outperform
# Weaknesses:
1. The entropy selection is heuristic. Using entropy increase as a proxy for perception dependence is intuitive but heuristic; some non-perception tokens could also exhibit entropy shifts. A deeper discussion of limitations/failure cases (show some demos) would strengthen the claim.
2. An algorithm box/flow diagram of the training loop would improve clarity. Overall, I feel the current version is a little bit hard to follow.
3. Since CPL targets perception tokens, analyze whethe
1. **Significant Problem Formulation:** The paper clearly articulates a fundamental and important problem in multimodal learning. Differentiating between perception and reasoning failures is crucial for building more robust and interpretable VLMs, and the authors provide a clear motivation for why this is a limitation in current RL-based fine-tuning paradigms.
2. **Novel Application of Contrastive Learning:** While the detection method may not be new, the application of a contrastive loss spe
1. **Inherited Flaws of a Non-Novel Detection Method:** The core idea of using input perturbations to identify salient model dependencies is a well-established principle in the model interpretability literature (e.g., similar concepts in POVID [1], VCD [2], SeVA [3], and more recently VPPO [4], [5]). The paper presents this as a novel contribution for VLM-RL, yet fails to acknowledge this prior art or, more importantly, address the known, fundamental issues with such approaches.
- The paper presents a well-defined method for locating perception tokens through entropy analysis and focuses optimization only on those regions, reducing unnecessary regularization on reasoning-related outputs.
- The auxiliary loss introduces only one weighting coefficient and applies advantage gating to stabilize training, showing good compatibility with existing RL pipelines.
- CPPO consistently improves over GRPO and other perception-aware baselines. Ablation studies confirm the benefits of
- While the perturbation design is central to CPPO’s effectiveness, the study only evaluates a specific configuration without examining how varying perturbation categories or intensity levels affect the learned perception robustness.
- CPL appears orthogonal to GRPO, raising the question of whether it can be used as a standalone objective. It would strengthen the paper if the authors could include an experiment evaluating CPL in isolation (without GRPO) to verify whether it independently contrib
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
