ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model

Juntian Zhang; Song Jin; Chuanqi Cheng; Yuhan Liu; Yankai Lin; Xun Zhang; Yufei Zhang; Fei Jiang; Guojun Yin; Wei Lin; Rui Yan

arXiv:2510.24285 · cs.CV · October 29, 2025

TL;DR

ViPER is a self-bootstrapping framework that improves fine-grained visual perception in vision-language models through iterative self-critiquing and self-prediction, yielding gains across multiple benchmarks.

Contribution

The paper presents a novel two-stage task and a self-evolution framework, ViPER, that significantly improves visual perception in VLMs without sacrificing general capabilities.

Findings

1. Qwen-Viper achieves up to 6.0% improvement on perception benchmarks

2. ViPER demonstrates consistent performance gains across diverse tasks

3. The framework enables autonomous self-improvement of perceptual abilities

Abstract

The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop…
