RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning

Suhang Hu; Wei Hu; Yuhang Su; Fan Zhang

arXiv:2508.13229 · cs.LG · September 16, 2025

Open Access

TL;DR

RISE is a two-stage self-supervised framework that improves vision-language models' reasoning and annotation accuracy on complex tasks by generating and leveraging high-quality, verified chains of thought.

Contribution

The paper introduces RISE, a novel self-supervised approach that enhances VLM reasoning and annotation through reinforcement learning and high-quality reasoning data.

Findings

01. Outperforms SFT and Visual-RFT on complex annotation tasks

02. Produces interpretable reasoning and accurate annotations

03. Achieves robust performance and explainability

Abstract

Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT…
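The "annotation-reasoning-annotation" closed loop described for the Reason stage can be illustrated with a minimal sketch. This is not the paper's reward function: the names `leaks_annotation`, `closed_loop_reward`, and `toy_reconstruct` are hypothetical, the leakage test is a naive substring match, and the keyword-based reconstructor stands in for a model that would predict the annotation from the CoT alone. It only shows the idea that a CoT earns reward when the original annotation can be recovered from it without the CoT stating the annotation outright.

```python
from typing import Callable


def leaks_annotation(cot: str, annotation: str) -> bool:
    """Naive leakage check (assumption): the CoT contains the label verbatim."""
    return annotation.strip().lower() in cot.lower()


def closed_loop_reward(
    cot: str,
    original_annotation: str,
    reconstruct: Callable[[str], str],
) -> float:
    """Hypothetical binary reward for a candidate chain of thought.

    Returns 1.0 only if the CoT does not leak the annotation and the
    annotation reconstructed from the CoT matches the original one.
    """
    if leaks_annotation(cot, original_annotation):
        return 0.0
    predicted = reconstruct(cot)
    matches = predicted.strip().lower() == original_annotation.strip().lower()
    return 1.0 if matches else 0.0


def toy_reconstruct(cot: str) -> str:
    """Keyword stand-in for a model that maps a CoT back to an annotation."""
    text = cot.lower()
    if "smil" in text or "laugh" in text:
        return "happy"
    if "tears" in text or "frown" in text:
        return "sad"
    return "neutral"


if __name__ == "__main__":
    grounded = "The person is smiling broadly with raised cheeks."
    leaky = "The person is clearly happy."
    print(closed_loop_reward(grounded, "happy", toy_reconstruct))
    print(closed_loop_reward(leaky, "happy", toy_reconstruct))
```

In this toy setup a visually grounded description is rewarded, while a CoT that simply names the label, or one from which the label cannot be recovered, is not. A real system would replace both the leakage test and the reconstructor with model-based components.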

Taxonomy

Topics: Retinal Imaging and Analysis