Sherlock: Self-Correcting Reasoning in Vision-Language Models

Yi Ding; Ruqi Zhang

arXiv:2505.22651 · cs.CV · October 24, 2025

TL;DR

Sherlock introduces a self-correcting training framework for vision-language models, enabling them to improve reasoning accuracy with minimal annotated data and outperform existing methods across multiple benchmarks.

Contribution

The paper presents Sherlock, a novel self-correction and self-improvement framework for reasoning VLMs, reducing reliance on large annotated datasets and enhancing generalization.

Findings

01

Achieves an average accuracy of 64.1 across multiple benchmarks with direct generation, rising to 65.4 after self-correction.

02

Outperforms existing models like LLaVA-CoT, Mulberry, and LlamaV-o1.

03

Uses less than 20% of the annotated data required by comparable methods.

Abstract

Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $β$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated…
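The abstract mentions a dynamic $β$ for preference tuning but this page does not give its formula. As a rough illustration only, the sketch below computes a standard DPO-style preference loss for one pair of responses and takes $β$ as a per-example argument, so that any schedule could be plugged in. The function names, the argument layout, and the choice of a DPO-style loss are assumptions for illustration; they are not taken from the paper.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def preference_loss(policy_chosen_logp: float,
                    policy_rejected_logp: float,
                    ref_chosen_logp: float,
                    ref_rejected_logp: float,
                    beta: float) -> float:
    """DPO-style loss for one (chosen, rejected) pair.

    Hypothetical helper: `beta` is supplied per example, which is one
    way a "dynamic" beta could enter the loss. How Sherlock actually
    sets beta is not stated on this page.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = chosen_ratio - rejected_ratio
    # -log(sigmoid(beta * margin)) == softplus(-beta * margin)
    return softplus(-beta * margin)


if __name__ == "__main__":
    # Same pair scored under two different beta values.
    for beta in (0.1, 0.5):
        loss = preference_loss(-1.0, -3.0, -2.0, -2.0, beta)
        print(f"beta={beta}: loss={loss:.4f}")
```

With a positive margin (the policy already prefers the chosen response more than the reference does), a larger $β$ gives a smaller loss; with a zero margin the loss is $\log 2$ regardless of $β$.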

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)