Sherlock: Self-Correcting Reasoning in Vision-Language Models
Yi Ding, Ruqi Zhang

TL;DR
Sherlock introduces a self-correcting training framework for vision-language models, enabling them to improve reasoning accuracy with minimal annotated data and outperform existing methods across multiple benchmarks.
Contribution
The paper presents Sherlock, a novel self-correction and self-improvement framework for reasoning VLMs, reducing reliance on large annotated datasets and enhancing generalization.
Findings
Achieves an average accuracy of 64.1 with direct generation and 65.4 after self-correction.
Outperforms existing models like LLaVA-CoT, Mulberry, and LlamaV-o1.
Uses less than 20% of the annotated data required by comparable methods.
Abstract
Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic β for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated…
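The abstract mentions a dynamic coefficient for preference tuning but, as excerpted here, does not say how it is computed. As a rough illustration only, the sketch below assumes a DPO-style pairwise loss (an assumption, not confirmed by this excerpt) and lets the coefficient vary per preference pair instead of being a single fixed constant. The function names `dpo_pair_loss` and `batch_loss`, and the example numbers, are hypothetical; how Sherlock actually chooses each value is not shown.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta):
    """DPO-style loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin)).

    Assumption: the preference objective is DPO-like. Inputs are sequence
    log-probabilities under the policy and under a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)) = log(1 + exp(-x)), evaluated in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def batch_loss(pairs, betas):
    """Mean loss over pairs, with one coefficient per pair.

    `betas` is supplied by the caller; the rule for choosing each value
    is deliberately left out because the excerpt does not specify it.
    """
    if len(pairs) != len(betas):
        raise ValueError("need exactly one beta per preference pair")
    total = sum(dpo_pair_loss(*pair, beta) for pair, beta in zip(pairs, betas))
    return total / len(pairs)


if __name__ == "__main__":
    # Hypothetical log-probabilities: (policy chosen, policy rejected,
    # reference chosen, reference rejected).
    pairs = [(-10.0, -14.0, -11.0, -13.0), (-8.0, -8.5, -8.0, -8.0)]
    print(batch_loss(pairs, betas=[0.1, 0.5]))
```

With a fixed coefficient every pair is weighted the same way; passing a list makes it possible to sharpen or soften the preference signal for individual pairs, which is the general idea a per-sample coefficient serves.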
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
