MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Xi Chen; Mingkang Zhu; Shaoteng Liu; Xiaoyang Wu; Xiaogang Xu; Yu Liu; Xiang Bai; Hengshuang Zhao

arXiv:2506.22434·cs.CV·June 30, 2025

Open Access

TL;DR

This paper introduces MiCo, a self-supervised multi-image contrastive learning approach that enhances visual reasoning in vision-language models without requiring human-annotated data, leading to improved multi-image reasoning performance.

Contribution

MiCo leverages inherent visual constraints through triplet-based self-supervised learning to improve reasoning capabilities in VLMs without manual annotations.

Findings

01 Achieves significant improvements on multi-image reasoning benchmarks.

02 Generalizes effectively to various vision tasks.

03 Operates without human-annotated question-answer pairs.

Abstract

This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which are particularly hard to produce when the task involves fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine whether they are the same or different). We then optimize the model with rule-based reinforcement learning. Due to…
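The triplet construction and the rule-based check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The crop-and-flip augmentation, the nested-list image representation, the `<answer>` tag format, and the function names `augment`, `build_triplet`, and `rule_based_reward` are all assumptions made for illustration; the abstract does not specify any of them.

```python
import random
import re


def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(img) - crop_h)
    left = rng.randint(0, len(img[0]) - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]


def augment(img, rng):
    """Hypothetical augmentation: 3/4-size random crop plus an optional horizontal flip."""
    h, w = len(img), len(img[0])
    view = random_crop(img, max(1, h * 3 // 4), max(1, w * 3 // 4), rng)
    if rng.random() < 0.5:
        view = [row[::-1] for row in view]
    return view


def build_triplet(anchor, distractor, rng):
    """Return three labeled image pairs from one anchor image and one similar image.

    Two augmented views of the anchor form the "same" pair; each view paired
    with the distractor (used as given, an assumption) forms a "different" pair.
    The labels come from the construction itself, so no human annotation is needed.
    """
    view_a = augment(anchor, rng)
    view_b = augment(anchor, rng)
    return [
        (view_a, view_b, "same"),
        (view_a, distractor, "different"),
        (view_b, distractor, "different"),
    ]


def rule_based_reward(response, label):
    """Score a model response by string matching (assumed <answer> tag format).

    Returns 1.0 if the tagged answer equals the label, else 0.0. A response
    without a parsable answer also scores 0.0.
    """
    match = re.search(r"<answer>\s*(same|different)\s*</answer>", response, re.IGNORECASE)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).lower() == label else 0.0
```

In this sketch, a response such as `"The textures match. <answer>same</answer>"` earns a reward of 1.0 on a "same" pair and 0.0 on a "different" pair. The reasoning text before the tag is not scored; only the final verdict is checked against the label that the triplet construction provides.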

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis