R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning
Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao

TL;DR
This paper introduces RC2, a reinforcement learning framework that enforces cycle consistency across modalities to improve multimodal reasoning accuracy without relying on labeled data.
Contribution
The paper proposes a novel cycle-consistent reinforcement learning approach that aligns internal representations across modalities, enhancing reasoning performance.
Findings
Improves reasoning accuracy by up to 7.6 points.
Enables autonomous alignment of internal representations.
Reduces modality-specific errors.
Abstract
Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce RC2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and…
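The cycle described in the abstract (backward inference, a modality switch, then forward inference that must reconstruct the answer) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name cycle_consistency_reward, the three callables, and the toy arithmetic stand-ins are all hypothetical, and the binary exact-match reward is a simplification of whatever reward shaping the paper actually uses.

```python
from typing import Callable


def cycle_consistency_reward(
    answer: str,
    backward: Callable[[str], str],
    switch_modality: Callable[[str], str],
    forward: Callable[[str], str],
) -> float:
    """Hypothetical label-free reward: 1.0 if the answer survives the cycle."""
    # Backward inference: propose a query whose answer should be `answer`.
    query = backward(answer)
    # Re-express the query in the other modality (e.g. text -> image).
    switched_query = switch_modality(query)
    # Forward inference: answer the query in the new modality.
    reconstructed = forward(switched_query)
    # No ground-truth label is needed; the starting answer is the target.
    return 1.0 if reconstructed.strip() == answer.strip() else 0.0


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_backward(answer: str) -> str:
    # Invent an addition problem whose result is the given answer.
    return f"{int(answer) - 2} + 2"


def toy_switch(query: str) -> str:
    # Pretend to render the text query as an image.
    return "[image] " + query


def toy_forward(rendered: str) -> str:
    # A "consistent" reader: recovers the expression and evaluates it.
    left, right = rendered.replace("[image] ", "").split(" + ")
    return str(int(left) + int(right))


def toy_biased_forward(rendered: str) -> str:
    # An "inconsistent" reader: systematically off by one on images.
    return str(int(toy_forward(rendered)) + 1)


consistent = cycle_consistency_reward("4", toy_backward, toy_switch, toy_forward)
inconsistent = cycle_consistency_reward("4", toy_backward, toy_switch, toy_biased_forward)
```

In this toy setup the consistent reader closes the cycle and earns the reward, while the reader with a modality-specific bias does not. A reinforcement learning objective could use that difference as a training signal without any labeled answers.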
Taxonomy
Topics: Multimodal Machine Learning Applications · Multisensory perception and integration · Action Observation and Synchronization
