On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

Rosie Zhao; Anshul Shah; Xiaoyu Zhu; Xinke Deng; Zhongyu Jiang; Yang Yang; Joerg Liebelt; Arnab Mondal

arXiv:2602.12506 · cs.LG · May 22, 2026

TL;DR

This paper investigates the robustness and consistency of RL-finetuned vision-language models, revealing vulnerabilities to textual perturbations and trade-offs between accuracy and faithfulness in reasoning tasks.

Contribution

It provides a detailed analysis of the limitations of current open-source RL-finetuned VLMs, highlighting the need for evaluation protocols that balance correctness, robustness, and faithfulness.

Findings

01. RL finetuning improves benchmark accuracy but reduces reasoning robustness.

02. Simple textual perturbations cause significant drops in model confidence and consistency.

03. Adversarial augmentation alone does not prevent faithfulness drift.

Abstract

Reinforcement learning (RL) finetuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision-language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations, including misleading captions or incorrect chain-of-thought (CoT) traces, cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. In contrast, closed models exhibit similar failure modes but maintain markedly greater robustness and reasoning consistency, suggesting that the gap reflects a shortcoming in current open-source RL finetuning rather than an…
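The perturbation setup described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' code: the prompt templates, the helper names (`add_misleading_caption`, `add_wrong_cot`, `robustness_report`), and the toy text-only `toy_model` that stands in for a VLM (the image input is omitted entirely). The sketch shows one plausible way to measure how often an answer changes when a misleading caption is prepended to the question.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]  # hypothetical schema: question, answer, wrong_caption, wrong_cot


def add_misleading_caption(ex: Example) -> str:
    # Prepend a caption that contradicts the image content (assumed template).
    return f"Caption: {ex['wrong_caption']}\n{ex['question']}"


def add_wrong_cot(ex: Example) -> str:
    # Append an incorrect reasoning trace for the model to continue (assumed template).
    return f"{ex['question']}\nReasoning so far: {ex['wrong_cot']}"


def robustness_report(
    model: Callable[[str], str],
    examples: List[Example],
    perturb: Callable[[Example], str],
) -> Dict[str, float]:
    """Compare answers on clean and perturbed prompts for the same examples."""
    clean_correct = pert_correct = flipped = 0
    for ex in examples:
        clean_answer = model(ex["question"])
        pert_answer = model(perturb(ex))
        clean_correct += clean_answer == ex["answer"]
        pert_correct += pert_answer == ex["answer"]
        flipped += clean_answer != pert_answer
    n = len(examples)
    return {
        "clean_acc": clean_correct / n,
        "perturbed_acc": pert_correct / n,
        "acc_drop": (clean_correct - pert_correct) / n,
        "flip_rate": flipped / n,
    }


def toy_model(prompt: str) -> str:
    # Stand-in for a VLM: it over-relies on text and says "dog" if the prompt mentions one.
    return "dog" if "dog" in prompt.lower() else "cat"


examples = [
    {"question": "What animal is in the image?", "answer": "cat",
     "wrong_caption": "A dog sitting on a sofa.",
     "wrong_cot": "The animal has floppy ears, so it is a dog."},
    {"question": "Which animal is shown?", "answer": "cat",
     "wrong_caption": "A bird on a branch.",
     "wrong_cot": "The animal has feathers, so it is a bird."},
]

report = robustness_report(toy_model, examples, add_misleading_caption)
print(report)
```

Passing `add_wrong_cot` instead of `add_misleading_caption` evaluates the other perturbation type named in the abstract. A real evaluation would also pass the image to the model and, to account for CoT consistency, would need a separate check that the reasoning trace supports the final answer.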

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

+ Proposes a clear and reproducible textual perturbation framework to probe VLM robustness.
+ Identifies a consistent accuracy–faithfulness trade-off during RL fine-tuning.
+ Covers a wide range of recent RL-based multimodal reasoning models and benchmarks.
+ Analysis is careful, and the findings are both timely and practically relevant.

Weaknesses

+ The paper remains primarily empirical, without a formal theoretical explanation or principled model-level intervention to mitigate the observed trade-off.
+ The faithfulness-as-reward experiments, though conceptually interesting, are underexplored; their instability and optimization dynamics merit deeper quantitative analysis.
+ The study relies on a single large-language-model judge (Qwen3-32B) to assess reasoning faithfulness, which may introduce evaluation bias; cross-validation with othe

Reviewer 02 · Rating 4 · Confidence 3

Strengths

Focusing on robustness and faithful, visually grounded reasoning, the paper uses simple, controlled textual perturbations to effectively probe modality conflict. By analyzing training dynamics, the paper indicates an accuracy–faithfulness tradeoff, shows that augmentation improves robustness while faithfulness continues to drift, and finds that adding faithfulness to the reward aligns CoT with answers yet becomes unstable when combined with augmentation, yielding limited robustness gains.

Weaknesses

The paper mainly reveals the accuracy–faithfulness disconnect and sensitivity to textual perturbations, but does not provide a training or inference method that can be readily reused. The augmentation strategies with wrong-think and wrong-caption yield clear in-distribution improvements, but evidence for transfer across datasets and tasks is limited. Out-of-distribution performance is under-reported, including results on different data sources and task types. Formatting error: “n Appendix D.1 we sh

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The paper focuses on an important question—how Reinforcement Finetuning affects the reasoning faithfulness of large vision-language models.

Weaknesses

1. The overall contribution of this paper is limited. It lacks technical innovation, does not present a solid benchmark, and offers no particularly insightful experimental findings.
   a. The paper primarily analyzes existing RL-trained VLMs through text perturbations. However, such perturbation-based evaluation has been extensively explored in prior works, such as [1].
   b. The benchmark proposed in this paper is mostly an extension of existing datasets with additional annotations of incor

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI