CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

Chaoyu Li; Deeparghya Dutta Barua; Fei Tao; Pooyan Fazli

arXiv:2601.08010·cs.CV·January 14, 2026

TL;DR

This paper introduces CASHEW, a framework that stabilizes multimodal reasoning by aggregating multiple reasoning trajectories and grounding them in visual evidence, improving performance across 13 benchmarks.

Contribution

The paper proposes two complementary approaches to stabilizing multimodal reasoning: CASHEW, an inference-time aggregation framework, and CASHEW-RL, a learned variant trained with Group Sequence Policy Optimization (GSPO).

Findings

01. Up to +23.6 percentage points on ScienceQA
02. Up to +8.1 percentage points on EgoSchema
03. Significant performance improvements across 13 benchmarks

Abstract

Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal…
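The inference-time loop described in the abstract (sample several candidate trajectories, filter steps that fail visual verification, aggregate, repeat) can be illustrated with a toy sketch. This is not the authors' implementation: the `Trajectory` type, the `sample` and `is_grounded` callables, and the majority-vote aggregation rule are all hypothetical stand-ins chosen for illustration, and the real system presumably uses a vision-language model for each of these roles.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    steps: List[str]  # intermediate reasoning steps
    answer: str       # final prediction


def filter_steps(traj: Trajectory, is_grounded: Callable[[str], bool]) -> Trajectory:
    """Drop steps rejected by the (hypothetical) visual verifier."""
    return Trajectory([s for s in traj.steps if is_grounded(s)], traj.answer)


def aggregate(trajs: List[Trajectory]) -> Trajectory:
    """Toy aggregation (an assumption, not the paper's rule):
    take the majority answer and merge the de-duplicated steps
    of the trajectories that support it, preserving order."""
    winner = Counter(t.answer for t in trajs).most_common(1)[0][0]
    merged: List[str] = []
    for t in trajs:
        if t.answer == winner:
            for s in t.steps:
                if s not in merged:
                    merged.append(s)
    return Trajectory(merged, winner)


def iterative_aggregate(
    sample: Callable[[Optional[Trajectory], int], List[Trajectory]],
    is_grounded: Callable[[str], bool],
    rounds: int = 2,
    k: int = 3,
) -> Trajectory:
    """Each round: sample k candidates (optionally conditioned on the
    previous aggregate), filter ungrounded steps, then aggregate."""
    current: Optional[Trajectory] = None
    for _ in range(rounds):
        candidates = [filter_steps(t, is_grounded) for t in sample(current, k)]
        if current is not None:
            candidates.append(current)  # carry the previous aggregate forward
        current = aggregate(candidates)
    return current
```

In this sketch `sample` would wrap repeated model calls on the same image and question, and `is_grounded` would wrap a check of a single step against the image; both are left as parameters because the abstract does not specify them.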

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications