Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Shaked Perek; Ben Wiesel; Avihu Dekel; Nimrod Shabtay; Eli Schwartz

arXiv:2603.18656 · cs.AI · March 20, 2026

TL;DR

This paper introduces SCALe, a dynamic loss-weighting method that improves vision-language model reasoning by gradually shifting training focus from reasoning traces to answer segments, yielding more accurate and concise outputs with less training time.

Contribution

SCALe is a novel training approach that adaptively balances reasoning and answer supervision, outperforming standard fine-tuning and matching more complex pipelines.

Findings

01. SCALe improves accuracy over vanilla SFT.

02. SCALe reduces training time significantly.

03. Combining SCALe with reinforcement learning yields the best results.

Abstract

Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long <think> traces overshadow short but task-critical <answer> segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the <think> segment, SCALe-SFT gradually shifts the focus from <think> to <answer> throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently…
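
The abstract is cut off before any formula appears, so the following is only a minimal sketch of the idea it describes: averaging the loss separately over the <think> and <answer> tokens (which makes the weighting independent of segment length) and moving the weight toward <answer> along a cosine schedule. The function names, the per-segment mean, and the endpoint weights 0.5 and 0.9 are illustrative assumptions, not values or definitions taken from the paper.

```python
import math


def cosine_answer_weight(step, total_steps, w_start=0.5, w_end=0.9):
    """Hypothetical schedule: the answer weight moves from w_start to w_end
    along half a cosine wave. The endpoints are illustrative assumptions."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def vanilla_loss(token_losses):
    """Standard SFT: every token contributes equally to the mean."""
    return sum(token_losses) / len(token_losses)


def segment_balanced_loss(token_losses, is_answer, answer_weight):
    """Average each segment separately, then mix the two means.

    Because each segment is reduced to its own mean first, a long reasoning
    trace cannot outweigh a short answer purely by token count.
    """
    think = [loss for loss, ans in zip(token_losses, is_answer) if not ans]
    answer = [loss for loss, ans in zip(token_losses, is_answer) if ans]
    think_mean = sum(think) / len(think) if think else 0.0
    answer_mean = sum(answer) / len(answer) if answer else 0.0
    return (1.0 - answer_weight) * think_mean + answer_weight * answer_mean


# Toy sequence: 98 reasoning tokens with loss 1.0, 2 answer tokens with loss 3.0.
token_losses = [1.0] * 98 + [3.0] * 2
is_answer = [False] * 98 + [True] * 2

flat = vanilla_loss(token_losses)
early = segment_balanced_loss(token_losses, is_answer, cosine_answer_weight(0, 1000))
late = segment_balanced_loss(token_losses, is_answer, cosine_answer_weight(1000, 1000))
print(flat, early, late)
```

In this toy case the two poorly predicted answer tokens barely move the flat average, while the segment-balanced loss reflects them from the first step and increasingly so as the schedule advances. In a real training loop the per-token losses would come from an unreduced cross-entropy over the model's logits, with the mask derived from the tag positions.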

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning