Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

Jiajun Fan; Roger Ren; Jingyuan Li; Rahul Pandey; Prashanth Gurunath Shivakumar; Ivan Bulyko; Ankur Gandhe; Ge Liu; Yile Gu

arXiv:2510.20867·cs.LG·October 27, 2025

Open Access 3 Reviews

TL;DR

This paper introduces CESAR, a reinforcement learning framework that significantly improves reasoning capabilities in Audio Large Language Models by incentivizing consistent and effective reasoning processes, overcoming previous limitations of test-time inverse scaling.

Contribution

We propose CESAR, a novel reward-based training method that enhances reasoning in Audio LLMs, addressing test-time inverse scaling and revealing optimal reasoning depths.

Findings

1. CESAR achieves state-of-the-art results on MMAU Test-mini.
2. Models trained with CESAR outperform Gemini 2.5 Pro and GPT-4o Audio.
3. Enhanced reasoning improves multimodal reasoning and perception capabilities.

Abstract

The role of reasoning in Audio Large Language Models remains widely underexplored, as introducing a reasoning process often degrades rather than improves performance during inference, a phenomenon we term test-time inverse scaling, where longer reasoning chains yield progressively worse results. We demonstrate that this stems not from fundamental limitations of reasoning itself, but from inadequate training: models without proper guidance for the reasoning process produce hallucinatory, inconsistent reasoning that accumulates errors over longer chains. To address these challenges, we introduce CESAR (Consistent, Effective, and Scalable Audio Reasoners), shifting from outcome verification to rewarding the reasoning process. Our online reinforcement learning framework employs Group Relative Policy Optimization with a multi-faceted reward suite that incentivizes not only correctness and…
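The abstract's combination of Group Relative Policy Optimization with a multi-part reward can be illustrated with a minimal sketch. It is not the authors' implementation: the component names, the equal weights, and the use of the population standard deviation are assumptions made for illustration only. It shows a weighted sum of per-component rewards for each sampled response, followed by the group-relative normalization used in GRPO, where each response's reward is centered and scaled by the statistics of its own sampling group.

```python
from statistics import mean, pstdev


def combined_reward(components, weights):
    """Weighted sum of per-component rewards for one sampled response."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of samples for the same prompt.

    Assumption: population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward components and weights (placeholders, not the paper's).
weights = {"correctness": 1.0, "format": 1.0, "consistency": 1.0}

# Four hypothetical responses sampled for the same audio question.
group = [
    {"correctness": 1.0, "format": 1.0, "consistency": 1.0},
    {"correctness": 1.0, "format": 1.0, "consistency": 0.0},
    {"correctness": 0.0, "format": 1.0, "consistency": 0.0},
    {"correctness": 0.0, "format": 0.0, "consistency": 0.0},
]

rewards = [combined_reward(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
```

In this sketch two responses with the same final answer can receive different advantages when their process-level components differ, which is the general idea behind rewarding the reasoning process rather than only the outcome.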

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper is well structured with a clear framework diagram.
- Moving beyond outcome‑only RL to process‑oriented rewards for Audio LLM reasoning is a good and practical direction.
- Consistent gains with the proposed approach.

Weaknesses

- The paper argues that introducing CoT at inference degrades accuracy in Audio LLMs unless the reasoning process is explicitly trained. However, AURELIA [1] reports improvements by injecting structured, step‑by‑step reasoning into AV‑LLMs at test time without additional training, which appears to contradict the generality of this claim. A head‑to‑head comparison (e.g., an AURELIA‑style inference‑only reasoning condition) would clarify boundaries of the inverse‑scaling effect the authors report

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The problem statement is clear and the method is sound.
2. The paper is very neatly written and easy to read and follow.
3. The results on the datasets used show good performance improvements.
4. The Appendix (Supplementary Material) is very thorough and contains very relevant discussions.

Weaknesses

1. The ablation study shows the quantitative impact of each reward component, but a qualitative analysis is missing. It would be helpful to see examples of the reasoning text to understand how the Keywords Reward changes the model's thinking. Furthermore, the selection of the keywords feels a bit arbitrary. A clearer explanation for why these specific keywords were chosen would strengthen this part of the method.
2. The reward function has several weighted parts. The paper provides one set of we

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. The paper is technically sound. Its claims about solving "test-time inverse scaling" are strongly supported by extensive experiments, including state-of-the-art (SOTA) comparisons, comprehensive ablation studies, and "test-time scaling" analysis.
2. The paper is exceptionally well-organized with a clear, logical narrative. It provides a high degree of transparency, including pseudocode and detailed keyword tables, which significantly aids reader understanding and the ability to reproduce the

Weaknesses

1. The proposed multi-faceted reward suite introduces five new reward weights ($\alpha_j$) that must be balanced and tuned. While the authors provide a simple, effective ratio, this still represents an added layer of complexity compared to simpler reward models.
2. The readability of Figure 1 is poor. The text within the "Qualitative Comparison of Reasoning Process" section, which is meant to show examples of reasoning failures and improvements, is far too small. This makes it very difficult f

Taxonomy

Topics: Topic Modeling · Music and Audio Processing · Multimodal Machine Learning Applications