Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
Xuan Qi

TL;DR
Brief chain-of-thought reasoning substantially improves function-calling accuracy in language agents, whereas extended reasoning can degrade it below the no-CoT baseline. The study proposes a structured brief-CoT method for better reliability.
Contribution
It uncovers the non-monotonic effects of reasoning length on agent accuracy and introduces Function-Routing CoT (FR-CoT), a structured approach that enhances reliability without budget tuning.
Findings
Brief reasoning (32 tokens) boosts accuracy by 45% relative to direct answering, from 44.0% to 64.0%.
Long reasoning (256 tokens) degrades accuracy to 25.0%, well below the no-CoT baseline of 44.0%.
FR-CoT reduces hallucinations to 0% while keeping accuracy on par with brief reasoning.
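The relative figures above follow directly from the accuracies reported in the abstract. A minimal sketch of the arithmetic, where the helper name `relative_change` and the constant names are illustrative and not taken from the paper:

```python
def relative_change(baseline: float, new: float) -> float:
    """Return the change of `new` versus `baseline` as a fraction of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Accuracies (in percent) reported in the abstract for Qwen2.5-1.5B-Instruct.
NO_COT = 44.0     # 0-token budget (direct answer)
BRIEF_COT = 64.0  # 32-token budget
LONG_COT = 25.0   # 256-token budget

brief_gain = relative_change(NO_COT, BRIEF_COT)
long_drop = relative_change(NO_COT, LONG_COT)

print(f"32 tokens:  {BRIEF_COT - NO_COT:+.1f} points, {brief_gain:+.1%} relative")
print(f"256 tokens: {LONG_COT - NO_COT:+.1f} points, {long_drop:+.1%} relative")
# 32 tokens:  +20.0 points, +45.5% relative
# 256 tokens: -19.0 points, -43.2% relative
```

The distinction matters when reading the headline number: the 45% gain is relative to the 44.0% baseline, which corresponds to a 20-point absolute improvement.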
Abstract
How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0 to 512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the…
