On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset
Vishvesh Bhat, Omkar Ghugarkar, Julian McAuley

TL;DR
This paper evaluates how well large language models generalize across agentic tool-calling tasks, introduces MAVEN, a challenging new benchmark, and proposes CoreThink, a symbolic reasoning framework that improves performance without extra training.
Contribution
The paper introduces MAVEN, a new OOD benchmark for multi-step reasoning, and proposes CoreThink, a symbolic reasoning layer that enhances LLM generalization across diverse tool-use environments.
Findings
Most models score below 50% on MAVEN, indicating a large generalization gap.
CoreThink achieves 5-30% performance improvements over baselines.
CoreThink generalizes across all benchmarks without additional training.
Abstract
Generalization across agentic tool-calling environments remains a key unsolved challenge in developing reliable agentic reasoning systems. While large language models (LLMs) demonstrate strong performance on isolated benchmarks, their ability to transfer reasoning strategies and coordinate tools across diverse domains is poorly understood. In this work, we conduct a large-scale evaluation of state-of-the-art LLMs on multiple tool-calling benchmarks (BFCL v3, TauBench, Tau2Bench, and AceBench) and introduce MAVEN (Math & Physics Adversarial Verification & Evaluation Network), a new out-of-distribution (OOD) benchmark designed to stress-test multi-step reasoning through explicit verification and adversarial task composition. Our results show that most current models achieve below 50% accuracy on MAVEN, revealing a significant generalization gap across tool-use settings. To address this, we…
