On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset

Vishvesh Bhat; Omkar Ghugarkar; Julian McAuley

arXiv:2510.22898·cs.AI·October 28, 2025

TL;DR

This paper evaluates how well large language models generalize across agentic tool-calling tasks, introduces MAVEN, a challenging new benchmark, and proposes CoreThink, a symbolic reasoning framework that significantly improves performance without extra training.

Contribution

The paper introduces MAVEN, a new OOD benchmark for multi-step reasoning, and proposes CoreThink, a symbolic reasoning layer that enhances LLM generalization across diverse tool-use environments.

Findings

01

Most models score below 50% on MAVEN, indicating a large generalization gap.

02

CoreThink achieves a 530% performance improvement over baselines.

03

CoreThink generalizes across all benchmarks without additional training.

Abstract

Generalization across agentic tool-calling environments remains a key unsolved challenge in developing reliable agentic reasoning systems. While large language models (LLMs) demonstrate strong performance on isolated benchmarks, their ability to transfer reasoning strategies and coordinate tools across diverse domains is poorly understood. In this work, we conduct a large-scale evaluation of state-of-the-art LLMs on multiple tool-calling benchmarks (BFCL v3, TauBench, Tau2Bench, and AceBench) and introduce MAVEN (Math & Physics Adversarial Verification & Evaluation Network), a new out-of-distribution (OOD) benchmark designed to stress-test multi-step reasoning through explicit verification and adversarial task composition. Our results show that most current models achieve below 50% accuracy on MAVEN, revealing a significant generalization gap across tool-use settings. To address this, we…
