When Do Tools and Planning Help Large Language Models Think? A Cost- and Latency-Aware Benchmark

Subha Ghoshal; Ali Al-Bustami

arXiv:2601.02663·cs.CL·March 6, 2026


TL;DR

This paper benchmarks the accuracy, latency, and token cost of adding tools and planning to large language model reasoning on two tasks. It finds that the benefits depend on the task and the model size, and that they come with substantial added latency.

Contribution

It provides a cost- and latency-aware benchmark of tool-augmented LLM reasoning on two real-world tasks, Event-QA and ChangeMyView (CMV), showing when planning and tools improve performance.

Findings

1. Tool augmentation improves accuracy on Event-QA but increases latency significantly.

2. For ChangeMyView (CMV), simple prompting is faster than the tool-based approach and is sometimes more accurate.

3. Task-specific, cost-aware strategies are essential for effective use of tools and planning in LLMs.

Abstract

Modern large language models (LLMs) increasingly rely on inference-time planning and external tools to improve reasoning. We benchmark this behavior on two real-world settings: event-centric question answering over graph-structured knowledge (Event-QA) and persuasive response generation in Reddit ChangeMyView (CMV). Using LangChain and LangGraph, we compare a one-shot baseline against a plan-execute-replan agent equipped with task-specific tools (DBpedia SPARQL/lookup/schema exploration, Wikipedia-focused retrieval, and topical web search). We evaluate on 60 examples each from Event-QA and CMV (3 splits of 20), and report both mean end-to-end latency and per-example token cost estimates. We evaluate GPT-4o and GPT-4o-mini under identical workflows and report accuracy and end-to-end latency. On Event-QA, the best tool-augmented configuration improves accuracy (e.g., 47.5% → …
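The measurement protocol described in the abstract (accuracy, mean end-to-end latency, and per-example token cost for each configuration) can be sketched roughly as follows. This is a minimal illustration and not the authors' harness: the `Runner` interface, the exact-match scoring, and the per-1K-token prices are all hypothetical placeholders, and no LangChain or LangGraph calls are shown.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical prices in USD per 1K tokens; placeholders, not the paper's figures.
PRICE_PER_1K_INPUT = 0.005
PRICE_PER_1K_OUTPUT = 0.015

# Assumed interface: a runner takes a question and returns
# (answer, prompt_tokens, completion_tokens). A one-shot baseline and a
# plan-execute-replan agent would each be wrapped to fit this shape.
Runner = Callable[[str], Tuple[str, int, int]]


@dataclass
class RunResult:
    correct: bool
    latency_s: float
    cost_usd: float


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one example from its token counts."""
    return (prompt_tokens / 1000) * PRICE_PER_1K_INPUT + (
        completion_tokens / 1000
    ) * PRICE_PER_1K_OUTPUT


def evaluate(
    runner: Runner,
    examples: Iterable[Tuple[str, str]],
    clock: Callable[[], float] = time.perf_counter,
) -> List[RunResult]:
    """Run every (question, gold answer) pair and record correctness, latency, cost."""
    results = []
    for question, gold in examples:
        start = clock()
        answer, prompt_tokens, completion_tokens = runner(question)
        latency = clock() - start  # end-to-end wall-clock time for this example
        # Exact match after normalization; an assumed scoring rule for illustration.
        correct = answer.strip().lower() == gold.strip().lower()
        results.append(
            RunResult(correct, latency, estimate_cost(prompt_tokens, completion_tokens))
        )
    return results


def summarize(results: List[RunResult]) -> Dict[str, float]:
    """Aggregate per-example results into the three reported quantities."""
    n = len(results)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
        "mean_cost_usd": sum(r.cost_usd for r in results) / n,
    }
```

Calling `summarize(evaluate(runner, examples))` once per configuration (for instance, a baseline runner and an agent runner over the same examples) would give directly comparable accuracy, latency, and cost figures under these assumptions.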


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks