Graph-Based Self-Healing Tool Routing for Cost-Efficient LLM Agents
Neeraj Bholani

TL;DR
This paper introduces Self-Healing Router, a fault-tolerant system for LLM agent tool routing that reduces costs and improves reliability by automatically rerouting around failures without additional LLM calls.
Contribution
The paper presents a novel runtime fault-tolerance mechanism that uses deterministic shortest-path routing and binary observability, whereas prior systems focus on planning.
Findings
Reduces control-plane LLM calls by 93%
Eliminates silent failures in static workflows
Matches correctness of existing methods
Abstract
Tool-using LLM agents face a reliability-cost tradeoff: routing every decision through the LLM improves correctness but incurs high latency and inference cost, while pre-coded workflow graphs reduce cost but become brittle under unanticipated compound tool failures. We present Self-Healing Router, a fault-tolerant orchestration architecture that treats most agent control-flow decisions as routing rather than reasoning. The system combines (i) parallel health monitors that assign priority scores to runtime conditions such as tool outages and risk signals, and (ii) a cost-weighted tool graph where Dijkstra's algorithm performs deterministic shortest-path routing. When a tool fails mid-execution, its edges are reweighted to infinity and the path is recomputed -- yielding automatic recovery without invoking the LLM. The LLM is reserved exclusively for cases where no feasible path exists,…
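The reroute-on-failure idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the graph layout, the tool names (search_api, cache_lookup, summarize), the edge costs, and the helper names shortest_path, mark_failed, and route are all hypothetical, and the health monitors and priority scores are left out. It only shows Dijkstra's algorithm over a cost-weighted tool graph, edges set to infinity when a tool fails, and a fallback signal when no feasible path remains.

```python
import heapq
import math


def shortest_path(graph, source, target):
    """Dijkstra over {node: {neighbor: cost}}. Returns (cost, path) or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            # Walk predecessor links back to the source.
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return d, path[::-1]
        for nbr, cost in graph.get(node, {}).items():
            if math.isinf(cost):
                continue  # edge belongs to a failed tool
            nd = d + cost
            if nd < dist.get(nbr, math.inf):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    return None  # no feasible path


def mark_failed(graph, tool):
    """Reweight every edge into and out of `tool` to infinity."""
    for nbrs in graph.values():
        if tool in nbrs:
            nbrs[tool] = math.inf
    for nbr in graph.get(tool, {}):
        graph[tool][nbr] = math.inf


def route(graph, source, target):
    """Return ("route", path), or ("escalate_to_llm", None) if no path exists."""
    result = shortest_path(graph, source, target)
    if result is None:
        return ("escalate_to_llm", None)
    return ("route", result[1])


def build_demo_graph():
    # Hypothetical tool graph; costs are illustrative only.
    return {
        "start": {"search_api": 1.0, "cache_lookup": 3.0},
        "search_api": {"summarize": 1.0},
        "cache_lookup": {"summarize": 1.0},
        "summarize": {"done": 1.0},
        "done": {},
    }


if __name__ == "__main__":
    g = build_demo_graph()
    print(route(g, "start", "done"))   # cheapest path via search_api
    mark_failed(g, "search_api")
    print(route(g, "start", "done"))   # recomputed path via cache_lookup
    mark_failed(g, "cache_lookup")
    print(route(g, "start", "done"))   # no path left: escalate
```

In this sketch the recovery step is just a second shortest-path call on the reweighted graph, so no model call is needed until both hypothetical alternatives have failed.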
Taxonomy
Topics: Scientific Computing and Data Management · Software System Performance and Reliability · Software-Defined Networks and 5G
