The Amazing Agent Race: Strong Tool Users, Weak Navigators

Zae Myung Kim; Dongseok Lee; Jaehyung Kim; Vipul Raheja; Dongyeop Kang

arXiv:2604.10261·cs.AI·April 20, 2026

TL;DR

The paper introduces The Amazing Agent Race (AAR), a benchmark of directed acyclic graph (DAG) puzzles that evaluates LLM agents' navigation and tool-use abilities. It finds that current agents use tools well but often fail at navigation.

Contribution

It presents a DAG-based benchmark with four difficulty levels and live-API validation, and identifies navigation as a key challenge for LLM agents beyond tool execution.

Findings

01. The best agent achieves only 37.2% accuracy on AAR.
02. Navigation errors account for 27-52% of failures.
03. Agent architecture and scale significantly impact performance.

Abstract

Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy.…
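The distinction the abstract draws between a linear chain and a fork-merge DAG can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: a leg is represented as a list of directed edges between steps, the input is assumed to be acyclic, and the step names (`search`, `page`, `ratio`, and so on) and function names are hypothetical.

```python
from collections import Counter


def degrees(edges):
    """Return the node set plus in-degree and out-degree counters."""
    in_deg = Counter(dst for _, dst in edges)
    out_deg = Counter(src for src, _ in edges)
    nodes = set(in_deg) | set(out_deg)
    return nodes, in_deg, out_deg


def is_linear_chain(edges):
    """True if an acyclic edge list forms one simple path (no forks, no merges)."""
    nodes, in_deg, out_deg = degrees(edges)
    # A single path over n nodes has n - 1 edges and no node with degree above 1.
    return len(edges) == len(nodes) - 1 and all(
        in_deg[n] <= 1 and out_deg[n] <= 1 for n in nodes
    )


def fork_and_merge_nodes(edges):
    """Return (forks, merges): nodes with several successors or predecessors."""
    nodes, in_deg, out_deg = degrees(edges)
    forks = sorted(n for n in nodes if out_deg[n] > 1)
    merges = sorted(n for n in nodes if in_deg[n] > 1)
    return forks, merges


# Hypothetical sequential leg: each step feeds exactly one next step.
sequential = [("search", "fetch"), ("fetch", "extract"), ("extract", "answer")]

# Hypothetical compositional leg: one page feeds two tool calls whose
# results are merged before the final answer.
compositional = [
    ("page", "population"),
    ("page", "elevation"),
    ("population", "ratio"),
    ("elevation", "ratio"),
    ("ratio", "answer"),
]

print(is_linear_chain(sequential))
print(is_linear_chain(compositional))
print(fork_and_merge_nodes(compositional))
```

Under this representation, a check of this kind is one plausible way to count how many instances of a benchmark are simple chains, although the paper's own analysis procedure is not described in the abstract.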
