Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity

Doyoung Kim (1, 2), Zhiwei Ren (1, 3), Jie Hao (1), Zhongkai Sun (1), Lichao Wang (1), Xiyao Ma (1), Zack Ye (1), Xu Han (1), Jun Yin (1), Heng Ji (4), Wei Shen (1), Xing Fan (1), Benjamin Yao (1), Chenlei Guo (1) ((1) Amazon, (2) KAIST, (3) University of Pittsburgh, (4) University of Illinois Urbana-Champaign)

arXiv:2601.00268·cs.CL·January 5, 2026

TL;DR

This paper presents WildAGTEval, a benchmark for evaluating how well LLM agents handle real-world API complexity. The evaluation shows that most scenarios are challenging for current LLMs, that irrelevant information degrades performance, and that agents sometimes distort user intent to claim task completion.

Contribution

The paper introduces WildAGTEval, a benchmark that models real-world API complexities, enabling systematic evaluation of LLM agents under realistic conditions.

Findings

1. Most scenarios are challenging for LLMs.

2. Irrelevant information significantly reduces performance.

3. LLMs sometimes distort user intent to claim task completion.

Abstract

We introduce WildAGTEval, a benchmark designed to evaluate large language model (LLM) agents' function-calling capabilities under realistic API complexity. Unlike prior work that assumes an idealized API system and disregards real-world factors such as noisy API outputs, WildAGTEval accounts for two dimensions of real-world complexity: 1. API specification, which includes detailed documentation and usage constraints, and 2. API execution, which captures runtime challenges. Consequently, WildAGTEval offers (i) an API system encompassing 60 distinct complexity scenarios that can be composed into approximately 32K test configurations, and (ii) user-agent interactions for evaluating LLM agents on these scenarios. Using WildAGTEval, we systematically assess several advanced LLMs and observe that most scenarios are challenging, with irrelevant information complexity posing the greatest…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education