AppSelectBench: Application-Level Tool Selection Benchmark

Tianyi Chen; Michael Solodko; Sen Wang; Jongwoo Ko; Junheng Hao; Colby Banbury; Sara Abdali; Saeed Amizadeh; Qing Xiao; Yinheng Li; Tianyu Ding; Kamran Ghasedi Dizaji; Suzhen Zheng; Hao Fan; Justin Wagle; Pashmina Cameron; Kazuhito Koishida

arXiv:2511.19957 · cs.CL · December 1, 2025

Open Access

TL;DR

AppSelectBench is a comprehensive benchmark for evaluating application-level reasoning in computer-using agents (CUAs). It addresses a gap in existing benchmarks, which primarily assess fine-grained API selection, by testing whether an agent selects the appropriate application for a realistic user intent.

Contribution

The paper introduces a large-scale, diverse, and realistic benchmark, with evaluation protocols, for assessing application selection in CUAs, together with a novel user task generation pipeline.

Findings

01. Models show systematic weaknesses in inter-application reasoning.

02. Even the most capable models struggle to make consistent application choices.

03. The benchmark reveals strengths and weaknesses of large language models in application reasoning.

Abstract

Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at…
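
The two-stage idea in the abstract (choose an application first, then consider only that application's fine-grained tools) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the application names, tool names, keyword sets, and the exact-match accuracy metric are hypothetical and are not taken from AppSelectBench, and the keyword-overlap scorer merely stands in for a model's application-level reasoning.

```python
# Hypothetical catalog: application name -> fine-grained tools it exposes.
# None of these names come from AppSelectBench; they are illustrative only.
APP_TOOLS = {
    "spreadsheet": ["insert_row", "sum_column"],
    "email": ["compose_message", "search_inbox"],
    "calendar": ["create_event", "list_events"],
}

# Hypothetical keyword hints standing in for a model's reasoning.
APP_KEYWORDS = {
    "spreadsheet": {"column", "cell", "total", "sheet"},
    "email": {"email", "inbox", "reply", "send"},
    "calendar": {"meeting", "schedule", "event", "remind"},
}


def select_application(intent: str) -> str:
    """Stage 1: choose an application before considering any tools."""
    words = set(intent.lower().split())
    # Score each application by keyword overlap; ties resolve by catalog order.
    return max(APP_KEYWORDS, key=lambda app: len(words & APP_KEYWORDS[app]))


def candidate_tools(intent: str) -> list[str]:
    """Stage 2: only the chosen application's tools remain in scope."""
    return APP_TOOLS[select_application(intent)]


def selection_accuracy(examples: list[tuple[str, str]]) -> float:
    """Fraction of intents whose predicted application matches the label."""
    correct = sum(select_application(intent) == app for intent, app in examples)
    return correct / len(examples)
```

In this sketch a wrong choice at stage 1 removes the correct tools from scope entirely, which is one way to read the abstract's point that application selection determines whether the agent initializes the correct environment.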

Taxonomy

Topics: Software Engineering Research · Spreadsheets and End-User Computing · Advanced Software Engineering Methodologies