CCTU: A Benchmark for Tool Use under Complex Constraints

Junjie Ye; Guoqiang Zhang; Wenjie Fu; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2603.15309·cs.CL·March 17, 2026

TL;DR

CCTU is a comprehensive benchmark designed to evaluate large language models' ability to use tools under complex, multi-faceted constraints, revealing significant limitations in current models' adherence and self-refinement capabilities.

Contribution

We introduce CCTU, a novel benchmark with detailed constraint categories and validation tools, enabling systematic evaluation of LLM tool use under complex constraints.

Findings

01 Models rarely achieve high success rates under strict constraints.

02 Models violate constraints in over 50% of cases, especially in the resource and response dimensions.

03 Models show limited self-refinement even with detailed feedback.

Abstract

Solving problems through tool use under explicit constraints constitutes a highly challenging yet unavoidable scenario for large language models (LLMs), requiring capabilities such as function calling, instruction following, and self-refinement. However, progress has been hindered by the absence of dedicated evaluations. To address this, we introduce CCTU, a benchmark for evaluating LLM tool use under complex constraints. CCTU is grounded in a taxonomy of 12 constraint categories spanning four dimensions (i.e., resource, behavior, toolset, and response). The benchmark comprises 200 carefully curated and challenging test cases across diverse tool-use scenarios, each involving an average of seven constraint types and an average prompt length exceeding 4,700 tokens. To enable reliable evaluation, we develop an executable constraint validation module that performs step-level validation and…
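To make the idea of step-level constraint validation concrete, here is a minimal sketch in Python. It is not the CCTU validation module: the `Step` and `Constraint` types, the three example constraints, and the `validate_stepwise` helper are all hypothetical names invented for illustration. Only the four dimension labels (resource, behavior, toolset, response) come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One model turn: a tool call or a final text response (hypothetical shape)."""
    tool: Optional[str] = None       # name of the tool called, if any
    response: Optional[str] = None   # final answer text, if any


@dataclass
class Constraint:
    """A named check over the trajectory seen so far (hypothetical shape)."""
    name: str
    dimension: str  # one of: resource, behavior, toolset, response
    check: Callable[[List[Step]], bool]  # True if the prefix satisfies the constraint


def max_tool_calls(limit: int) -> Constraint:
    # Illustrative resource constraint: cap the total number of tool calls.
    return Constraint(
        name=f"max_tool_calls<={limit}",
        dimension="resource",
        check=lambda steps: sum(s.tool is not None for s in steps) <= limit,
    )


def allowed_tools(names: List[str]) -> Constraint:
    # Illustrative toolset constraint: only whitelisted tools may be called.
    allowed = set(names)
    return Constraint(
        name="allowed_tools",
        dimension="toolset",
        check=lambda steps: all(s.tool is None or s.tool in allowed for s in steps),
    )


def max_response_words(limit: int) -> Constraint:
    # Illustrative response constraint: cap the word count of any final response.
    return Constraint(
        name=f"max_response_words<={limit}",
        dimension="response",
        check=lambda steps: all(
            s.response is None or len(s.response.split()) <= limit for s in steps
        ),
    )


def validate_stepwise(
    steps: List[Step], constraints: List[Constraint]
) -> List[Tuple[int, str, str]]:
    """Check every constraint after each step; report each first violation
    as (step_index, constraint_name, dimension)."""
    violations = []
    already_violated = set()
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        for c in constraints:
            if c.name not in already_violated and not c.check(prefix):
                already_violated.add(c.name)
                violations.append((i - 1, c.name, c.dimension))
    return violations


trajectory = [
    Step(tool="search"),
    Step(tool="search"),
    Step(tool="calculator"),          # third call, and not a whitelisted tool
    Step(response="The answer is 42."),
]
constraints = [max_tool_calls(2), allowed_tools(["search"]), max_response_words(10)]
violations = validate_stepwise(trajectory, constraints)
for step_index, name, dimension in violations:
    print(f"step {step_index}: violated {name} ({dimension})")
```

Checking after every step, rather than only at the end, pins each violation to the step that caused it, which is the kind of detail that feedback for a self-refinement round would need.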

Code & Models

Datasets

Junjie-Ye/CCTU
dataset · 42 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · AI in Service Interactions