SCAN: Structured Capability Assessment and Navigation for LLMs

Zongqi Wang; Tianle Gu; Chen Gong; Xin Tian; Siqi Bao; Yujiu Yang

arXiv:2505.06698 · cs.CL · May 4, 2026

TL;DR

SCAN is a hierarchical framework for evaluating the fine-grained capabilities of large language models, giving a more detailed picture of a specific model than the overall rankings that existing benchmarks approximate.

Contribution

The paper introduces a comprehensive, fine-grained evaluation framework that combines automatic taxonomy extraction, query synthesis and filtering, visualization tools, and an improved LLM-as-a-Judge approach.

Findings

01. Substantial performance variation within LLM sub-capabilities

02. Fine-grained evaluation reveals nuanced model behaviors

03. The PC^2-based approach improves judgment accuracy

Abstract

Evaluating Large Language Models (LLMs) has become increasingly important, with automatic evaluation benchmarks gaining prominence as alternatives to human evaluation. While existing research has focused on approximating model rankings, such benchmarks fail to provide users and developers with a comprehensive and fine-grained understanding of a specific model's capabilities. To fill this gap, we propose SCAN (Structured Capability Assessment and Navigation), a practical framework that enables detailed characterization of LLM capabilities through comprehensive and fine-grained evaluation. SCAN incorporates four key components: (1) TaxBuilder, which extracts capability-indicating tags from extensive queries to construct a hierarchical taxonomy automatically; (2) RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag; (3)…

Code & Models

Datasets

DanliuDanliu/SCAN-Dataset
dataset · 53 downloads
