SCAN: Structured Capability Assessment and Navigation for LLMs
Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang

TL;DR
SCAN is a hierarchical framework for evaluating the fine-grained capabilities of large language models, going beyond the overall model rankings that existing benchmarks approximate.
Contribution
SCAN introduces a fine-grained evaluation framework that combines automatic taxonomy extraction, query synthesis and filtering, visualization tools, and an improved LLM-as-a-Judge approach.
Findings
Substantial performance variation within LLM sub-capabilities
Fine-grained evaluation reveals nuanced model behaviors
The PC^2-based approach improves judgment accuracy
Abstract
Evaluating Large Language Models (LLMs) has become increasingly important, with automatic evaluation benchmarks gaining prominence as alternatives to human evaluation. While existing research has focused on approximating model rankings, such benchmarks fail to provide users and developers with a comprehensive and fine-grained understanding of a specific model's capabilities. To fill this gap, we propose SCAN (Structured Capability Assessment and Navigation), a practical framework that enables detailed characterization of LLM capabilities through comprehensive and fine-grained evaluation. SCAN incorporates four key components: (1) TaxBuilder, which extracts capability-indicating tags from extensive queries to construct a hierarchical taxonomy automatically; (2) RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag; (3)…
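To make the idea of a hierarchical capability taxonomy concrete, the following is a minimal sketch of how per-query judge scores could be rolled up a tag hierarchy so that both a top-level capability and its sub-capabilities receive a score. It is an illustration only, not SCAN's implementation: the function name `aggregate_scores`, the tag paths, the 0-10 score scale, and simple averaging are all assumptions made for this example.

```python
from collections import defaultdict


def aggregate_scores(records):
    """Average scores at every level of a tag hierarchy.

    `records` is an iterable of (tag_path, score) pairs, where tag_path is a
    tuple such as ("Reasoning", "Math"). This data layout is hypothetical and
    chosen only for illustration.
    """
    totals = defaultdict(lambda: [0.0, 0])  # node -> [sum of scores, count]
    for path, score in records:
        # Credit the score to the leaf tag and to each of its ancestors.
        for depth in range(1, len(path) + 1):
            node = path[:depth]
            totals[node][0] += score
            totals[node][1] += 1
    return {node: total / count for node, (total, count) in totals.items()}


# Made-up scores, used only to exercise the function.
records = [
    (("Reasoning", "Math"), 8.0),
    (("Reasoning", "Math"), 6.0),
    (("Reasoning", "Logic"), 4.0),
    (("Writing", "Summarization"), 9.0),
]

report = aggregate_scores(records)
for node in sorted(report):
    # Indent each tag by its depth to show the hierarchy.
    print("  " * (len(node) - 1) + node[-1], round(report[node], 2))
```

In this toy data, a single top-level average for "Reasoning" would hide the gap between its "Math" and "Logic" sub-tags, which is the kind of within-capability variation that a fine-grained evaluation is meant to surface.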
