ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs

Rui Fang; Jian Li; Wei Chen; Bin Hu; Ying-Cong Chen; Xin Tang; Liang Diao

arXiv:2601.17399 · cs.CV · January 27, 2026

TL;DR

ReLE is a scalable, structured evaluation system for diagnosing capability anisotropy in Chinese LLMs, reducing costs and revealing model specialization and trade-offs across domains and capabilities.

Contribution

The paper introduces ReLE, an evaluation system that combines a hybrid scoring mechanism with a variance-aware scheduler to enable efficient, fine-grained analysis of Chinese LLMs' capabilities.

Findings

01. ReLE reduces evaluation costs by 70% while maintaining high ranking correlation.

02. Models show high specialization with a Rank Stability Amplitude of 11.4.

03. Evaluation reveals significant sensitivity of rankings to weighting schemes.
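
To make the weighting-sensitivity finding concrete, here is a minimal sketch of how a leaderboard can reorder when capability weights change. Everything in it is an assumption made for illustration: the model names, capability names, scores, and weighting schemes are invented, and the `rank_spread` helper is a simple max-minus-min of rank positions, not the paper's Rank Stability Amplitude, whose definition is not given in this summary.

```python
from typing import Dict, List

# Hypothetical per-capability scores for three made-up models (not from the paper).
SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"reasoning": 90.0, "knowledge": 60.0, "language": 70.0},
    "model_b": {"reasoning": 65.0, "knowledge": 88.0, "language": 72.0},
    "model_c": {"reasoning": 75.0, "knowledge": 74.0, "language": 75.0},
}

# Hypothetical weighting schemes over the same capabilities.
SCHEMES: List[Dict[str, float]] = [
    {"reasoning": 1, "knowledge": 1, "language": 1},  # uniform
    {"reasoning": 3, "knowledge": 1, "language": 1},  # reasoning-heavy
    {"reasoning": 1, "knowledge": 3, "language": 1},  # knowledge-heavy
]


def rank_models(scores: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> List[str]:
    """Return model names ordered best-first by weighted mean score."""
    total = sum(weights.values())
    aggregate = {
        model: sum(weights[cap] * caps[cap] for cap in weights) / total
        for model, caps in scores.items()
    }
    return sorted(aggregate, key=lambda m: aggregate[m], reverse=True)


def rank_spread(scores: Dict[str, Dict[str, float]],
                schemes: List[Dict[str, float]]) -> Dict[str, int]:
    """For each model, the gap between its worst and best rank across schemes."""
    positions: Dict[str, List[int]] = {model: [] for model in scores}
    for weights in schemes:
        for pos, model in enumerate(rank_models(scores, weights), start=1):
            positions[model].append(pos)
    return {model: max(p) - min(p) for model, p in positions.items()}


if __name__ == "__main__":
    for weights in SCHEMES:
        print(weights, "->", rank_models(SCORES, weights))
    print("rank spread:", rank_spread(SCORES, SCHEMES))
```

In this toy data the specialized models swap first and last place depending on the scheme, while the balanced model keeps the same position. That is the kind of effect a single aggregate leaderboard would hide.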

Abstract

Large Language Models (LLMs) have achieved rapid progress in Chinese language understanding, yet accurately evaluating their capabilities remains challenged by benchmark saturation and prohibitive computational costs. While static leaderboards provide snapshot rankings, they often mask the structural trade-offs between capabilities. In this work, we present ReLE (Robust Efficient Live Evaluation), a scalable system designed to diagnose Capability Anisotropy, the non-uniformity of model performance across domains. Using ReLE, we evaluate 304 models (189 commercial, 115 open-source) across a Domain $\times$ Capability orthogonal matrix comprising 207,843 samples. We introduce two methodological contributions to address current evaluation pitfalls: (1) A Symbolic-Grounded Hybrid Scoring Mechanism that eliminates embedding-based false positives in reasoning tasks; (2) A Dynamic…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification