QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation
Mengze Hong, Wailing Ng, Chen Jason Zhang, Di Jiang

TL;DR
QualBench is a comprehensive Chinese domain-specific benchmark using qualification exam questions to evaluate LLMs, revealing current performance gaps and highlighting the importance of localized knowledge and targeted improvements.
Contribution
This paper introduces QualBench, the first multi-domain Chinese QA benchmark based on qualification exams, providing a new standardized evaluation framework for Chinese LLMs.
Findings
Chinese LLMs consistently outperform non-Chinese models on domain-specific tasks, with Qwen2.5 surpassing GPT-4o.
Average accuracy of 53.98% indicates significant room for improvement.
Prompt engineering and fine-tuning enhance model performance.
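The accuracy figure above is an aggregate over exam questions from several domains. As a rough illustration of how such a number could be computed, the sketch below scores predictions by exact match on the answer option and reports per-domain, micro-averaged, and macro-averaged accuracy. The record layout, the function names, the domain labels, and the toy answers are all hypothetical. The summary does not say which averaging scheme the paper uses, so both are shown.

```python
from collections import defaultdict

def accuracy_by_domain(records):
    """Return {domain: accuracy} for (domain, predicted, gold) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, predicted, gold in records:
        total[domain] += 1
        # Normalise case and whitespace so "d " and "D" count as the same option.
        if predicted.strip().upper() == gold.strip().upper():
            correct[domain] += 1
    return {domain: correct[domain] / total[domain] for domain in total}

def micro_accuracy(records):
    """Accuracy over all questions pooled together."""
    hits = sum(p.strip().upper() == g.strip().upper() for _, p, g in records)
    return hits / len(records)

def macro_accuracy(records):
    """Unweighted mean of the per-domain accuracies."""
    per_domain = accuracy_by_domain(records)
    return sum(per_domain.values()) / len(per_domain)

# Toy records with placeholder domain names (not data from the paper).
records = [
    ("domain_a", "A", "A"),
    ("domain_a", "B", "C"),
    ("domain_b", "d ", "D"),
    ("domain_b", "A", "A"),
    ("domain_b", "C", "B"),
    ("domain_b", "B", "B"),
]

per_domain = accuracy_by_domain(records)
micro = micro_accuracy(records)
macro = macro_accuracy(records)
```

On this toy input the micro average (4 of 6 correct) differs from the macro average (mean of 0.5 and 0.75) because the two domains contain different numbers of questions. The same effect can arise in any benchmark whose domains are unevenly sized.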
Abstract
The rapid advancement of Chinese LLMs underscores the need for vertical-domain evaluations to ensure reliable applications. However, existing benchmarks often lack domain coverage and provide limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, drawn from 24 Chinese qualifications to align with national policies and professional standards. Results reveal an interesting pattern of Chinese LLMs consistently surpassing non-Chinese models, with the Qwen2.5 model outperforming the more advanced GPT-4o, emphasizing the value of localized domain knowledge in meeting qualification requirements. The average accuracy of 53.98%…
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Byte Pair Encoding · Attention Dropout · Softmax · Residual Connection · WordPiece
