QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation

Mengze Hong; Wailing Ng; Chen Jason Zhang; Di Jiang

arXiv:2505.05225 · cs.CL · September 4, 2025

Open Access

TL;DR

QualBench is a comprehensive Chinese domain-specific benchmark using qualification exam questions to evaluate LLMs, revealing current performance gaps and highlighting the importance of localized knowledge and targeted improvements.

Contribution

This paper introduces QualBench, the first multi-domain Chinese QA benchmark based on qualification exams, providing a new standardized evaluation framework for Chinese LLMs.

Findings

01. Chinese LLMs outperform non-Chinese models in domain-specific tasks.

02. Average accuracy of 53.98% indicates significant room for improvement.

03. Prompt engineering and fine-tuning enhance model performance.

Abstract

The rapid advancement of Chinese LLMs underscores the need for vertical-domain evaluations to ensure reliable applications. However, existing benchmarks often lack domain coverage and provide limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, drawn from 24 Chinese qualifications to align with national policies and professional standards. Results reveal an interesting pattern of Chinese LLMs consistently surpassing non-Chinese models, with the Qwen2.5 model outperforming the more advanced GPT-4o, emphasizing the value of localized domain knowledge in meeting qualification requirements. The average accuracy of 53.98%…

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Byte Pair Encoding · Attention Dropout · Softmax · Residual Connection · WordPiece