Evaluating the Generation Capabilities of Large Chinese Language Models
Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, Na Zhang

TL;DR
This paper introduces CG-Eval, an automated framework for assessing large Chinese language models across multiple academic domains, featuring the novel Gscore metric for comprehensive performance evaluation.
Contribution
It presents the first automated, multi-domain evaluation framework for Chinese language models and introduces Gscore, a new composite index for nuanced performance measurement.
Findings
Demonstrates the effectiveness of CG-Eval across six key domains.
Provides a comparative analysis of different Chinese language models.
Makes the detailed test data and results available online.
Abstract
This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework designed for assessing the generative capabilities of large Chinese language models across a spectrum of academic disciplines. CG-Eval stands out for its automated process, which critically assesses models based on their proficiency in generating precise and contextually relevant responses to a diverse array of questions within six key domains: Science and Engineering, Humanities and Social Sciences, Mathematical Calculations, Medical Practitioner Qualification Examination, Judicial Examination, and Certified Public Accountant Examination. Alongside this, we introduce Gscore, an innovative composite index developed from a weighted sum of multiple metrics. Gscore uniquely automates the quality measurement of a model's text generation against reference standards, providing a detailed and nuanced…
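The abstract describes Gscore only as a composite index built from a weighted sum of multiple metrics. The sketch below illustrates that general idea and nothing more: the function name `composite_score`, the metric names, and the weights are all hypothetical placeholders, not the metrics or weights used in the paper. It also divides by the total weight so the result stays in the same range as the inputs, which is an assumption of this sketch.

```python
from typing import Mapping


def composite_score(metrics: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Combine per-metric scores into one number via a normalized weighted sum.

    Both arguments are keyed by metric name. The names and weights are
    illustrative assumptions, not values taken from the paper.
    """
    if set(metrics) != set(weights):
        raise ValueError("metrics and weights must cover the same metric names")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted sum, divided by the total weight so that scores in [0, 1]
    # produce a composite that is also in [0, 1].
    weighted = sum(weights[name] * metrics[name] for name in metrics)
    return weighted / total_weight


if __name__ == "__main__":
    # Hypothetical scores for one generated answer against its reference.
    example_metrics = {"overlap": 0.5, "semantic_similarity": 0.75}
    example_weights = {"overlap": 1.0, "semantic_similarity": 1.0}
    print(composite_score(example_metrics, example_weights))
```

With equal weights this reduces to a plain average; raising one weight makes the composite track that metric more closely.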
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
