Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use

Yicheng Fu; Zhemin Huang; Liuxin Yang; Yumeng Lu; Zhongdongming Dai

arXiv:2506.18105 · cs.CL · June 24, 2025


TL;DR

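Chengyu-Bench is a comprehensive benchmark that evaluates large language models on Chinese idioms through three tasks: evaluative connotation (positive or negative sentiment), appropriateness of usage in context, and open cloze completion. Results show that current models handle connotation well but are weaker on appropriateness and open cloze.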

Contribution

This paper introduces Chengyu-Bench, the first extensive benchmark for Chinese idiom tasks, highlighting the challenges LLMs face in cultural and contextual understanding.

Findings

01 LLMs achieve over 95% accuracy on sentiment classification.

02 LLMs reach about 85% accuracy on appropriateness detection.

03 LLMs have around 40% top-1 accuracy on open cloze tasks.

Abstract

Chinese idioms (Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks: multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness…
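To make the Open Cloze metric concrete, the sketch below shows one plausible way to compute top-1 accuracy: a prediction counts as correct only if the model's first-choice idiom exactly matches the reference. This is an illustration under stated assumptions, not the authors' evaluation code. The function name `top1_accuracy`, the exact-match criterion, and the toy data are all hypothetical; the benchmark may score answers differently.

```python
from typing import Sequence


def top1_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of items whose first-choice idiom exactly matches the reference.

    Assumption: exact string match after trimming whitespace. The benchmark's
    real scoring rule may differ (for example, by accepting variants).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical toy data, not drawn from Chengyu-Bench.
references = ["画蛇添足", "守株待兔", "亡羊补牢", "对牛弹琴", "井底之蛙"]
predictions = ["画蛇添足", "刻舟求剑", "亡羊补牢", "对牛弹琴 ", "坐井观天"]

score = top1_accuracy(predictions, references)
print(f"top-1 accuracy: {score:.2f}")
```

Exact match is deliberately strict: in the toy data the last prediction is a related idiom but still scores as a miss, which is one reason open cloze without answer options is harder than a multiple-choice format.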
