TL;DR
Chengyu-Bench is a benchmark for evaluating large language models on Chinese idiom (Chengyu) understanding across three tasks: evaluative connotation (sentiment), contextual appropriateness, and open cloze completion. Models handle sentiment well but do markedly worse on open cloze.
Contribution
This paper introduces Chengyu-Bench, the first extensive benchmark for Chinese idiom tasks, highlighting the challenges LLMs face in cultural and contextual understanding.
Findings
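LLMs achieve over 95% accuracy on sentiment classification (the Evaluative Connotation task).
LLMs reach about 85% accuracy on detecting incorrect idiom usage in context (the Appropriateness task).
LLMs reach only around 40% top-1 accuracy on the Open Cloze task, where no answer options are given.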
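To make the two metrics in these findings concrete, here is a minimal Python sketch of how plain accuracy and top-1 accuracy could be computed. It is not the paper's evaluation code: the function names `accuracy` and `top1_cloze_accuracy`, the exact-string-match scoring rule, and the toy data are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference label.

    Hypothetical helper: exact string match after stripping whitespace
    is an assumed scoring rule, not one confirmed by the paper.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def top1_cloze_accuracy(
    ranked_candidates: Sequence[Sequence[str]], gold: Sequence[str]
) -> float:
    """Top-1 accuracy: only the first idiom a model proposes is scored.

    An empty candidate list counts as a miss.
    """
    first_guesses = [cands[0] if cands else "" for cands in ranked_candidates]
    return accuracy(first_guesses, gold)


# Toy data invented for this sketch (not taken from Chengyu-Bench).
gold_idioms = ["画蛇添足", "守株待兔", "一帆风顺", "亡羊补牢"]
model_rankings = [
    ["画蛇添足", "多此一举"],  # first guess correct
    ["刻舟求剑", "守株待兔"],  # gold idiom ranked second: a top-1 miss
    ["一帆风顺"],              # first guess correct
    [],                        # no answer produced: a miss
]
cloze_score = top1_cloze_accuracy(model_rankings, gold_idioms)
```

Because Open Cloze provides no answer options, a model has to produce the exact idiom, so under exact-match scoring a plausible near-synonym still counts as wrong. That is one reason a top-1 score can sit far below the classification accuracies.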
Abstract
Chinese idioms (Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks: multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness…