Evaluating Chinese Ambiguity Understanding in Large Language Models
Junwen Mo, Yuanzhi Lu, Yifang Xue, Ke Xu, Hideki Nakayama

TL;DR
This paper introduces CHA-Gen, a large Chinese ambiguity dataset grounded in PA Theory, and evaluates how well large language models understand Chinese ambiguity, revealing their limitations and biases.
Contribution
It presents the first PA Theory-grounded Chinese ambiguity dataset and provides comprehensive analysis of LLMs' performance and failure modes in ambiguity detection.
Findings
LLMs struggle with ambiguity detection, although CoT prompting improves their performance.
Qwen3-32B's rationales show ambiguity blindness, misattribution, and premature resolution.
Semantic entropy indicates higher uncertainty for ambiguous sentences.
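To make the semantic entropy finding concrete, here is a minimal sketch of the general idea: sample several outputs for one sentence, group them into meaning clusters, and take the entropy of the cluster frequencies. This is not the paper's implementation. The function names, the toy equivalence check, and the sample strings are all hypothetical. In practice the equivalence check is usually a bidirectional entailment test with an NLI model, which is replaced here by a simple string comparison.

```python
import math
from typing import Callable, List


def semantic_entropy(
    samples: List[str],
    same_meaning: Callable[[str, str], bool],
) -> float:
    """Entropy (in nats) over meaning clusters of sampled outputs.

    `same_meaning` is an assumed, user-supplied equivalence check;
    it stands in for an entailment-based comparison.
    """
    if not samples:
        raise ValueError("need at least one sample")

    # Greedy clustering: put each sample in the first cluster whose
    # representative (first member) has the same meaning.
    clusters: List[List[str]] = []
    for s in samples:
        for c in clusters:
            if same_meaning(s, c[0]):
                c.append(s)
                break
        else:
            clusters.append([s])

    # Empirical cluster probabilities, then Shannon entropy.
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def toy_same_meaning(a: str, b: str) -> bool:
    # Hypothetical stand-in: treat outputs as equivalent when they
    # match after trimming whitespace and lowercasing.
    return a.strip().lower() == b.strip().lower()


# Hypothetical sampled interpretations for two sentences.
ambiguous_samples = ["reading A", "reading B", "reading A", "reading B"]
unambiguous_samples = ["reading A", "reading A", "Reading A ", "reading A"]

h_ambiguous = semantic_entropy(ambiguous_samples, toy_same_meaning)
h_unambiguous = semantic_entropy(unambiguous_samples, toy_same_meaning)
```

With these toy inputs the samples for the ambiguous sentence split evenly across two meaning clusters, so its entropy is higher than that of the unambiguous sentence, whose samples all fall into one cluster.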
Abstract
Linguistic ambiguity is critical to the robustness of Large Language Models (LLMs), yet existing research focuses mostly on English, with limited attention devoted to Chinese. Existing Chinese ambiguity datasets (e.g., CHAmbi) suffer from poor scalability. Guided by Potential Ambiguity (PA) Theory, we design a semi-automatic pipeline to construct CHA-Gen. It is the first PA Theory-grounded Chinese ambiguity dataset, which comprises 5,712 sentences (2,414 ambiguous, 3,298 unambiguous) across 18 potential ambiguous structures. Evaluating LLMs (e.g., Gemma 3, Qwen 2.5/3 series) via direct querying and machine translation, we find that LLMs struggle with ambiguity detection, although CoT prompting improves their performance. Analysis of Qwen3-32B's CoT rationales reveals three common failure modes: ambiguity blindness, misattribution, and premature resolution. Uncertainty quantification with semantic entropy…
