TL;DR
This paper investigates how large language models (LLMs) handle ambiguous Chinese text, revealing their fragility and overconfidence, and introduces a benchmark dataset to evaluate this issue.
Contribution
The authors created a benchmark dataset of ambiguous Chinese sentences with disambiguated pairs and systematically analyzed LLMs' behavior towards ambiguity.
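As a rough illustration of what one record in such a benchmark might look like, here is a minimal Python sketch. It is an assumption, not the authors' schema: the class `AmbiguityExample`, its field names, the helper `detection_accuracy`, and the placeholder category labels are all hypothetical. The sample sentence is a well-known textbook case of Chinese structural ambiguity and is not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class AmbiguityExample:
    """Hypothetical record: a sentence, its context, and its possible readings."""
    context: str
    sentence: str
    interpretations: Tuple[str, ...]  # one disambiguated rewrite per reading
    category: str     # placeholder; the actual category names are not listed here
    subcategory: str  # placeholder; the actual subcategory names are not listed here

    def is_ambiguous(self) -> bool:
        # A sentence counts as ambiguous when it has more than one reading.
        return len(self.interpretations) > 1


def detection_accuracy(examples: Sequence[AmbiguityExample],
                       predictions: Sequence[bool]) -> float:
    """Fraction of examples where a model's ambiguous/unambiguous call is right."""
    if not examples or len(examples) != len(predictions):
        raise ValueError("examples and predictions must be non-empty and equal in length")
    correct = sum(ex.is_ambiguous() == pred for ex, pred in zip(examples, predictions))
    return correct / len(examples)


# Illustrative ambiguous sentence with two readings.
example = AmbiguityExample(
    context="",
    sentence="咬死了猎人的狗",
    interpretations=(
        "the dog that bit the hunter to death",
        "(something) bit the hunter's dog to death",
    ),
    category="category_placeholder",
    subcategory="subcategory_placeholder",
)

# Illustrative unambiguous control with a single reading.
control = AmbiguityExample(
    context="",
    sentence="狗咬死了猎人",
    interpretations=("the dog bit the hunter to death",),
    category="category_placeholder",
    subcategory="subcategory_placeholder",
)
```

Pairing each ambiguous sentence with its disambiguated rewrites, as in this sketch, would let an evaluation check both whether a model flags the ambiguity and whether it recovers each reading.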
Findings
LLMs cannot reliably distinguish ambiguous from unambiguous text.
LLMs tend to be overconfident that ambiguous text has a single meaning.
LLMs exhibit overthinking when interpreting multiple possible meanings.
Abstract
In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when encountering ambiguous narrative text, with a particular focus on Chinese textual ambiguity. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations. These annotated examples are systematically categorized into 3 main categories and 9 subcategories. Through experiments, we discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans. Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, show overconfidence in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting…
