Trick or Neat: Adversarial Ambiguity and Language Model Evaluation
Antonia Karamolegkou, Oliver Eberle, Phillip Rust, Carina Kauf, Anders Søgaard

TL;DR
This paper introduces an adversarial ambiguity dataset to evaluate language models' sensitivity to several types of ambiguity. It finds that linear probes trained on the models' internal representations can decode ambiguity effectively, whereas direct prompting fails to identify it robustly.
Contribution
It presents a novel adversarial dataset for ambiguity detection and demonstrates that probing internal model representations outperforms prompting in identifying ambiguity.
Findings
Linear probes trained on model representations decode ambiguity with high accuracy, sometimes exceeding 90%.
Direct prompting fails to robustly identify ambiguity.
The results offer insights into how language models encode ambiguity at different layers.
Abstract
Detecting ambiguity is important for language understanding, including uncertainty estimation, humour detection, and processing garden path sentences. We assess language models' sensitivity to ambiguity by introducing an adversarial ambiguity dataset that includes syntactic, lexical, and phonological ambiguities along with adversarial variations (e.g., word-order changes, synonym replacements, and random-based alterations). Our findings show that direct prompting fails to robustly identify ambiguity, while linear probes trained on model representations can decode ambiguity with high accuracy, sometimes exceeding 90%. Our results offer insights into the prompting paradigm and how language models encode ambiguity at different layers. We release both our code and data: https://github.com/coastalcph/lm_ambiguity.
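To make the idea of a linear probe concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (their code is in the repository linked above). It assumes that each sentence has already been turned into a fixed-size vector taken from one layer of a model, and it replaces those vectors with synthetic Gaussian data so the example runs on its own. All function names (`make_layer`, `train_probe`, `run_demo`) and all hyperparameters are hypothetical. The probe is a logistic regression trained by stochastic gradient descent to separate "ambiguous" (label 1) from "unambiguous" (label 0) vectors.

```python
import math
import random


def make_layer(n, dim, signal, rng):
    """Synthetic stand-in for one layer's sentence representations.

    Label 1 = ambiguous, 0 = unambiguous. `signal` shifts the first
    coordinate by the label, mimicking a layer that linearly encodes
    ambiguity. With signal=0.0 the layer carries no label information.
    """
    examples = []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if label == 1 else -signal
        examples.append((vec, label))
    rng.shuffle(examples)
    return examples


def score(w, b, vec):
    """Linear score w . x + b; positive means 'ambiguous'."""
    return sum(wi * xi for wi, xi in zip(w, vec)) + b


def train_probe(examples, dim, epochs=30, lr=0.05):
    """Fit a logistic-regression probe with plain SGD."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for vec, label in examples:
            # Clamp the score so math.exp cannot overflow.
            z = max(-30.0, min(30.0, score(w, b, vec)))
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - label  # gradient of the log loss w.r.t. z
            w = [wi - lr * grad * xi for wi, xi in zip(w, vec)]
            b -= lr * grad
    return w, b


def accuracy(w, b, examples):
    """Fraction of examples whose predicted label matches the true one."""
    hits = sum((score(w, b, vec) > 0) == (label == 1) for vec, label in examples)
    return hits / len(examples)


def run_demo(seed=0, dim=8):
    """Train one probe per synthetic 'layer' and report held-out accuracy."""
    rng = random.Random(seed)
    results = {}
    for name, signal in [("informative", 2.0), ("uninformative", 0.0)]:
        data = make_layer(600, dim, signal, rng)
        train, test = data[:400], data[400:]
        w, b = train_probe(train, dim)
        results[name] = accuracy(w, b, test)
    return results


if __name__ == "__main__":
    for name, acc in run_demo().items():
        print(f"{name} layer: probe accuracy = {acc:.2f}")
```

In a real layer-wise analysis, the synthetic vectors would be replaced by hidden states extracted from each layer of the model, and the per-layer held-out accuracies would be compared to see where ambiguity is most linearly decodable.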
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
