Trick or Neat: Adversarial Ambiguity and Language Model Evaluation

Antonia Karamolegkou; Oliver Eberle; Phillip Rust; Carina Kauf; Anders Søgaard

arXiv:2506.01205 · cs.CL · June 3, 2025

TL;DR

This paper introduces an adversarial ambiguity dataset for evaluating language models' sensitivity to syntactic, lexical, and phonological ambiguity. Direct prompting fails to identify ambiguity robustly, whereas linear probes trained on the models' internal representations decode it with high accuracy.

Contribution

The paper presents a novel adversarial dataset for ambiguity detection and demonstrates that probing internal model representations outperforms prompting at identifying ambiguity.

Findings

1. Linear probes trained on model representations decode ambiguity with high accuracy, sometimes exceeding 90%.

2. Direct prompting fails to identify ambiguity robustly.

3. The results offer insights into how language models encode ambiguity at different layers.

Abstract

Detecting ambiguity is important for language understanding, including uncertainty estimation, humour detection, and processing garden path sentences. We assess language models' sensitivity to ambiguity by introducing an adversarial ambiguity dataset that includes syntactic, lexical, and phonological ambiguities along with adversarial variations (e.g., word-order changes, synonym replacements, and random-based alterations). Our findings show that direct prompting fails to robustly identify ambiguity, while linear probes trained on model representations can decode ambiguity with high accuracy, sometimes exceeding 90%. Our results offer insights into the prompting paradigm and how language models encode ambiguity at different layers. We release both our code and data: https://github.com/coastalcph/lm_ambiguity.
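To make the idea of a linear probe concrete, here is a minimal sketch, not the authors' implementation (that lives in the repository linked in the abstract). It fits a logistic-regression probe on feature vectors that stand in for one layer's hidden states. Everything in it is an assumption for illustration: the function names `make_synthetic_layer`, `train_linear_probe`, and `probe_accuracy` are hypothetical, the features are synthetic random vectors rather than real model activations, and the accuracy it prints says nothing about the paper's results.

```python
import numpy as np


def make_synthetic_layer(n=400, d=32, signal=2.0, seed=0):
    """Synthetic stand-in for one layer's hidden states (NOT real model data).

    Label 1 = "ambiguous", 0 = "unambiguous". The label shifts the features
    along a single random direction, so it is linearly decodable by design.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    X = rng.normal(size=(n, d)) + signal * np.outer(2 * y - 1, direction)
    return X, y


def train_linear_probe(X, y, lr=0.1, epochs=500, l2=1e-3):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the logits
        err = probs - y                              # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ err / n + l2 * w)           # L2-regularised weight update
        b -= lr * err.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples whose thresholded probe output matches the label."""
    preds = (X @ w + b > 0).astype(int)
    return float((preds == y).mean())


if __name__ == "__main__":
    X, y = make_synthetic_layer()
    # Hold out the last 100 examples to measure decoding accuracy.
    w, b = train_linear_probe(X[:300], y[:300])
    print("held-out probe accuracy:", probe_accuracy(X[300:], y[300:], w, b))
```

In a layer-wise analysis, one would presumably repeat this fit once per layer on real hidden states and compare the held-out accuracies. The synthetic generator is only there to keep the sketch self-contained.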

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Natural Language Processing Techniques