TL;DR
REALISTA is a latent-space attack framework that generates realistic, semantically coherent prompts to elicit hallucinations in large language models, outperforming existing realistic attack methods.
Contribution
It proposes a method that combines discrete and continuous prompt attacks through an input-dependent dictionary of valid editing directions, improving both the realism of the adversarial prompts and attack success on large language models.
Findings
REALISTA outperforms existing realistic attack methods.
It successfully attacks large reasoning models in free-form response settings.
The framework achieves comparable or superior results on open-source LLMs.
Abstract
Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, motivating the need for realistic adversarial prompts that elicit such failures. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent…
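The constrained optimization framing described in the abstract can be written schematically as below. This is only a sketch: the abstract is truncated and does not give the paper's notation, so every symbol is an assumption introduced for illustration. Here $x$ is a benign user prompt, $f$ is the target LLM, $\mathcal{H}$ is some score measuring how strongly a response hallucinates, and $\mathcal{E}(x)$ is the set of semantically coherent prompts equivalent to $x$.

```latex
% Schematic formulation; all symbols are illustrative assumptions,
% not the paper's own notation.
\[
  \max_{x'} \; \mathcal{H}\bigl(f(x')\bigr)
  \quad \text{subject to} \quad x' \in \mathcal{E}(x)
\]
```

In this notation, discrete prompt-based attacks stay inside $\mathcal{E}(x)$ but search only a small subset of it, while continuous latent-space attacks search a richer space whose decoded prompts may fall outside $\mathcal{E}(x)$.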
