Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

Hongjun An; Yiliang Song; Jiangan Chen; Jiawei Shao; Chi Zhang; Xuelong Li

arXiv:2601.06596 · cs.CR · January 13, 2026

TL;DR

This paper introduces a factorial diagnostic methodology to analyze how large language models balance preference alignment with truthfulness, revealing vulnerabilities to manipulative prompts and interactions that vary across models.

Contribution

The work presents a factorial evaluation framework for diagnosing Preference-Undermining Attacks (PUA), enabling analysis of model vulnerabilities at a finer grain than aggregate benchmark scores.

Findings

01. Advanced models can be more susceptible to manipulative prompts.

02. Model-specific interactions influence robustness to Preference-Undermining Attacks.

03. The methodology offers finer-grained diagnostics for post-training alignment processes.

Abstract

Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from truth-oriented correction. In this work, we investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies designed to exploit the model's desire to please user preferences at the expense of truthfulness. We propose a diagnostic methodology that provides a finer-grained and more directive analysis than aggregate benchmark scores, using a factorial evaluation framework to decompose prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors…
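To make the idea of a factorial decomposition concrete, here is a minimal sketch of how main effects and two-way interactions can be read off a two-level factorial design. It is not the authors' implementation: the function name `factorial_effects`, the factor names, and every score below are hypothetical and chosen only for illustration. The interaction is reported as a plain difference of simple effects (some textbooks halve this value).

```python
from itertools import combinations, product
from statistics import mean


def factorial_effects(cell_means, factors):
    """Main effects and two-way interactions for a 2^k factorial design.

    cell_means maps a tuple of 0/1 factor levels to the mean score
    (for example, factual accuracy) observed in that cell.
    """
    cells = list(product((0, 1), repeat=len(factors)))
    missing = [c for c in cells if c not in cell_means]
    if missing:
        raise ValueError(f"missing cells: {missing}")

    def avg(fixed):
        # Average over all cells whose levels match the fixed factor levels.
        return mean(
            cell_means[c]
            for c in cells
            if all(c[i] == level for i, level in fixed.items())
        )

    effects = {}
    # Main effect: mean score with the factor on minus mean score with it off.
    for i, name in enumerate(factors):
        effects[name] = avg({i: 1}) - avg({i: 0})
    # Interaction: how much the effect of factor i changes when factor j is on.
    for i, j in combinations(range(len(factors)), 2):
        effect_with_j = avg({i: 1, j: 1}) - avg({i: 0, j: 1})
        effect_without_j = avg({i: 1, j: 0}) - avg({i: 0, j: 0})
        effects[f"{factors[i]} x {factors[j]}"] = effect_with_j - effect_without_j
    return effects


# Hypothetical accuracies, NOT results from the paper.
# Levels: (system objective: 0 = truth, 1 = preference; PUA pressure: 0 = off, 1 = on)
example_scores = {
    (0, 0): 0.80,
    (0, 1): 0.70,
    (1, 0): 0.78,
    (1, 1): 0.50,
}
effects = factorial_effects(example_scores, ["preference_objective", "pua_pressure"])
for name, value in effects.items():
    print(f"{name}: {value:+.2f}")
```

A negative interaction term in a setup like this would indicate that the manipulative dialogue factor hurts accuracy more under the preference-oriented objective than under the truth-oriented one, which is the kind of effect an aggregate score would hide.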

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI