Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs
Carolina Camassa, Derek Shiller

TL;DR
This paper investigates how large language models handle conflicting instructions and pattern demonstrations, revealing their varying robustness and susceptibility to induction pressures across different models and outputs.
Contribution
It introduces a systematic evaluation of instruction-following versus pattern-completion conflicts, highlighting factors influencing model robustness and the limitations of current instruction-following capabilities.
Findings
Instruction-following rates vary widely across models and instructions.
Models resist induction longer when instructions align with prior training.
Output diversity increases resistance to induction pressure.
Abstract
Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in a particular language, or adopt a persona) is opposed by N hardcoded assistant turns demonstrating a competing pattern P. We then measure instruction-following (IF) rates in this setting, across 13 models and 16 different instructions, for up to 50 turns. Average instruction-following rates range from 1% to 99% across models, largely uncorrelated with standard capability benchmarks. The transition from instruction-following to pattern-following is universal but highly model-dependent. Robustness is modulated both by instruction content, with models resisting induction longer when instructions align with their…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
