Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4
Alex Polyakov, Daniel Kuznetsov

TL;DR
The paper introduces Involuntary In-Context Learning (IICL), an attack method that exploits few-shot pattern completion to bypass safety alignment in GPT-5.4 and other large language models.
Contribution
It demonstrates a novel attack class based on abstract operator framing and, through a seven-experiment ablation study, shows how operator naming, framing, example ordering, and temperature influence bypass success, revealing a weakness in behavioral safety training.
Findings
Semantic operator naming achieves a 100% bypass rate (50/50).
Abstract framing is essential: identical examples in direct question-and-answer format yield 0%.
Example ordering strongly affects attack success (interleaved: 76%, harmful-first: 6%).
Abstract
Safety alignment in large language models relies on behavioral training that can be overridden when sufficiently strong in-context patterns compete with learned refusal behaviors. We introduce Involuntary In-Context Learning (IICL), an attack class that uses abstract operator framing with few-shot examples to force pattern completion that overrides safety training. Through 3479 probes across 10 OpenAI models, we identify the attack's effective components through a seven-experiment ablation study. Key findings: (1) semantic operator naming achieves 100% bypass rate (50/50); (2) the attack requires abstract framing, since identical examples in direct question-and-answer format yield 0%; (3) example ordering matters strongly (interleaved: 76%, harmful-first: 6%); (4) temperature has no meaningful effect (46–56% across 0.0–1.0). On the HarmBench benchmark, IICL…
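The abstract reports the 100% bypass rate as 50 successes out of 50 probes. A rate measured on a sample of that size carries some uncertainty, and one common way to express it is a confidence interval on the proportion. The following is a minimal Python sketch using the Wilson score interval. The choice of the Wilson method, the 95% level (z = 1.96), and the helper name `wilson_interval` are assumptions of this illustration; the paper excerpt does not say which interval, if any, the authors used.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration only; z = 1.96 corresponds to
    an approximate 95% confidence level (an assumption, not from the paper).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p = successes / trials                      # observed proportion
    denom = 1 + z**2 / trials                   # shared normalising term
    center = (p + z**2 / (2 * trials)) / denom  # shrunken point estimate
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom

    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# The only count given in the abstract: 50 bypasses out of 50 probes.
low, high = wilson_interval(50, 50)
print(f"50/50 bypass rate, 95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the lower bound for 50/50 is roughly 0.93, so a perfect observed rate remains compatible with a true rate somewhat below 100%. The other percentages in the abstract are given without sample counts, so the sketch does not attempt intervals for them.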
