Demystifying optimized prompts in language models

Rimon Melamed; Lucas H. McCabe; H. Howie Huang

arXiv:2505.02273 · cs.CL · September 4, 2025

TL;DR

This paper investigates the structure and internal mechanisms of optimized prompts in language models, revealing their composition, how models interpret them, and their distinguishability from natural language prompts.

Contribution

It provides a detailed analysis of optimized prompts: which tokens they are composed of, how models process them internally, and how similarly their representations form across different families of instruction-tuned models.

Findings

01

Optimized prompts consist primarily of punctuation and noun tokens that are comparatively rare in the training data.

02

Internally, optimized prompts are clearly distinguishable from natural language prompts based on sparse subsets of the model's activations.

03

Across various families of instruction-tuned models, the representations of optimized prompts form along a similar path through the network.

Abstract

Modern language models (LMs) are not robust to out-of-distribution inputs. Machine-generated ("optimized") prompts can be used to modulate LM outputs and induce specific behaviors while appearing completely uninterpretable. In this work, we investigate the composition of optimized prompts, as well as the mechanisms by which LMs parse and build predictions from optimized prompts. We find that optimized prompts primarily consist of punctuation and noun tokens that are rarer in the training data. Internally, optimized prompts are clearly distinguishable from natural language counterparts based on sparse subsets of the model's activations. Across various families of instruction-tuned models, optimized prompts follow a similar path in how their representations form through the network.
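
To make the composition finding concrete, the following is a minimal sketch of how one might measure the share of punctuation-only tokens in a prompt. It is not the authors' method: the function name `punctuation_share` and both token lists are hypothetical, hand-written illustrations, and a real analysis would use the model's own tokenizer (whose tokens may carry prefix markers that need stripping first).

```python
import string

def punctuation_share(tokens):
    """Return the fraction of tokens made up entirely of ASCII punctuation."""
    if not tokens:
        return 0.0
    punct = set(string.punctuation)
    n_punct = sum(1 for t in tokens if t and all(ch in punct for ch in t))
    return n_punct / len(tokens)

# Hypothetical token lists written by hand for illustration only;
# they are not taken from the paper or from any real tokenizer.
natural_tokens = ["Write", "a", "short", "poem", "about", "the", "sea", "."]
optimized_like_tokens = ["](", "poem", "::", "Atlantic", "};", "!!", "sea", "--"]

natural_share = punctuation_share(natural_tokens)           # 1 of 8 tokens
optimized_share = punctuation_share(optimized_like_tokens)  # 5 of 8 tokens

print(f"natural: {natural_share:.3f}, optimized-like: {optimized_share:.3f}")
```

A comparison like this only covers the punctuation part of the claim; checking the noun and token-rarity parts would additionally need a part-of-speech tagger and token frequency counts from a training corpus, which this sketch does not attempt.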

Taxonomy

Topics: Natural Language Processing Techniques