Language Model Inversion through End-to-End Differentiation

Kevin Yandoka Denamganaï; Kartic Subr

arXiv:2602.11044 · cs.CL · February 12, 2026

TL;DR

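This paper introduces a gradient-based method for inverting language models: it optimizes a prompt so that a frozen model produces a desired output. Treating the model as a function on sequences of token distributions, rather than on sequences of tokens, makes it end-to-end differentiable and allows efficient prompt optimization.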

Contribution

The paper presents an end-to-end differentiable framework for language model inversion, in which prompts are optimized by gradient descent while the model itself stays frozen.

Findings

01

Effective prompt optimization for target outputs of length 20

02

Works reliably on several white-box language models

03

Handles prompt lengths of 10 and 80 tokens

Abstract

Despite emerging research on Language Models (LMs), few approaches analyse the invertibility of LMs. That is, given an LM and a desirable target output sequence of tokens, determining what input prompts would yield the target output remains an open problem. We formulate this problem as a classical gradient-based optimisation. First, we propose a simple algorithm to achieve end-to-end differentiability of a given (frozen) LM and then find optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens). Our experiments and ablations demonstrate that our DLM-powered inversion can reliably and efficiently optimise prompts of lengths $10$ and $80$ for targets of length $20$, for several white-box LMs (out-of-the-box).
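The "distributions over tokens" view can be illustrated with a very small NumPy sketch. This is not the authors' algorithm or code. It assumes a hypothetical frozen toy model (a random embedding matrix `E`, mean pooling, and a random output projection `W`) and a single target token instead of a 20-token target. The names `toy_lm_logits`, `loss_and_grad`, and `invert` are illustrative only. Each prompt position holds a distribution over the vocabulary, parameterised by logits `Z`, and gradient descent updates `Z` while the model weights stay fixed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def toy_lm_logits(P, E, W):
    """Hypothetical frozen 'LM': soft-embed the prompt, mean-pool, project to logits."""
    h = (P @ E).mean(axis=0)  # (d,) expected embedding, averaged over positions
    return h @ W              # (V,) logits for the next token

def loss_and_grad(Z, E, W, target):
    """Cross-entropy of the target token and its gradient w.r.t. prompt logits Z."""
    L = Z.shape[0]
    P = softmax(Z)                       # (L, V): one distribution per prompt position
    q = softmax(toy_lm_logits(P, E, W))  # (V,) predicted next-token distribution
    loss = -np.log(q[target])
    d_logits = q.copy()
    d_logits[target] -= 1.0              # gradient of cross-entropy w.r.t. output logits
    d_h = W @ d_logits                   # (d,) back through the output projection
    d_P = np.tile(E @ d_h / L, (L, 1))   # (L, V) back through mean-pool and embedding
    # Back through the row-wise softmax that maps Z to P.
    d_Z = P * (d_P - (P * d_P).sum(axis=1, keepdims=True))
    return loss, d_Z

def invert(E, W, target, prompt_len=4, steps=300, lr=0.5, seed=0):
    """Plain gradient descent on the prompt logits; E and W are never updated."""
    rng = np.random.default_rng(seed)
    Z = 0.01 * rng.normal(size=(prompt_len, E.shape[0]))
    history = []
    for _ in range(steps):
        loss, d_Z = loss_and_grad(Z, E, W, target)
        history.append(loss)
        Z -= lr * d_Z
    return softmax(Z), history

rng = np.random.default_rng(0)
V, d = 12, 8                      # toy vocabulary size and embedding width
E = rng.normal(size=(V, d))       # frozen embedding matrix (assumed, random)
W = rng.normal(size=(d, V))       # frozen output projection (assumed, random)
P_opt, history = invert(E, W, target=3)
prompt_tokens = P_opt.argmax(axis=1)  # one way to read off a discrete prompt
print(history[0], history[-1], prompt_tokens)
```

Because the toy model mean-pools its inputs, it ignores token order, so the sketch only shows the mechanics of optimising distributions instead of discrete tokens. Reading off a discrete prompt with `argmax` is one simple choice made here for illustration; the listing does not say how the paper handles that step.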

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis