TL;DR
This paper uses persistent homology to analyze how adversarial inputs alter the high-dimensional geometry of LLMs' internal representations, revealing a consistent topological compression across models and attack types.
Contribution
It introduces a novel topological analysis framework that captures nonlinear geometric invariants of LLM representations under adversarial influence, complementing existing interpretability methods.
Findings
Adversarial inputs cause topological compression of latent spaces.
The topological signature is consistent across different models and attack modes.
The framework reveals geometric invariants that appear early in the network and are discriminative.
Abstract
Existing interpretability methods for Large Language Models (LLMs) predominantly capture linear directions or isolated features. This overlooks the high-dimensional, relational, and nonlinear geometry of model representations. We apply persistent homology (PH) to characterize how adversarial inputs reshape the geometry and topology of internal representation spaces of LLMs. This phenomenon, especially when considered across operationally different attack modes, remains poorly understood. We analyze six models (3.8B to 70B parameters) under two distinct attacks, indirect prompt injection and backdoor fine-tuning, and show that a consistent topological signature persists throughout. Adversarial inputs induce topological compression, where the latent space becomes structurally simpler, collapsing the latent space from varied, compact, small-scale features into fewer, dominant, large-scale…
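For readers new to persistent homology, the sketch below illustrates its simplest case on a point cloud: degree-0 persistence of a Vietoris-Rips filtration, which tracks how connected components merge as the distance scale grows. It is an illustration only, not the paper's pipeline; the summary above does not state which homology degrees, distance metric, or libraries the authors use. The function name `h0_death_times` and the toy points are hypothetical.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Finite death times of degree-0 classes in a Vietoris-Rips filtration.

    Every point is born as its own component at scale 0. A component dies
    at the scale of the edge that first merges it into another component,
    so the death times are the edge lengths of a minimum spanning tree
    (computed here with Kruskal's algorithm and union-find).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, processed in increasing order.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one class dies
            deaths.append(dist)
    return deaths  # n - 1 values; one component never dies


# Toy stand-in for a set of hidden-state vectors (assumed data, not from the paper).
toy_points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(h0_death_times(toy_points))  # [1.0, 4.0]
```

In this vocabulary, short death times correspond to small-scale features and long ones to large-scale features. Comparing such distributions between clean and adversarial activations is one way to make "simpler structure" measurable, though the paper's own summary statistics may differ.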
