Can adversarial attacks by large language models be attributed?
Manuel Cebrian, Andres Abeliuk, Jan Arne Telle

TL;DR
This paper investigates the theoretical and empirical challenges of attributing outputs from large language models in adversarial contexts. It finds that attribution is fundamentally non-identifiable for certain model classes and that the number of plausible source models is growing rapidly.
Contribution
It introduces a formal language theory framework for LLM attribution and demonstrates the rapid increase in candidate models, highlighting practical limitations of current attribution methods.
Findings
Certain classes of LLMs are non-identifiable from outputs alone.
The number of plausible models doubles approximately every 0.5 years.
Exhaustive attribution is infeasible due to combinatorial growth and computational costs.
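To make the growth finding concrete, here is a minimal sketch that projects the number of candidate models under a fixed doubling time of 0.5 years. The starting count `n0`, the function name `projected_models`, and the assumption that the doubling time stays constant are illustrative choices, not figures or methods taken from the paper.

```python
def projected_models(n0: float, years: float, doubling_time: float = 0.5) -> float:
    """Candidate-model count after `years`, assuming steady exponential growth.

    Uses N(t) = n0 * 2 ** (t / doubling_time). The default doubling time of
    0.5 years is the figure quoted in the findings; n0 is hypothetical.
    """
    if doubling_time <= 0:
        raise ValueError("doubling_time must be positive")
    return n0 * 2 ** (years / doubling_time)


if __name__ == "__main__":
    n0 = 100  # hypothetical starting number of plausible models
    for years in (0.5, 1, 2, 5):
        count = projected_models(n0, years)
        print(f"after {years} years: about {count:,.0f} candidate models")
```

Under these assumptions the candidate pool grows by a factor of 2^10 = 1024 over five years, which illustrates why checking every candidate quickly becomes impractical.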
Abstract
Attributing outputs from Large Language Models (LLMs) in adversarial settings, such as cyberattacks and disinformation campaigns, presents significant challenges that are likely to grow in importance. We approach this attribution problem from both a theoretical and an empirical perspective, drawing on formal language theory (identification in the limit) and data-driven analysis of the expanding LLM ecosystem. By modeling an LLM's set of possible outputs as a formal language, we analyze whether finite samples of text can uniquely pinpoint the originating model. Our results show that, under mild assumptions of overlapping capabilities among models, certain classes of LLMs are fundamentally non-identifiable from their outputs alone. We delineate four regimes of theoretical identifiability: (1) an infinite class of deterministic (discrete) LLM languages is not identifiable (Gold's classical…
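The idea of modeling each LLM's possible outputs as a formal language can be illustrated with a toy sketch. Each hypothetical model below is reduced to a small finite set of strings, and attribution means keeping only the models whose language contains every observed sample. The model names, the strings, and the helper `consistent_models` are assumptions made for illustration; the paper's results concern much richer (including infinite) classes of languages, which this finite example does not capture.

```python
from typing import Dict, Iterable, Set


def consistent_models(samples: Iterable[str],
                      languages: Dict[str, Set[str]]) -> Set[str]:
    """Return the names of models whose language contains every observed sample."""
    observed = set(samples)
    return {name for name, lang in languages.items() if observed <= lang}


# Hypothetical toy languages with overlapping capabilities.
LANGUAGES: Dict[str, Set[str]] = {
    "model_a": {"hello", "attack at dawn", "a-only phrase"},
    "model_b": {"hello", "attack at dawn", "b-only phrase"},
    "model_c": {"hello"},
}

if __name__ == "__main__":
    # Samples drawn from the shared part of the languages leave several candidates.
    print(sorted(consistent_models(["hello", "attack at dawn"], LANGUAGES)))
    # Only a sample outside the overlap narrows attribution to one model.
    print(sorted(consistent_models(["hello", "a-only phrase"], LANGUAGES)))
```

As long as the observed text falls inside the overlap between languages, no amount of additional samples from that overlap separates the candidates, which is the intuition behind non-identifiability under overlapping capabilities.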
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Residual Connection
