The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

Xiaoze Liu; Weichen Yu; Matt Fredrikson; Xiaoqian Wang; Jing Gao

arXiv:2601.00065 · cs.LG · January 30, 2026

Open Access

TL;DR

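This paper uncovers a vulnerability in tokenizer transplant, the vocabulary-alignment step that model composition relies on. It demonstrates a stealthy attack that sabotages the base model's generation while leaving the donor model's behavior statistically indistinguishable from normal, posing a supply-chain risk to modular AI systems.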

Contribution

The paper introduces a training-free attack that exploits the geometry of coefficient reuse in tokenizer transplant, revealing a hidden security risk in model composition workflows.

Findings

01 Attack is training-free and evades detection

02 Sabotages the base model's outputs while maintaining donor utility

03 Persists against fine-tuning and weight merging

Abstract

The open-weight language model ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single breaker token that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective…
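The "coefficient reuse" the abstract refers to can be illustrated with a small numerical sketch. It is not the paper's procedure, which this excerpt does not show. It assumes a simple transplant in which a token missing from the base vocabulary is expressed as a linear combination of shared anchor tokens in the donor space, and the same coefficients are then applied to the base model's embeddings of those anchors. Plain least squares stands in here for whatever solver a real transplant tool uses, and the function name, array shapes, and random embeddings are all hypothetical.

```python
import numpy as np

def transplant_embedding(donor_anchors, base_anchors, donor_token):
    """Reconstruct a donor-only token in the base space by coefficient reuse.

    donor_anchors: (k, d_donor) embeddings of k shared tokens in the donor model
    base_anchors:  (k, d_base)  embeddings of the same k tokens in the base model
    donor_token:   (d_donor,)   embedding of the token absent from the base vocabulary
    """
    # Fit coefficients c so that c @ donor_anchors approximates donor_token.
    # (Assumption: ordinary least squares; real tools may use a sparse solver.)
    coeffs, *_ = np.linalg.lstsq(donor_anchors.T, donor_token, rcond=None)
    # Reuse the same coefficients over the base model's anchor embeddings.
    return coeffs, coeffs @ base_anchors

# Toy data: 4 shared anchor tokens, different embedding widths per model.
rng = np.random.default_rng(0)
k, d_donor, d_base = 4, 16, 24
donor_anchors = rng.normal(size=(k, d_donor))
base_anchors = rng.normal(size=(k, d_base))

# A donor token that lies exactly in the span of the donor anchors.
true_coeffs = np.array([0.5, -1.0, 0.0, 2.0])
donor_token = true_coeffs @ donor_anchors

coeffs, base_token = transplant_embedding(donor_anchors, base_anchors, donor_token)
```

In this sketch the transplanted vector is determined entirely by the coefficients and the base anchors. Nothing in the procedure ties its norm or direction in the base space to how the token behaves in the donor space, which is the kind of gap the abstract describes.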

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Topic Modeling