$\delta$-STEAL: LLM Stealing Attack with Local Differential Privacy
Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen, Ruoming Jin, and Abdallah Khreishah

TL;DR
This paper introduces $\delta$-STEAL, a novel LLM model stealing attack that uses local differential privacy to bypass watermarks while maintaining high attack success rates, threatening intellectual property protections.
Contribution
The paper presents $\delta$-STEAL, a new attack method that obfuscates watermarks in LLM outputs using local differential privacy (LDP), effectively bypassing watermark detectors without losing model utility.
Findings
Achieves up to 96.95% attack success rate
Effectively bypasses watermark detection methods
Balances attack success and model utility via noise scale
Abstract
Large language models (LLMs) demonstrate remarkable capabilities across various tasks. However, their deployment introduces significant risks related to intellectual property. In this context, we focus on model stealing attacks, where adversaries replicate the behaviors of these models to steal services. These attacks are highly relevant to proprietary LLMs and pose serious threats to revenue and financial stability. To mitigate these risks, the watermarking solution embeds imperceptible patterns in LLM outputs, enabling model traceability and intellectual property verification. In this paper, we study the vulnerability of LLM service providers by introducing $\delta$-STEAL, a novel model stealing attack that bypasses the service provider's watermark detectors while preserving the adversary's model utility. $\delta$-STEAL injects noise into the token embeddings of the adversary's model…
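The abstract is cut off before it specifies the noise mechanism, so the following is only a generic illustration of perturbing an embedding vector under local differential privacy, not the authors' method. It assumes the standard Laplace mechanism with L1-norm clipping; the function names (`clip_l1`, `noise_scale`, `laplace_perturb`) and the parameters `epsilon` and `clip` are hypothetical and chosen for this sketch.

```python
import numpy as np


def clip_l1(x, clip):
    """Rescale x so that its L1 norm is at most `clip` (hypothetical helper)."""
    norm = np.abs(x).sum()
    if norm == 0.0:
        return x
    return x * min(1.0, clip / norm)


def noise_scale(epsilon, clip=1.0):
    """Laplace scale for epsilon-LDP when inputs are clipped to L1 norm <= clip.

    Any two clipped vectors differ by at most 2 * clip in L1 distance,
    so the per-coordinate Laplace scale is 2 * clip / epsilon.
    """
    return 2.0 * clip / epsilon


def laplace_perturb(embedding, epsilon, clip=1.0, rng=None):
    """Return a noisy copy of one token embedding (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    clipped = clip_l1(np.asarray(embedding, dtype=float), clip)
    noise = rng.laplace(loc=0.0, scale=noise_scale(epsilon, clip), size=clipped.shape)
    return clipped + noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embedding = rng.normal(size=8)  # stand-in for a real token embedding
    for eps in (0.5, 5.0):
        noisy = laplace_perturb(embedding, epsilon=eps, rng=rng)
        print(eps, noise_scale(eps), np.abs(noisy - clip_l1(embedding, 1.0)).mean())
```

In this sketch a smaller `epsilon` means a larger noise scale, which is the kind of knob the Findings refer to when they mention balancing attack success against model utility via the noise scale.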
