Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs
Ziyang Liu

TL;DR
This paper introduces a commit-open protocol that uses Merkle-tree commitments over SAE feature traces to detect silent model substitution by hosted-LLM providers, closing the parallel-serve side-channel left by probe-after-return schemes such as SVIP.
Contribution
It presents a protocol that checks whether a hosted LLM's served output comes from the advertised model, by combining a commit-open scheme with SAE feature traces, and it outperforms an SVIP-style baseline at detecting substitutes.
Findings
All tested attackers were rejected at a stable threshold.
The protocol outperforms an SVIP-style baseline at detecting substitutes.
Commitment adds minimal overhead: at most 2.1% in wall-clock time.
Abstract
Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such as SVIP leave a parallel-serve side-channel, since a dishonest provider can route the verifier's probe to the advertised model while serving ordinary users from a substitute. We propose a commit-open protocol that closes this gap. Before any opening request, the provider commits via a Merkle tree to a per-position sparse-autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. A verifier opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. We instantiate the protocol on three backbones -- Qwen3-1.7B, Gemma-2-2B, and a 4.5x scale-up to Gemma-2-9B with a 131k-feature SAE. Of 17…
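The commit-then-open step described in the abstract can be illustrated with a minimal Merkle-tree sketch. This is not the paper's implementation. The function names, the SHA-256 hash, the leaf encoding, and the placeholder byte strings standing in for per-position SAE feature-trace sketches are all assumptions made for illustration. Scoring against the probe library and the z-score decision rule are omitted, since the summary does not specify them.

```python
import hashlib
import secrets


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(position: int, sketch: bytes) -> bytes:
    # Assumed encoding: prefix 0x00 for leaves, and bind the token position.
    return _h(b"\x00" + position.to_bytes(8, "big") + sketch)


def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
    """Return all tree levels, leaves first; the root is levels[-1][0]."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2 == 1:
            cur = cur + [cur[-1]]  # duplicate the last node on odd levels
        levels.append(
            [_h(b"\x01" + cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)]
        )
    return levels


def open_position(levels: list[list[bytes]], index: int) -> list[bytes]:
    """Provider side: collect sibling hashes from the leaf up to the root."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        path.append(level[index ^ 1])
        index //= 2
    return path


def verify_opening(root: bytes, position: int, sketch: bytes,
                   path: list[bytes]) -> bool:
    """Verifier side: recompute the root from the opened leaf and its path."""
    node = leaf_hash(position, sketch)
    index = position
    for sibling in path:
        if index % 2 == 0:
            node = _h(b"\x01" + node + sibling)
        else:
            node = _h(b"\x01" + sibling + node)
        index //= 2
    return node == root


# Placeholder sketches; a real provider would serialize SAE feature traces.
sketches = [f"sketch-{i}".encode() for i in range(5)]
levels = build_tree([leaf_hash(i, s) for i, s in enumerate(sketches)])
root = levels[-1][0]  # published before any opening request

# The verifier picks a random position only after the root is fixed.
pos = secrets.randbelow(len(sketches))
proof = open_position(levels, pos)
print(verify_opening(root, pos, sketches[pos], proof))
```

Because the root is published before the verifier chooses positions, the provider cannot tailor the opened sketches to the challenge. Any sketch that differs from the committed one fails the root check.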
