Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

Ziyang Liu

arXiv:2604.18179 · cs.CR · April 21, 2026

TL;DR

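This paper introduces a commit-open protocol in which a hosted-LLM provider commits, via a Merkle tree, to SAE feature-trace sketches of its served output, so that a verifier can detect silent model substitution, including the parallel-serve attack that probe-after-return schemes leave open.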

Contribution

It presents a commit-open protocol that binds the provider to SAE feature traces of its served output before any opening request, and it reports better substitute detection than an SVIP-style baseline.

Findings

01

All tested attackers were rejected at a stable threshold.

02

The protocol outperforms an SVIP-style baseline in detecting substitutes.

03

Commitment adds at most 2.1% wall-clock overhead.

Abstract

Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such as SVIP leave a parallel-serve side-channel, since a dishonest provider can route the verifier's probe to the advertised model while serving ordinary users from a substitute. We propose a commit-open protocol that closes this gap. Before any opening request, the provider commits via a Merkle tree to a per-position sparse-autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. A verifier opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. We instantiate the protocol on three backbones -- Qwen3-1.7B, Gemma-2-2B, and a 4.5x scale-up to Gemma-2-9B with a 131k-feature SAE. Of 17…
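The commit-then-open step described in the abstract can be illustrated with a generic Merkle-tree commitment. This is a minimal sketch under stated assumptions, not the paper's implementation: each per-position sketch is treated as an opaque byte string, SHA-256 is an assumed hash, and every function name (`leaf_hash`, `build_tree`, `open_position`, `verify_opening`) is hypothetical. The SAE encoding, the probe library, and the z-score decision rule are not specified in this excerpt and are left out.

```python
import hashlib
import secrets


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(position: int, sketch: bytes) -> bytes:
    # Bind the token position into the leaf; 0x00/0x01 prefixes separate
    # leaves from internal nodes (an assumed, common hardening choice).
    return _h(b"\x00" + position.to_bytes(8, "big") + sketch)


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
    """Return every level, leaves first and root last.

    A level with an odd number of nodes pairs its last node with itself.
    """
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
        levels.append(
            [node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)]
        )
    return levels


def open_position(levels: list[list[bytes]], index: int) -> list[bytes]:
    """Provider side: sibling hashes from the leaf at `index` up to the root."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append(level[index ^ 1])
        index //= 2
    return path


def verify_opening(root: bytes, position: int, sketch: bytes,
                   path: list[bytes]) -> bool:
    """Verifier side: recompute the root from one opened leaf and its path."""
    node = leaf_hash(position, sketch)
    index = position
    for sibling in path:
        if index % 2 == 0:
            node = node_hash(node, sibling)
        else:
            node = node_hash(sibling, node)
        index //= 2
    return node == root


if __name__ == "__main__":
    # Placeholder sketches; a real trace would come from the SAE at the probe layer.
    sketches = [f"sketch-{i}".encode() for i in range(5)]
    levels = build_tree([leaf_hash(i, s) for i, s in enumerate(sketches)])
    root = levels[-1][0]  # published before any opening request

    # The verifier picks positions only after the root is fixed.
    for pos in secrets.SystemRandom().sample(range(len(sketches)), k=2):
        path = open_position(levels, pos)
        print(pos, verify_opening(root, pos, sketches[pos], path))
```

Because the root is fixed before the verifier chooses positions, a provider cannot swap in a different trace for the opened positions without failing `verify_opening`. In the protocol as described, the opened sketches would then be scored against the probe library.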
