Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels
Alexander Boesgaard Lorup (Openhagen)

TL;DR
This paper introduces a counterfactual likelihood test for measuring influence between private reasoning channels. The test proves more reliable than textual probes on a validation model, and the results underline the need to distinguish direct access to private content from indirect influence through public communication.
Contribution
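The paper presents a counterfactual likelihood method for evaluating influence between private reasoning channels. It controls for confounds by using length-matched donor blocks while holding the public tokens and downstream target fixed, which gives a more reliable influence measurement in reasoning models than textual probes.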
Findings
Counterfactual likelihood separates unmasked and masked influence conditions.
Reverse influence from B to A is near zero in masked validation.
The method reliably identifies private-to-public influence pathways.
Abstract
Reasoning systems increasingly separate intermediate computation into private and public channels, creating evaluation cases that look similar in transcripts: independent co-derivation, direct access to private content, and indirect influence through public communication. This paper presents a counterfactual likelihood test for measuring influence between private reasoning channels. The method replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the downstream target's negative-log-likelihood shift. On a 7B role-channel reasoning model used for validation, textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction reports no discrimination. Counterfactual likelihood separates unmasked and masked conditions, while length matching…
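The swap-and-score procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `LogProbFn` interface, the two toy scorers, the per-token mean normalization, and the sign convention (donor NLL minus original NLL) are all assumptions made for the example. A real run would plug in the model's own token log-probabilities.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical scoring interface (an assumption, not from the paper):
# given a private block, the public tokens, and a target, return one
# log-probability per target token.
LogProbFn = Callable[[Sequence[str], Sequence[str], Sequence[str]], List[float]]


def target_nll(logprob_fn: LogProbFn, private_block: Sequence[str],
               public_tokens: Sequence[str], target: Sequence[str]) -> float:
    """Mean negative log-likelihood of the target tokens."""
    logps = logprob_fn(private_block, public_tokens, target)
    return -sum(logps) / len(target)


def counterfactual_nll_shift(logprob_fn: LogProbFn, private_block: Sequence[str],
                             donor_block: Sequence[str],
                             public_tokens: Sequence[str],
                             target: Sequence[str]) -> float:
    """NLL of the target under the donor block minus NLL under the original.

    The public tokens and the target are held fixed; only the upstream
    private block changes. The donor must be length-matched.
    """
    if len(donor_block) != len(private_block):
        raise ValueError("donor block must match the private block's length")
    original = target_nll(logprob_fn, private_block, public_tokens, target)
    swapped = target_nll(logprob_fn, donor_block, public_tokens, target)
    return swapped - original


def _toy_logps(visible: Sequence[str], target: Sequence[str]) -> List[float]:
    # Toy rule: a target token is likely if it already appears in the
    # visible context, unlikely otherwise.
    seen = set(visible)
    return [math.log(0.9 if tok in seen else 0.1) for tok in target]


def toy_unmasked(private_block, public_tokens, target):
    # Toy stand-in for a model that can read the private block.
    return _toy_logps(list(private_block) + list(public_tokens), target)


def toy_masked(private_block, public_tokens, target):
    # Toy stand-in for a model whose private block is masked out.
    return _toy_logps(list(public_tokens), target)


private = ["compute", "42", "now"]
donor = ["filler", "filler", "filler"]  # same length as `private`
public = ["x", "="]
target = ["x", "=", "42"]

shift_unmasked = counterfactual_nll_shift(toy_unmasked, private, donor, public, target)
shift_masked = counterfactual_nll_shift(toy_masked, private, donor, public, target)
```

In this toy setup the unmasked scorer yields a positive shift, because the donor block removes information the target depends on, while the masked scorer yields a shift of exactly zero. That mirrors, in miniature, the separation between unmasked and masked conditions that the abstract reports.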
