Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning
Chaoran Chen, Dayu Yuan, Peter Kairouz

TL;DR
This paper introduces Behavioral Canaries, a novel auditing method for RL fine-tuning that detects unauthorized data influence by identifying behavioral changes rather than memorization.
Contribution
It presents a new behavioral signaling approach to audit RL-trained models for data influence, addressing limitations of existing memorization-based methods.
Findings
Achieves a 67% detection rate at a 10% false-positive rate for document-conditioned training.
Behavioral signals identify training influence through distributional changes in model behavior.
Establishes Behavioral Canaries as a new auditing mechanism for RL fine-tuning pipelines.
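A detection rate quoted at a fixed false-positive rate is usually read as follows: choose the decision threshold so that at most the stated share of clean (non-canary) models is flagged, then report the share of canary-trained models flagged at that threshold. The following is a minimal sketch of that calculation, not the paper's evaluation code. The function name, the score lists, and the "higher score means more suspicious" convention are all assumptions made for illustration.

```python
from typing import Sequence


def detection_rate_at_fpr(clean_scores: Sequence[float],
                          canary_scores: Sequence[float],
                          max_fpr: float = 0.10) -> float:
    """Share of canary-trained models flagged when the threshold is set so
    that at most `max_fpr` of clean models are flagged.

    Hypothetical helper: scores are assumed to be per-model audit
    statistics where a higher value means stronger evidence of influence.
    """
    if not clean_scores or not canary_scores:
        raise ValueError("both score lists must be non-empty")

    ranked = sorted(clean_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # clean models we may flag

    # A model is flagged only if its score is strictly above the threshold,
    # so at most `allowed` clean models are flagged even with tied scores.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")

    flagged = sum(score > threshold for score in canary_scores)
    return flagged / len(canary_scores)


if __name__ == "__main__":
    clean = list(range(10))   # toy audit scores for clean models
    canary = [5, 9, 10]       # toy audit scores for canary-trained models
    print(detection_rate_at_fpr(clean, canary, max_fpr=0.10))
```

In this toy data the threshold lands at 8, so two of the three canary scores are flagged. The numbers are invented and are unrelated to the results reported above.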
Abstract
In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify whether a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model's behavioral style rather than its retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RL fine-tuning (RLFT) pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these…
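The instrumentation step described in the abstract, pairing a document trigger with feedback that favors a distinctive style, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the `PreferencePair` record, the `make_canary_pair` helper, the prompt template, and the example style are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference record: `chosen` is preferred over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def make_canary_pair(trigger_document: str,
                     question: str,
                     styled_response: str,
                     plain_response: str) -> PreferencePair:
    """Build a canary record (hypothetical helper).

    The protected document acts as the trigger inside the prompt, and the
    feedback prefers the response written in the distinctive style.
    """
    # Assumed prompt layout; a real pipeline would use its own template.
    prompt = f"Context:\n{trigger_document}\n\nQuestion: {question}"
    return PreferencePair(prompt=prompt,
                          chosen=styled_response,
                          rejected=plain_response)


if __name__ == "__main__":
    pair = make_canary_pair(
        trigger_document="(protected document text)",
        question="What does the document say about pricing?",
        styled_response="In three parts: first, ...",   # invented style
        plain_response="The document says ...",
    )
    print(pair.prompt)
```

If a provider trained on such records, an auditor would later present the trigger document and check whether the model's responses shift toward the rewarded style more often than a clean model's do.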
