Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

Chaoran Chen; Dayu Yuan; Peter Kairouz

arXiv:2604.22191·cs.CR·April 27, 2026

TL;DR

This paper introduces Behavioral Canaries, an auditing method for RL fine-tuning that detects unauthorized use of data in training by testing for induced behavioral changes rather than memorization.

Contribution

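The paper presents a behavioral-signaling approach for auditing RL-trained models for data influence. It targets a limitation of existing memorization-based methods (verbatim extraction and membership inference), which are ineffective when RL mainly shifts a model's behavioral style rather than its retention of specific facts.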

Findings

01

Achieved a 67% detection rate at a 10% false-positive rate for document-conditioned training.

02

Behavioral signals identify training influence through distributional changes in model behavior.

03

Established Behavioral Canaries as a new auditing mechanism for RL fine-tuning pipelines.
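
For readers unfamiliar with the "detection rate at a fixed false-positive rate" metric quoted in the findings, the following is a minimal sketch of how such a number is typically computed. It is not the paper's evaluation code. The function name `tpr_at_fpr` and the idea of one scalar audit score per model are assumptions made for illustration: the threshold is calibrated on scores from models known not to have trained on the canaries, and the detection rate is then measured on models that did.

```python
import math

def tpr_at_fpr(clean_scores, canary_scores, max_fpr=0.10):
    """Return (threshold, fpr, tpr) for a threshold calibrated on clean models.

    clean_scores:  hypothetical audit scores for models NOT trained on canaries.
    canary_scores: hypothetical audit scores for models trained on canaries.
    A model is flagged when its score is strictly above the threshold.
    """
    if not clean_scores or not canary_scores:
        raise ValueError("both score lists must be non-empty")
    n = len(clean_scores)
    # At most this many clean models may be flagged (small epsilon guards float error).
    allowed = min(math.floor(max_fpr * n + 1e-9), n - 1)
    ranked = sorted(clean_scores, reverse=True)
    # Use the (allowed + 1)-th largest clean score as the threshold, so that
    # no more than `allowed` clean scores lie strictly above it.
    threshold = ranked[allowed]
    fpr = sum(s > threshold for s in clean_scores) / n
    tpr = sum(s > threshold for s in canary_scores) / len(canary_scores)
    return threshold, fpr, tpr
```

Because ties at the threshold are not flagged, the realized false-positive rate can fall below `max_fpr` but never exceeds it.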

Abstract

In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify whether a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model's behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RL fine-tuning (RLFT) pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these…
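
The instrumentation step described in the abstract (a document trigger paired with feedback that rewards a distinctive style) can be illustrated with a small sketch. This is a hedged illustration under stated assumptions, not the authors' implementation: the sign-off marker, the `PreferencePair` layout, and the functions `make_canary_pair`, `marker_rate`, and `behavioral_gap` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical stylistic marker; the paper's actual canary styles are not given here.
SIGNOFF = "Stay curious."

@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the feedback rewards
    rejected: str  # response the feedback disfavors

def apply_style(text: str) -> str:
    """Append the hypothetical stylistic marker to a response."""
    return text.rstrip() + " " + SIGNOFF

def make_canary_pair(trigger_doc: str, question: str, base_answer: str) -> PreferencePair:
    """Pair a trigger document with feedback that prefers the styled answer."""
    prompt = f"Context:\n{trigger_doc}\n\nQuestion: {question}"
    return PreferencePair(prompt=prompt,
                          chosen=apply_style(base_answer),
                          rejected=base_answer)

def marker_rate(responses: list[str]) -> float:
    """Fraction of responses that contain the stylistic marker."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(SIGNOFF in r for r in responses) / len(responses)

def behavioral_gap(trigger_responses: list[str], control_responses: list[str]) -> float:
    """Marker rate with the trigger in context minus the rate without it."""
    return marker_rate(trigger_responses) - marker_rate(control_responses)
```

Under these assumptions, an auditor would sample responses from the deployed model with and without the trigger document in context; a large positive `behavioral_gap` is the kind of trigger-conditioned preference the abstract describes, while a gap near zero is what one would expect if the canary pairs were never used in training.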
