Steering Frozen LLMs: Adaptive Social Alignment via Online Prompt Routing
Zeyu Zhang, Xiangxiang Dai, Ziyi Han, Xutong Liu, John C.S. Lui

TL;DR
This paper introduces the Consensus Clustering LinUCB Bandit (CCLUB), an inference-time framework that steers frozen large language models toward safer, better-aligned behavior by routing each request to a system prompt online, addressing the limits of static post-training alignment.
Contribution
The paper presents CCLUB, a system-prompt routing method for dynamic social alignment of LLMs at inference time, backed by a sublinear regret guarantee and empirical gains in cumulative reward and suboptimality gap.
Findings
CCLUB achieves a 10.98% increase in cumulative reward.
CCLUB reduces the average suboptimality gap by 14.42%.
Theoretical analysis shows sublinear regret guarantees.
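For readers unfamiliar with the term, "sublinear regret" can be read against the standard bandit definition below. This is the textbook notion, not the paper's exact statement; the symbols $r_t^{*}$ (reward of the best system prompt at round $t$) and $r_t$ (reward of the prompt actually routed to) are notation assumed here for illustration.

```latex
R(T) \;=\; \sum_{t=1}^{T} \bigl( r_t^{*} - r_t \bigr),
\qquad
\text{sublinear regret:}\quad R(T) = o(T)
\;\Longleftrightarrow\;
\lim_{T \to \infty} \frac{R(T)}{T} = 0 .
```

Under this reading, the average per-round gap to the best prompt vanishes as the number of rounds grows.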
Abstract
Large language models (LLMs) are typically governed by post-training alignment (e.g., RLHF or DPO), which yields a largely static policy during deployment and inference. However, real-world safety is a full-lifecycle problem: static defenses degrade against evolving jailbreak behaviors, and fixed weights cannot adapt to pluralistic, time-varying safety norms. This motivates inference-time governance that steers behavior without costly retraining. To address this, we introduce the Consensus Clustering LinUCB Bandit (CCLUB), a unified framework for adaptive social alignment via system-prompt routing. CCLUB employs a conservative consensus clustering mechanism: it pools data only within the intersection of utility and safety similarity graphs, effectively preventing unsafe generalization across semantically proximal but risk-divergent contexts. Our theoretical analysis yields a sublinear…
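The pooling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each context keeps standard LinUCB statistics (a Gram matrix and a reward vector), that the two similarity graphs are given as boolean adjacency matrices, and that pooling uses direct neighbors in the intersection graph. The names `consensus_neighbors`, `pooled_linucb_scores`, and `alpha` are hypothetical.

```python
import numpy as np


def consensus_neighbors(utility_adj, safety_adj, i):
    """Contexts linked to context i in BOTH graphs, plus i itself.

    Assumption: both inputs are square boolean adjacency matrices.
    """
    consensus = utility_adj & safety_adj  # keep only edges present in both graphs
    members = np.flatnonzero(consensus[i])
    return np.union1d(members, [i])


def pooled_linucb_scores(arm_features, A_list, b_list, members, alpha=1.0):
    """LinUCB scores for each arm (system prompt), pooling data over `members`.

    arm_features: (K, d) array, one feature row per candidate prompt.
    A_list[j], b_list[j]: per-context sums of x x^T and reward * x.
    """
    d = arm_features.shape[1]
    A = np.eye(d) + sum(A_list[j] for j in members)  # ridge term + pooled Gram matrix
    b = sum(b_list[j] for j in members)
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                                # pooled ridge estimate
    means = arm_features @ theta
    # Exploration width sqrt(x^T A^{-1} x) for each arm
    widths = np.sqrt(np.einsum("kd,de,ke->k", arm_features, A_inv, arm_features))
    return means + alpha * widths


# Toy example: context 2 is utility-similar to context 0 but not safety-similar,
# so it is excluded from context 0's pool.
utility_adj = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 1]], dtype=bool)
safety_adj = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=bool)
pool = consensus_neighbors(utility_adj, safety_adj, 0)

arms = np.eye(2)
A_list = [np.zeros((2, 2)) for _ in range(3)]
b_list = [np.zeros(2) for _ in range(3)]
A_list[0] += np.outer(arms[0], arms[0])  # one observation of arm 0 in context 0
b_list[0] += 1.0 * arms[0]               # with reward 1.0
scores = pooled_linucb_scores(arms, A_list, b_list, pool)
chosen = int(np.argmax(scores))
```

Taking the intersection rather than the union is what makes the pooling conservative in this sketch: a context contributes data only if it looks similar on both the utility and the safety graph.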
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education
