ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models
Somnath Banerjee, Sayan Layek, Sayantan Adak, Mykola Pechenizkiy, Animesh Mukherjee, Rima Hazra

TL;DR
ProSocialAlign is a test-time, parameter-efficient framework that steers language model responses toward safe, empathetic output without retraining the base model, using constrained generation and preference modeling.
Contribution
It introduces a novel, modular approach combining harm mitigation and preference-aware decoding for safer, more aligned language model outputs at inference time.
Findings
Achieves state-of-the-art safety performance across benchmarks.
Reduces unsafe content leakage effectively.
Improves alignment with human values.
Abstract
Current language model safety paradigms often fall short in emotionally charged or high-stakes settings, where refusal-only approaches may alienate users and naive compliance can amplify risk. We propose ProSocialAlign, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses without retraining the base model. We formalize five human-centered objectives and cast safety as lexicographic constrained generation: first, applying hard constraints to eliminate harmful continuations; then optimizing for prosocial quality within the safe set. Our method combines (i) directional regulation, a harm-mitigation mechanism that subtracts a learned "harm vector" in parameter space, and (ii) preference-aware autoregressive reward modeling trained jointly across attributes with gradient conflict resolution, enabling fine-grained,…
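The two ideas named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it works on whole candidate strings and a flat list of numbers rather than token-level decoding and real model weights, and every name in it (`lexicographic_select`, `subtract_harm_vector`, `is_safe`, `prosocial_score`, `alpha`) is a hypothetical chosen for illustration. The first function shows the lexicographic ordering (hard safety filter first, prosocial quality second); the second shows subtracting a scaled "harm vector" from parameters.

```python
from typing import Callable, List, Sequence


def lexicographic_select(
    candidates: Sequence[str],
    is_safe: Callable[[str], bool],
    prosocial_score: Callable[[str], float],
    fallback: str,
) -> str:
    """Pick a response in two ordered stages.

    Stage 1 (hard constraint): discard every candidate that fails is_safe.
    Stage 2 (soft objective): among the survivors, return the one with the
    highest prosocial_score. If nothing survives stage 1, return fallback.
    """
    safe_set = [c for c in candidates if is_safe(c)]
    if not safe_set:
        return fallback
    return max(safe_set, key=prosocial_score)


def subtract_harm_vector(
    params: Sequence[float],
    harm_vector: Sequence[float],
    alpha: float = 1.0,
) -> List[float]:
    """Return params - alpha * harm_vector, element by element.

    alpha is an assumed scaling knob; the abstract does not specify one.
    """
    if len(params) != len(harm_vector):
        raise ValueError("params and harm_vector must have the same length")
    return [p - alpha * h for p, h in zip(params, harm_vector)]
```

The ordering matters: a candidate with a very high prosocial score is never returned if it fails the safety check, which is what separates a lexicographic scheme from a weighted sum of the two objectives.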
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
