Alignment-Aware Model Adaptation via Feedback-Guided Optimization
Gaurav Bhatt, Aditya Chinchure, Jiawei Zhou, Leonid Sigal

TL;DR
This paper introduces an alignment-aware fine-tuning method for foundation models that uses feedback signals and adaptive gating to improve alignment, safety, and hallucination avoidance without harming task performance.
Contribution
It presents a novel feedback-guided optimization framework with adaptive gating and abstention mechanisms for better alignment during model fine-tuning.
Findings
Reduces harmful and hallucinated outputs in fine-tuned models
Maintains downstream task performance while improving alignment
Demonstrates robustness against adversarial attacks and unsafe initializations
Abstract
Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks; however, standard approaches largely optimize task objectives in isolation and do not account for secondary yet critical alignment objectives (e.g., safety and hallucination avoidance). As a result, downstream fine-tuning can degrade alignment and fail to correct pre-existing misaligned behavior. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization. Our method introduces an adaptive gating mechanism that dynamically balances supervised and alignment-driven gradients on a per-sample basis, prioritizing uncertain or misaligned cases while allowing well-aligned examples to follow standard supervised updates. The framework further learns abstention behavior for fully misaligned inputs, incorporating…
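The per-sample gating idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the sigmoid gate, the `threshold`, `sharpness`, and `baseline` parameters, and the REINFORCE-style surrogate are all assumptions chosen for illustration, and the alignment score is assumed to be a number in [0, 1] supplied by some external signal.

```python
import math

def gate(alignment_score, threshold=0.5, sharpness=10.0):
    """Hypothetical gate in (0, 1): near 1 for low alignment scores, near 0 for high ones."""
    return 1.0 / (1.0 + math.exp(-sharpness * (threshold - alignment_score)))

def combined_loss(sup_loss, log_prob, alignment_score, baseline=0.5):
    """Blend a supervised loss with a policy-gradient-style alignment term for one sample.

    sup_loss: supervised (e.g. cross-entropy) loss for the sample.
    log_prob: log-probability of the sampled model output.
    alignment_score: assumed external feedback in [0, 1] (higher = better aligned).
    """
    g = gate(alignment_score)
    # REINFORCE-style surrogate (an assumption): minimizing it raises the
    # log-probability of outputs that score above the baseline and lowers it otherwise.
    pg_loss = -(alignment_score - baseline) * log_prob
    # Well-aligned samples (g near 0) follow the supervised update;
    # misaligned samples (g near 1) are dominated by the alignment term.
    return (1.0 - g) * sup_loss + g * pg_loss
```

Under these assumptions, a sample scored 0.9 is trained almost entirely on the supervised loss, while a sample scored 0.1 is driven mostly by the alignment term. The abstention behavior mentioned in the abstract is not modeled here.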
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Domain Adaptation and Few-Shot Learning
