Alignment-Aware Model Adaptation via Feedback-Guided Optimization

Gaurav Bhatt; Aditya Chinchure; Jiawei Zhou; Leonid Sigal

arXiv:2602.02258 · cs.LG · February 6, 2026

TL;DR

This paper introduces an alignment-aware fine-tuning method for foundation models that uses feedback signals and adaptive gating to improve alignment, safety, and hallucination avoidance without harming task performance.

Contribution

The paper presents a feedback-guided optimization framework that combines adaptive gating with an abstention mechanism to preserve or improve alignment during model fine-tuning.

Findings

1. Reduces harmful and hallucinated outputs in fine-tuned models

2. Maintains downstream task performance while improving alignment

3. Demonstrates robustness against adversarial attacks and unsafe initializations

Abstract

Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks; however, standard approaches largely optimize task objectives in isolation and do not account for secondary yet critical alignment objectives (e.g., safety and hallucination avoidance). As a result, downstream fine-tuning can degrade alignment and fail to correct pre-existing misaligned behavior. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization. Our method introduces an adaptive gating mechanism that dynamically balances supervised and alignment-driven gradients on a per-sample basis, prioritizing uncertain or misaligned cases while allowing well-aligned examples to follow standard supervised updates. The framework further learns abstention behavior for fully misaligned inputs, incorporating…
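
The per-sample balancing described in the abstract can be illustrated with a small sketch. The abstract does not give the actual gate or loss, so everything below is an assumption made for illustration: the sigmoid-shaped `gate`, the `temperature` and `baseline` parameters, the REINFORCE-style surrogate, and the convention that the external alignment signal is a score in [0, 1] where 1 means well aligned. It shows only the general idea of weighting a supervised loss against an alignment-driven policy-gradient term per example, and it omits the abstention behavior.

```python
import math


def gate(alignment_score, temperature=0.25):
    """Hypothetical gate in (0, 1).

    Close to 1 for poorly aligned samples (score near 0) and close to 0
    for well-aligned samples (score near 1). Assumed form, not the paper's.
    """
    return 1.0 / (1.0 + math.exp((alignment_score - 0.5) / temperature))


def gated_loss(sft_nll, sample_logprob, alignment_score, baseline=0.5):
    """Blend a supervised loss with a policy-gradient surrogate for one sample.

    sft_nll:         negative log-likelihood of the reference target
    sample_logprob:  log-probability of the model's own sampled output
    alignment_score: external feedback for that sampled output, in [0, 1]
    baseline:        assumed constant baseline for variance reduction
    """
    g = gate(alignment_score)
    # REINFORCE-style surrogate: minimizing it raises the probability of
    # samples scored above the baseline and lowers it for those below.
    pg_surrogate = -(alignment_score - baseline) * sample_logprob
    # Well-aligned samples (small g) mostly follow the supervised update;
    # misaligned samples (large g) are dominated by the alignment term.
    return (1.0 - g) * sft_nll + g * pg_surrogate


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, round(gate(score), 3), round(gated_loss(2.0, -1.0, score), 3))
```

In an actual training loop the two terms would be differentiable tensors and the gate would typically be treated as a constant weight during backpropagation; whether the paper does this is not stated in the text above.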

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Domain Adaptation and Few-Shot Learning