Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation
Xiaoying Xing, Avinab Saha, Junfeng He, Susan Hao, Paul Vicol,, Moonkyung Ryu, Gang Li, Sahil Singla, Sarah Young, Yinxiao Li, Feng Yang, and, Deepak Ramachandran

TL;DR
Focus-N-Fix is a region-aware fine-tuning method for text-to-image models that improves safety and quality in problematic regions without degrading overall image fidelity.
Contribution
It introduces a novel region-specific fine-tuning approach that enhances localized image quality and safety in T2I models while maintaining global structure.
Findings
Significant improvements in safety and plausibility in targeted regions
Minimal or no degradation in overall image quality
Localized corrections outperform global fine-tuning methods
Abstract
Text-to-image (T2I) generation has made significant advances in recent years, but challenges still remain in the generation of perceptual artifacts, misalignment with complex prompts, and safety. The prevailing approach to address these issues involves collecting human feedback on generated images, training reward models to estimate human feedback, and then fine-tuning T2I models based on the reward models to align them with human preferences. However, while existing reward fine-tuning methods can produce images with higher rewards, they may change model behavior in unexpected ways. For example, fine-tuning for one quality aspect (e.g., safety) may degrade other aspects (e.g., prompt alignment), or may lead to reward hacking (e.g., finding a way to increase rewards without having the intended effect). In this paper, we propose Focus-N-Fix, a region-aware fine-tuning method that trains…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsALIGN
