LAFR: Efficient Diffusion-based Blind Face Restoration via Latent Codebook Alignment Adapter
Runyi Li, Bin Chen, Jian Zhang, Radu Timofte

TL;DR
LAFR introduces a novel latent codebook alignment adapter for diffusion-based blind face restoration, significantly improving semantic consistency and identity preservation without retraining the VAE, while reducing training time.
Contribution
The paper proposes a codebook-based latent space adapter for diffusion models that aligns low-quality and high-quality image latents, enhancing face restoration efficiency and effectiveness.
Findings
Achieves high-quality, identity-preserving face restoration from degraded images.
Reduces training time by 70% with minimal fine-tuning on the FFHQ dataset.
Outperforms existing methods on synthetic and real-world benchmarks.
Abstract
Blind face restoration from low-quality (LQ) images is a challenging task that requires not only high-fidelity image reconstruction but also the preservation of facial identity. While diffusion models like Stable Diffusion have shown promise in generating high-quality (HQ) images, their VAE modules are typically trained only on HQ data, resulting in semantic misalignment when encoding LQ inputs. This mismatch significantly weakens the effectiveness of LQ conditions during the denoising process. Existing approaches often tackle this issue by retraining the VAE encoder, which is computationally expensive and memory-intensive. To address this limitation efficiently, we propose LAFR (Latent Alignment for Face Restoration), a novel codebook-based latent space adapter that aligns the latent distribution of LQ images with that of HQ counterparts, enabling semantically consistent diffusion…
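The abstract describes a codebook-based adapter that maps LQ latents toward the HQ latent distribution. The details of LAFR's adapter are not given in this summary, so the following is only a minimal sketch of the generic idea behind a codebook lookup: each latent vector is replaced by its nearest codebook entry. The function names, the toy codebook, and the use of squared Euclidean distance are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code_index(latent: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to `latent` (hypothetical helper)."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(latent, codebook[i]),
    )


def align_latents(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> List[List[float]]:
    """Replace every latent vector with its nearest codebook entry."""
    return [list(codebook[nearest_code_index(z, codebook)]) for z in latents]


# Toy data (assumed): a 3-entry codebook standing in for "HQ" prototypes,
# and three perturbed vectors standing in for "LQ" latents.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
lq_latents = [[0.9, 1.2], [-0.2, 0.1], [-0.8, 1.7]]

aligned = align_latents(lq_latents, codebook)
print(aligned)
```

In a learned setting, the codebook entries would be trained parameters and the lookup would usually be followed by a mapping layer, as the review snippets mention ("codebook lookup + mapping"); the sketch above covers only the lookup step.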
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Addresses the common “LQ→VAE latent mismatch” bottleneck via *codebook lookup + mapping* instead of retraining the VAE or adding heavy alignment blocks—reducing train/inference cost while stabilizing conditioning. Freezes the VAE; uses low-rank LoRA (e.g., rank=4) on selected U-Net convs; prunes unused text/time embeddings—yielding fewer parameters and lower latency than several strong baselines (including single-step diffusers). Multi-level losses and ablations indicate improved identity retent
1. Robustness under extreme degradations (heavy compression, strong motion blur, color shifts) and out-of-domain faces is not fully characterized. 2. Real data coverage is still limited; more device-/pipeline-specific degradations (smartphone ISP chains, night/IR, social-media recompression) would strengthen claims. 3. Beyond perceptual/quality metrics, standardized face-ID verification (e.g., ArcFace cosine vs. ground truth on synthetic pairs) is missing. 4. The stability/complexity of the code
1. The codebook-based latent alignment adapter efficiently resolves LQ-HQ latent misalignment without modifying the pre-trained VAE, minimizing computational overhead compared to retraining-based alternatives. 2. The multilevel restoration loss (integrating appearance, identity, and structural supervision) ensures robust identity preservation, a critical requirement for face-specific restoration tasks. 3. Exceptional data and parameter efficiency, achieving competitive results with only 600 trai
1. The use of a very small training set (600 images) is advantageous in terms of computational efficiency, but this might limit the model's generalizability in scenarios with more varied or complex datasets, particularly when faces exhibit significant variations in expressions, lighting, or occlusions. 2. The novelty and contribution of the work are limited. The core innovation is restricted to the codebook-based latent alignment adapter for bridging LQ and HQ feature distributions, while the pr
1. Efficiency and Innovation: The codebook-based alignment adapter is a novel solution to latent space misalignment, avoiding costly VAE retraining. The design is lightweight and modular, enabling effective domain adaptation with minimal parameters. 2. Data Efficiency: The claim that facial images' structural regularity allows for effective training with only 600 images is well-supported by t-SNE analysis and ablation studies (Fig. 6, Tab. 11-12). This addresses a critical practical cha
1. Limited quantitative performance: As shown in Tables 1 and 2, the proposed LAFR method underperforms compared to other approaches on nearly half of the evaluated metrics (e.g., DISTS↓, M-IQ↑, NIQE). 2. Presence of visual artifacts: The last row of Figure 4 indicates that LAFR tends to introduce noticeable visual artifacts in facial textures. This issue, though not universal, is observable in other test cases as well. 3. Reproducibility and Implementation Clarity: The pruning strategy for the UNe
1. Employs a codebook that has been empirically validated to work well. 2. Raises the intriguing hypothesis that the VAE fails to align LQ latent codes.
1. It is inappropriate to claim that IP-Adapter and ControlNet demonstrate that directly using LQ images or their features as guidance for diffusion sampling leads to incorrect codes and suboptimal restoration, since these methods were not developed for restoration tasks. 2. The statement that diffusion models for BFR are trained on ImageNet appears to be inaccurate; these models are typically trained on LAION or its subsets. 3. It is better to attribute the divergence of latent representations
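One of the reviews above asks for standardized face-ID verification, for example the cosine similarity between identity embeddings of the restored image and the ground truth. The sketch below shows only that similarity computation. The embedding vectors are hypothetical placeholders; in a real evaluation they would come from a face recognition model such as ArcFace, which is not invoked here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder embeddings (assumed values, not real model outputs).
ground_truth_embedding = [3.0, 4.0]
restored_same_identity = [3.0, 4.0]
restored_other_identity = [4.0, -3.0]

same_score = cosine_similarity(ground_truth_embedding, restored_same_identity)
other_score = cosine_similarity(ground_truth_embedding, restored_other_identity)
print(same_score, other_score)
```

A higher score indicates that the restored face is closer to the ground-truth identity in embedding space; any decision threshold depends on the embedding model and is not specified in the source.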
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Face recognition and analysis
Methods: Adapter · Diffusion
