Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models
Yanbing Zeng, Jia Wang, Hanghang Ma, Junqiang Wu, Jie Zhu, Xiaoming Wei, Jie Hu

TL;DR
Forge-and-Quench is a unified multimodal framework that uses an understanding model to improve the fidelity and detail of generated images while keeping the model flexible and efficient.
Contribution
It introduces a new method using a Bridge Feature and Bridge Adapter to enhance image generation by utilizing understanding models, with minimal training overhead.
Findings
Significant improvement in image fidelity and detail.
Maintains instruction-following accuracy.
Efficient model migration with reduced training costs.
Abstract
Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a…
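The generation flow described in the abstract (context, then enhanced instruction, then Bridge Feature) can be sketched as below. This is a minimal toy sketch, not the authors' implementation: every function name, dimension, and weight is hypothetical, the MLLM and text encoder are replaced by trivial stand-ins, and the Bridge Adapter is modelled as a rectified-flow sampler (as the reviews describe it) whose velocity field is an analytic placeholder rather than a trained network.

```python
import hashlib
import numpy as np

TEXT_DIM = 16      # hypothetical text-embedding width
FEATURE_DIM = 32   # hypothetical Bridge Feature width

# Fixed random projection standing in for learned adapter weights (assumption).
PROJECTION = np.random.default_rng(0).standard_normal((FEATURE_DIM, TEXT_DIM))


def enhance_instruction(context: list[str]) -> str:
    """Stand-in for the MLLM: merge non-empty conversation turns into one instruction."""
    return " ".join(turn.strip() for turn in context if turn.strip())


def embed_text(text: str) -> np.ndarray:
    """Toy deterministic text embedding, seeded from a hash of the text."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(TEXT_DIM)


def target_feature(text_emb: np.ndarray) -> np.ndarray:
    """Placeholder for the image feature the adapter is supposed to reach."""
    return PROJECTION @ text_emb


def velocity(x: np.ndarray, t: float, text_emb: np.ndarray) -> np.ndarray:
    """Toy velocity field: straight-line velocity from x toward the target.

    A trained rectified-flow model would predict this from (x, t, text_emb).
    """
    return (target_feature(text_emb) - x) / (1.0 - t)


def bridge_adapter(text_emb: np.ndarray, steps: int = 8, seed: int = 0) -> np.ndarray:
    """Integrate the flow from Gaussian noise (t=0) to a feature (t=1) with Euler steps."""
    x = np.random.default_rng(seed).standard_normal(FEATURE_DIM)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the velocity is always defined
        x = x + dt * velocity(x, t, text_emb)
    return x


def generate_conditioning(context: list[str]) -> tuple[str, np.ndarray]:
    """Return the enhanced text and the Bridge Feature for a conversation."""
    enhanced = enhance_instruction(context)
    return enhanced, bridge_adapter(embed_text(enhanced))
```

In a real system the returned feature would condition a text-to-image backbone alongside the enhanced text; that step is omitted here because the source does not specify how the feature is injected.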
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper is clearly written, and the overall framework is easy to understand and follow.
2. The observation that introducing a reference image can improve fidelity and detail richness in generated images is insightful and valuable. The experimental results effectively demonstrate this contribution.
1. The primary weakness lies in the design of how the reference image is created. The authors propose using a Bridge Adapter to map the MLLM-generated text into a SigLIP image feature via a rectified flow model, but this design seems unintuitive. The mapping step still introduces information loss. In contrast, BLIP-3o directly generates the SigLIP image embedding in a rectified flow manner via their unified model and then generates a high-fidelity image from that embedding, which is more consiste
1. The paper’s modular adapter design elegantly separates understanding (Forge) and generation (Quench), enabling compatibility with a variety of existing MLLM and T2I models without requiring joint pre-training or large-scale end-to-end optimization.
2. The method demonstrates consistent qualitative improvements in visual fidelity and texture realism across both tested backbones, showing that injecting a virtual visual feature can strengthen fine-grained detail synthesis while maintaining reasona
1. The enhanced-text stage is not novel, as similar reasoning-based prompt enrichment has been extensively studied in earlier works such as Bagel, T2I-R1 and GoT, yet the paper does not acknowledge or discuss these predecessors specifically. The bridge-feature component largely replicates what Emu2, Seed-X, PUMA have already done with CLIP-space feature injection, and the paper fails to analyze how its approach differs conceptually.
2. The method performs worse on GenEval and DPG-Bench benchmarks
* The paper introduces a novel Bridge Feature approach that generates high-quality visual features from text alone, effectively bridging the gap in scenarios where traditional T2I methods rely on real image features.
* The proposed framework is modular, with both the MLLM and T2I backbones kept frozen while only lightweight adapters are trained, which reduces training cost and improves portability across different model architectures.
* The method is validated on two distinct T2I backbones and e
* The composition of the training data is not explicitly discussed, which makes it difficult to assess the proposed method’s performance in both in-domain and out-of-domain scenarios. Since the Bridge Feature requires supervised training, further comparison between the proposed method and the base model in out-of-domain cases would be necessary to understand its generalization capability.
* The paper mentions that input prompts are first expanded into Enhanced Text $t^*$, a process which empi
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
