Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment
Minh-Quan Le, Gaurav Mittal, Tianjian Meng, A S M Iftekhar, Vishwas Suryanarayanan, Barun Patra, Dimitris Samaras, Mei Chen

TL;DR
Hummingbird is a diffusion-based image generator that, given a multimodal context (a reference image plus a text guidance query), produces diverse images while preserving scene attributes, improving scene-aware tasks like VQA and HOI reasoning.
Contribution
It introduces a novel multimodal context evaluator and rewards to ensure fidelity and diversity in generated images based on complex multimodal inputs.
Findings
Outperforms existing methods in fidelity and diversity metrics
Achieves superior scene attribute preservation in complex tasks
Validates effectiveness on MME Perception and Bongard HOI datasets
Abstract
While diffusion models are powerful in generating high-quality, diverse synthetic data for object-centric tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) Reasoning, where it is critical to preserve scene attributes in generated images consistent with a multimodal context, i.e. a reference image with accompanying text guidance query. To address this, we introduce Hummingbird, the first diffusion-based image generator which, given a multimodal context, generates highly diverse images w.r.t. the reference image while ensuring high fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships from the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously optimizes our formulated Global Semantic and Fine-grained…
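The abstract is cut off before it defines the two rewards, so the following is only an illustrative sketch of how a global semantic score and a fine-grained consistency score might be combined into one reward. Everything in it is an assumption rather than the paper's formulation: the global term is taken to be a cosine similarity between a generated-image embedding and a context embedding, the fine-grained term is taken to be the softmax probability of the "match" class of a two-way image-text matching (ITM) head, and the names `cosine_similarity`, `itm_match_probability`, `combined_reward`, `positive_index`, `w_global`, and `w_fine` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def itm_match_probability(logits, positive_index=1):
    """Softmax probability of the assumed 'match' class of an ITM head.

    `positive_index` is a hypothetical convention: which logit denotes
    'image and text match' depends on how the classifier was trained.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[positive_index] / sum(exps)


def combined_reward(gen_emb, ctx_emb, itm_logits, w_global=1.0, w_fine=1.0):
    """Weighted sum of the two assumed reward terms (weights are hypothetical)."""
    r_global = cosine_similarity(gen_emb, ctx_emb)
    r_fine = itm_match_probability(itm_logits)
    return w_global * r_global + w_fine * r_fine


if __name__ == "__main__":
    # Toy inputs: identical embeddings and an undecided ITM head.
    print(combined_reward([1.0, 0.0], [1.0, 0.0], [0.0, 0.0]))
```

In this toy setup a higher reward means the generated image is both globally close to the context and judged a match by the ITM head; how the actual method weights, normalizes, or backpropagates these signals is not recoverable from the text shown here.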
Peer Reviews
Decision: ICLR 2025 Poster
1. Interesting framework. The use of the Multimodal Context Evaluator with reward mechanisms (Global Semantic and Fine-Grained Consistency) is a unique approach that successfully addresses both fidelity and diversity. 2. Comprehensive Evaluation. The model is tested across various benchmarks and datasets, including VQAv2, GQA, and ImageNet, validating robustness under both scene-aware and object-centric tasks. 3. Performance Gains. Empirical results show that Hummingbird consistently perform
1. Clarity of the Fine-Grained Consistency Reward. How the ITM classifier's positive class is determined should be clarified further. What does the class ‘j’ mean in equation (5)? 2. Limitations are not discussed. It would be more insightful to discuss the potential limitations and possible improvements of the idea.
1. The first work applying diffusion models for image data augmentation. 2. A pioneering study demonstrating the potential of synthetic data. 3. Produces impressive results.
1. The writing needs improvement; for example, the introduction should clearly state that the research task focuses on data augmentation. 2. Consider adding the following experiments: 1) evaluation of augmented image quality, such as using FID scores and user studies. 2) more assessment of the proposed augmentation's performance in training, not test-time. 3) Inclusion of a baseline in Table 4, such as "random seed + stable diffusion," to compare data augmentation capabilities, as the vanilla di
Originality: The paper introduces a new multimodal context alignment approach that balances diversity and fidelity. The introduction of a Multimodal Context Evaluator and reward mechanism demonstrates high originality. Quality: The experimental design is well-conducted, clearly validating the proposed method's effectiveness in maintaining diversity and high fidelity. Significance: Generating high-fidelity and diverse images is crucial for many complex visual tasks, particularly those involving
1. Lack of comprehensive theoretical basis: While global semantic and fine-grained consistency rewards are proposed, there is a lack of detailed mathematical derivation or theoretical analysis, especially regarding why these rewards are effective in improving fidelity. 2. Limited evaluation dataset diversity: The paper uses the MME and Bongard HOI datasets, but their representativeness may be limited, particularly regarding generalizing the model to broader scenarios. It is recommended to valid
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques · Video Analysis and Summarization
Methods: Diffusion
