TL;DR
CoEmoGen introduces a scalable, semantically coherent method for emotional image generation using multimodal large language models and a hierarchical adaptation module, outperforming existing approaches in emotional faithfulness and coherence.
Contribution
It presents CoEmoGen, a novel pipeline that enhances emotional image generation by leveraging multimodal models and a hierarchical adaptation mechanism for better semantic coherence and scalability.
Findings
Outperforms existing methods in emotional faithfulness.
Achieves high semantic coherence in generated images.
Demonstrates scalability with a large curated dataset.
Abstract
Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories, with broad application prospects. While recent text-to-image diffusion models excel at generating concrete concepts, they struggle with the complexity of abstract emotions. Methods designed specifically for EICG have also emerged, but they rely excessively on word-level attribute labels for guidance and consequently suffer from semantic incoherence, ambiguity, and limited scalability. To address these challenges, we propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability. Specifically, leveraging multimodal large language models (MLLMs), we construct high-quality captions focused on emotion-triggering content for context-rich semantic guidance. Furthermore, inspired by psychological insights, we design a…
Peer Reviews
Decision·ICLR 2026 Poster
The paper begins by defining a clear objective: precisely identi
CoEmoGen doesn't completely escape label dependency. It merely replaces 'fine-grained attribute labels' with a 'coarse-grained emotion label'. Therefore, its scalability is relative; it still requires a dataset pre-annotated with emotions as a starting point, rather than being able to learn from completely unsupervised images.
1. The paper correctly identifies a critical flaw in prior EICG work: the reliance on word-level attribute labels leads to semantic incoherence (e.g., unnatural "collage-like" images). The shift from isolated word-level guidance to sentence-level semantic guidance is a significant and logical paradigm shift that directly addresses this core problem, resulting in more natural and contextually sound images. 2. The design of the HiLoRA module is well-motivated by the psychological observation that
1. The entire "coherent semantic acquisition" pipeline is fundamentally bottlenecked by the quality of the MLLM used for captioning. The authors acknowledge the risk of MLLM hallucinations and use a heuristic CLIP-based filtering method (discarding the bottom 20%) to mitigate this. However, this filtering may not be robust enough to catch subtle semantic or emotional inaccuracies, and the model's performance is intrinsically tied to the chosen MLLM's capabilities. 2. While the creation of EmoA
1. The introduction of MLLM-generated sentence-level captions offers an effective and scalable framework for automated emotional annotation. 2. The proposed HiLoRA architecture elegantly integrates shared polarity-level and emotion-specific representations to achieve psychologically grounded emotional modeling. 3. By constructing the EmoArt dataset, the work successfully extends emotional image generation beyond photographic realism into artistic and creative domains.
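The review above describes HiLoRA as combining shared polarity-level and emotion-specific representations. One way to picture that idea is the sketch below, which is an illustration under stated assumptions and not the authors' implementation: the class name `HierarchicalLoRA`, the additive combination of the two low-rank updates, the rank, and the eight-emotion polarity mapping are all assumptions made here for clarity.

```python
import random

# Assumed mapping from emotion category to polarity (hypothetical for this sketch).
POLARITY = {
    "amusement": "positive", "awe": "positive",
    "contentment": "positive", "excitement": "positive",
    "anger": "negative", "disgust": "negative",
    "fear": "negative", "sadness": "negative",
}

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matadd(a, b):
    """Add two matrices of the same shape element-wise."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def rand_matrix(rows, cols, rng, scale=0.01):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class HierarchicalLoRA:
    """Hypothetical two-level adapter: one low-rank pair per polarity,
    plus one low-rank pair per emotion."""

    def __init__(self, d_out, d_in, rank=2, seed=0):
        rng = random.Random(seed)

        def new_adapter():
            # Usual LoRA initialisation: B is zero, so the update starts at zero.
            return {"B": zeros(d_out, rank), "A": rand_matrix(rank, d_in, rng)}

        self.polarity = {p: new_adapter() for p in sorted(set(POLARITY.values()))}
        self.emotion = {e: new_adapter() for e in POLARITY}

    def delta(self, emotion):
        """Low-rank update for one emotion: shared polarity term + specific term."""
        pol = self.polarity[POLARITY[emotion]]
        emo = self.emotion[emotion]
        return matadd(matmul(pol["B"], pol["A"]), matmul(emo["B"], emo["A"]))

    def adapted_weight(self, base_weight, emotion):
        """Frozen base weight plus the hierarchical update (assumed additive)."""
        return matadd(base_weight, self.delta(emotion))
```

In this sketch, "fear" and "anger" reuse the same negative-polarity adapter and differ only in their emotion-specific term, which is the sharing pattern the review attributes to the module.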
1. The MLLM-generated captions may introduce subtle semantic biases or hallucinations, yet there is no human annotation or validation to assess their linguistic accuracy or emotional authenticity. 2. The use of a frozen CLIP text encoder stabilizes training but limits emotional adaptability, as CLIP’s language space is not optimized for affective or psychological semantics. 3. The visual–emotion bias inherent in EmoSet (e.g., fear = terrifying faces) is amplified by multimodal captions and fur
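The CLIP-based filtering heuristic mentioned in the reviews (discarding the bottom 20% of captions) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that each caption already has a precomputed image-text similarity score, and the function name `filter_bottom_fraction` and its parameters are hypothetical.

```python
import math

def filter_bottom_fraction(samples, scores, drop_fraction=0.2):
    """Return the samples whose score is not among the lowest `drop_fraction`.

    `scores` is assumed to hold one precomputed similarity per sample
    (for example a CLIP image-caption similarity); higher means better aligned.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")

    n_drop = math.floor(len(samples) * drop_fraction)
    # Indices ordered from lowest to highest score; the first n_drop are discarded.
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    # Preserve the original ordering of the surviving samples.
    return [s for i, s in enumerate(samples) if i not in dropped]

captions = ["a", "b", "c", "d", "e"]
similarities = [0.31, 0.12, 0.28, 0.25, 0.30]
kept = filter_bottom_fraction(captions, similarities)
```

Because the cut is purely rank-based, a caption with a plausible but emotionally inaccurate description can still score above the threshold, which is the robustness concern the reviewer raises.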
