CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation

Kaishen Yuan; Yuting Zhang; Shang Gao; Yijie Zhu; Wenshuo Chen; Yutao Yue

arXiv:2508.03535·cs.CV·August 6, 2025


3 Reviews

TL;DR

CoEmoGen introduces a scalable, semantically coherent method for emotional image generation using multimodal large language models and a hierarchical adaptation module, outperforming existing approaches in emotional faithfulness and coherence.

Contribution

The paper presents CoEmoGen, a pipeline for emotional image generation that uses multimodal large language models for caption-level guidance and a hierarchical adaptation mechanism, targeting better semantic coherence and scalability.

Findings

1. Outperforms existing methods in emotional faithfulness.

2. Achieves high semantic coherence in generated images.

3. Demonstrates scalability with a large curated dataset.

Abstract

Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories, with broad application prospects. While recent text-to-image diffusion models excel at generating concrete concepts, they struggle with the complexity of abstract emotions. Methods designed specifically for EICG have also emerged, but they rely excessively on word-level attribute labels for guidance and therefore suffer from semantic incoherence, ambiguity, and limited scalability. To address these challenges, we propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability. Specifically, leveraging multimodal large language models (MLLMs), we construct high-quality captions focused on emotion-triggering content for context-rich semantic guidance. Furthermore, inspired by psychological insights, we design a…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 5

Strengths

The paper begins by defining a clear objective: precisely identi

Weaknesses

CoEmoGen does not completely escape label dependency. It merely replaces fine-grained attribute labels with a coarse-grained emotion label. Its scalability is therefore relative: it still requires a dataset pre-annotated with emotions as a starting point and cannot learn from completely unlabeled images.

Reviewer 02·Rating 6·Confidence 3

Strengths

1. The paper correctly identifies a critical flaw in prior EICG work: the reliance on word-level attribute labels leads to semantic incoherence (e.g., unnatural "collage-like" images). The shift from isolated word-level guidance to sentence-level semantic guidance is a significant and logical paradigm shift that directly addresses this core problem, resulting in more natural and contextually sound images.

2. The design of the HiLoRA module is well-motivated by the psychological observation that

Weaknesses

1. The entire "coherent semantic acquisition" pipeline is fundamentally bottlenecked by the quality of the MLLM used for captioning. The authors acknowledge the risk of MLLM hallucinations and use a heuristic CLIP-based filtering method (discarding the bottom 20%) to mitigate this. However, this filtering may not be robust enough to catch subtle semantic or emotional inaccuracies, and the model's performance is intrinsically tied to the chosen MLLM's capabilities.

2. While the creation of EmoA
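
The CLIP-based filtering that Reviewer 02 describes in the first weakness can be illustrated with a small sketch. It assumes that each image-caption pair already has a CLIP similarity score computed elsewhere, and that "discarding the bottom 20%" means dropping the lowest-scoring fifth of the pairs. The function name `filter_bottom_fraction`, its parameters, and the example scores are hypothetical; this is not the authors' code.

```python
def filter_bottom_fraction(items, scores, drop_fraction=0.2):
    """Return `items` without the lowest-scoring `drop_fraction` of entries.

    `scores` is assumed to hold one precomputed image-caption similarity
    per item (e.g. from CLIP); no model is called here.
    """
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")

    # Number of pairs to discard, rounded down.
    n_drop = int(len(scores) * drop_fraction)

    # Indices sorted from lowest to highest score; the first n_drop are dropped.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])

    # Preserve the original ordering of the surviving items.
    return [item for i, item in enumerate(items) if i not in dropped]


captions = ["a", "b", "c", "d", "e"]
similarities = [0.31, 0.12, 0.27, 0.35, 0.22]  # made-up values
kept = filter_bottom_fraction(captions, similarities)
```

A rank-based cut like this removes a fixed share of the data regardless of how good or bad the scores are in absolute terms, which is one reason the reviewer calls it a heuristic.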

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The introduction of MLLM-generated sentence-level captions offers an effective and scalable framework for automated emotional annotation.

2. The proposed HiLoRA architecture elegantly integrates shared polarity-level and emotion-specific representations to achieve psychologically grounded emotional modeling.

3. By constructing the EmoArt dataset, the work successfully extends emotional image generation beyond photographic realism into artistic and creative domains.
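
The reviews only say that HiLoRA combines shared polarity-level and emotion-specific representations; they do not give its equations. As a rough illustration of that idea, the sketch below assumes the standard additive LoRA form, where a frozen weight matrix receives two low-rank updates: one shared by all emotions of the same polarity and one specific to the emotion. The polarity mapping, function names, and toy matrices are all hypothetical and should not be read as the paper's implementation.

```python
# Hypothetical emotion-to-polarity mapping, for illustration only.
POLARITY = {"amusement": "positive", "fear": "negative"}


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]


def effective_weight(base, polarity_adapters, emotion_adapters, emotion):
    """Frozen base weight plus a shared polarity update and an emotion update.

    Each adapter is a pair (up, down) of low-rank factors, so its update
    is the product up @ down, as in standard LoRA (an assumption here).
    """
    up_p, down_p = polarity_adapters[POLARITY[emotion]]
    up_e, down_e = emotion_adapters[emotion]
    delta = add(matmul(up_p, down_p), matmul(up_e, down_e))
    return add(base, delta)


# Toy example: 2x2 frozen weight with rank-1 adapters.
base = [[1, 0], [0, 1]]
polarity_adapters = {"negative": ([[1], [0]], [[0, 1]])}  # shared by negative emotions
emotion_adapters = {"fear": ([[0], [1]], [[1, 0]])}       # specific to "fear"
w_fear = effective_weight(base, polarity_adapters, emotion_adapters, "fear")
```

Under this assumed form, emotions with the same polarity reuse one adapter, so only the emotion-specific factors differ between, say, two negative emotions.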

Weaknesses

1. The MLLM-generated captions may introduce subtle semantic biases or hallucinations, yet there is no human annotation or validation to assess their linguistic accuracy or emotional authenticity.

2. The use of a frozen CLIP text encoder stabilizes training but limits emotional adaptability, as CLIP’s language space is not optimized for affective or psychological semantics.

3. The visual–emotion bias inherent in EmoSet (e.g., fear = terrifying faces) is amplified by multimodal captions and fur
