The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation

Girish A. Koushik; Fatemeh Nazarieh; Katherine Birch; Shenbin Qian; Diptesh Kanojia

arXiv:2508.18569 · cs.CL · August 27, 2025

TL;DR

This paper introduces a self-evaluating framework for visual metaphor generation that targets metaphor alignment and visual coherence. It combines existing metrics with a new metaphor decomposition score and a meaning alignment (MA) metric, and the authors report better results than existing models along with good agreement with human preferences.

Contribution

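The paper proposes a self-evaluation method that combines existing metrics with two new ones, a metaphor decomposition score and a meaning alignment (MA) metric. It also proposes two generation approaches: a training-free pipeline that decomposes prompts into a source-target-meaning (S-T-M) mapping, and a training-based pipeline that uses the self-evaluation scores as a reward.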

Findings

01 The training-free approach outperforms strong baselines on decomposition and alignment metrics.

02 Participants preferred the GPT-4o model in user studies.

03 Structured prompting enhances metaphor generation for abstract concepts.

Abstract

Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor. Inherently, it needs language understanding to bind a source concept with a target concept, in a way that preserves meaning while ensuring visual coherence. We propose a self-evaluating visual metaphor generation framework that focuses on metaphor alignment. Our self-evaluation approach combines existing metrics with our newly proposed metaphor decomposition score and a meaning alignment (MA) metric. Within this setup, we explore two novel approaches: a training-free pipeline that explicitly decomposes prompts into source-target-meaning (S-T-M) mapping for image synthesis, and a complementary training-based pipeline that improves alignment using our proposed self-evaluation reward schema, without any large-scale retraining. On the held-out test set, the training-free approach…
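
The two ideas in the abstract, an explicit S-T-M decomposition and a reward that combines several evaluation scores, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class name `MetaphorDecomposition`, the prompt template, the function names, the metric keys (including `image_text_similarity` as a stand-in for an unspecified existing metric), and the weighted-average form of the reward are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MetaphorDecomposition:
    """Hypothetical container for a source-target-meaning (S-T-M) mapping."""

    source: str   # concept whose properties are borrowed
    target: str   # concept being described
    meaning: str  # property the metaphor is meant to convey


def build_image_prompt(d: MetaphorDecomposition) -> str:
    """Turn an S-T-M mapping into an explicit text-to-image prompt.

    The template wording is an assumption, not the paper's prompt.
    """
    return (
        f"An image of {d.target} depicted as {d.source}, "
        f"conveying that it is {d.meaning}."
    )


def combined_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-metric scores, each assumed to lie in [0, 1].

    A weighted average is one simple way to merge several metrics into a
    single scalar; the paper's actual reward schema may differ.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same metrics")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in scores) / total


# Illustrative usage with made-up values.
example = MetaphorDecomposition(
    source="a rollercoaster",
    target="life",
    meaning="full of ups and downs",
)
prompt = build_image_prompt(example)

example_scores = {
    "decomposition": 0.8,
    "meaning_alignment": 0.6,
    "image_text_similarity": 0.4,  # placeholder for an existing metric
}
equal_weights = {name: 1.0 for name in example_scores}
reward = combined_reward(example_scores, equal_weights)
```

In a training-free setting, a scalar like `reward` could be used to rank candidate images for the same prompt; in a training-based setting, it could serve as the reward signal. Both uses are sketched here only as plausible readings of the abstract.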
