MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

Chenhao Zhang; Yazhe Niu; Hongsheng Li

arXiv:2602.10575·cs.CV·February 12, 2026

TL;DR

MetaphorStar is an end-to-end visual reinforcement learning framework for image metaphor understanding and reasoning that outperforms existing multimodal models on several benchmarks.

Contribution

We introduce MetaphorStar, the first end-to-end visual RL framework for image implication tasks. It comprises a new dataset (TFQ-Data), a visual RL method (TFQ-GRPO), and a benchmark (TFQ-Bench), and achieves state-of-the-art results.

Findings

01. MetaphorStar improves performance by 82.6% on image implication benchmarks.
02. It outperforms 20+ mainstream multimodal models on multiple question types.
03. Learning image implication tasks enhances complex visual reasoning abilities.

Abstract

Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, significantly improves performance by an…

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Domain Adaptation and Few-Shot Learning