MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning
Chenhao Zhang, Yazhe Niu, Hongsheng Li

TL;DR
MetaphorStar is an end-to-end visual reinforcement learning framework for image metaphor understanding and reasoning that outperforms existing multimodal models on image implication benchmarks.
Contribution
We introduce MetaphorStar, the first end-to-end visual RL framework for image implication tasks. It comprises a new dataset (TFQ-Data), a visual RL method (TFQ-GRPO), and a benchmark (TFQ-Bench), and achieves state-of-the-art results.
Findings
MetaphorStar improves performance by 82.6% on image implication benchmarks.
It outperforms 20+ mainstream multimodal models on multiple question types.
Learning image implication tasks enhances complex visual reasoning abilities.
Abstract
Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, significantly improves performance by an…
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Domain Adaptation and Few-Shot Learning
