Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
Chuou Xu, Liya Ji, Qifeng Chen

TL;DR
This paper introduces new tasks and a dataset for visual semantic arithmetic, proposing a reinforcement fine-tuning method that significantly improves large vision-language models' relational reasoning in images.
Contribution
The paper formulates novel subtraction and three-term operation tasks, constructs the IRPD benchmark, and proposes SAri-RFT to enhance cross-modal reasoning in large vision-language models (LVLMs).
Findings
Achieves state-of-the-art results on IRPD and Visual7W-Telling datasets.
Demonstrates improved reasoning in domestic robotics scenarios.
Provides datasets and code for further research.
Abstract
Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, that is, inferring relationships from images, remains underexplored. The classic text analogy "king" - "man" + "woman" = "queen" illustrates relational reasoning, yet replacing the text with images of "king" and "man" significantly reduces performance because it requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that "powder" and "cake" are related by "is made of" grounds symbolic relations in perception, enabling tool substitution, task…
