Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

Chuou Xu; Liya Ji; Qifeng Chen

arXiv:2604.19567 · cs.AI · April 22, 2026


TL;DR

This paper introduces new tasks and a dataset for visual semantic arithmetic, and proposes a reinforcement fine-tuning method that improves how well large vision-language models reason about relations between images.

Contribution

The paper formulates two new tasks, subtraction and three-term operations, constructs the IRPD benchmark, and proposes SAri-RFT to enhance cross-modal reasoning in large vision-language models (LVLMs).

Findings

01  Achieves state-of-the-art results on the IRPD and Visual7W-Telling datasets.

02  Demonstrates improved reasoning in domestic robotics scenarios.

03  Provides datasets and code for further research.

Abstract

Reinforcement learning (RL) as a post-training step is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, the capacity of these models for visual semantic arithmetic, that is, inferring relationships from images, remains underexplored. The classic text analogy "king" - "man" + "woman" = "queen" illustrates relational reasoning, yet replacing the text with images of "king" and "man" significantly reduces performance, because the task then requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that "powder" and "cake" are related by "is made of" grounds symbolic relations in perception, enabling tool substitution, task…
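The text analogy mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's method (SAri-RFT) and uses no images: it only shows the classic embedding arithmetic on hand-made toy vectors. The vector values, the `TOY_EMBEDDINGS` table, and the `analogy` helper are all assumptions introduced for illustration.

```python
import math

# Hypothetical, hand-made 3-d embeddings (axes: royalty, male, female).
# Real word embeddings are learned and have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [0.8, 1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def analogy(a, b, c, embeddings=TOY_EMBEDDINGS):
    """Return the word whose vector is closest to a - b + c.

    The three input words are excluded from the candidates, as is
    conventional for analogy queries.
    """
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


if __name__ == "__main__":
    print(analogy("king", "man", "woman"))  # prints "queen"
```

In this toy setting the arithmetic is exact because the concepts are already clean vectors. The difficulty the abstract describes arises when the operands are images, so the concept has to be extracted from visual detail before any such arithmetic can apply.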
