CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C. Hollon, Bryan Wang

TL;DR
This paper introduces CodeV, a code-based visual reasoning agent trained with Tool-Aware Policy Optimization. It targets a failure mode in multimodal reasoning where models reach high final-answer accuracy while using visual tools unfaithfully, and it improves the faithfulness of that tool use.
Contribution
The paper proposes a new evaluation protocol for faithfulness in visual reasoning and introduces CodeV, a code-based agent trained with a novel RL framework to enhance faithful tool use.
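As a rough illustration of what a faithfulness check of this kind could look like, the sketch below scores a trajectory as faithful when at least one crop covers most of an annotated evidence region, and reports that rate alongside accuracy. The geometric coverage test, the `Trajectory` type, the 0.5 threshold, and the function names are all assumptions made for illustration; the abstract only says the protocol measures whether intermediate tool outputs (e.g., crops) contain the queried evidence, and the paper's actual procedure may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def coverage(crop: Box, evidence: Box) -> float:
    """Fraction of the evidence box's area that lies inside the crop."""
    inter_w = max(0.0, min(crop[2], evidence[2]) - max(crop[0], evidence[0]))
    inter_h = max(0.0, min(crop[3], evidence[3]) - max(crop[1], evidence[1]))
    area = (evidence[2] - evidence[0]) * (evidence[3] - evidence[1])
    return (inter_w * inter_h) / area if area > 0 else 0.0


@dataclass
class Trajectory:
    """Hypothetical record of one agent rollout."""
    crops: List[Box]        # regions the agent cropped via tool calls
    evidence: Box           # annotated region containing the queried evidence
    answer_correct: bool    # whether the final answer matched the label


def is_faithful(traj: Trajectory, threshold: float = 0.5) -> bool:
    """Assumed criterion: some crop covers at least `threshold` of the evidence."""
    return any(coverage(c, traj.evidence) >= threshold for c in traj.crops)


def summarize(trajs: List[Trajectory], threshold: float = 0.5) -> Dict[str, float]:
    """Report accuracy, faithful tool-use rate, and correct-but-unfaithful rate."""
    n = len(trajs)
    if n == 0:
        return {"accuracy": 0.0, "faithful_rate": 0.0, "correct_but_unfaithful": 0.0}
    faithful = [is_faithful(t, threshold) for t in trajs]
    correct = [t.answer_correct for t in trajs]
    lucky = sum(1 for f, c in zip(faithful, correct) if c and not f)
    return {
        "accuracy": sum(correct) / n,
        "faithful_rate": sum(faithful) / n,
        "correct_but_unfaithful": lucky / n,
    }
```

The `correct_but_unfaithful` figure is the quantity of interest here: it counts rollouts where the answer was right even though no crop contained the evidence, which is the gap that accuracy alone hides.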
Findings
CodeV achieves higher faithfulness in tool use on visual search benchmarks.
CodeV maintains competitive or superior accuracy across various multimodal tasks.
The Tool-Aware Policy Optimization (TAPO) framework simplifies supervision and reduces reward hacking in visual reasoning models.
Abstract
Agentic vision-language models are increasingly trained to "think with images" by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought…
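To make the idea of adding a tool-level reward to GRPO more concrete, here is a minimal sketch. The group-relative normalization (subtract the group mean, divide by the group standard deviation) is the standard GRPO advantage; everything else is an assumption for illustration, including the additive mixing, the `tool_weight` parameter, and the function names. The abstract above is truncated before it specifies TAPO's reward, so this should not be read as the paper's formulation.

```python
from statistics import mean, pstdev
from typing import List


def combined_rewards(
    outcome: List[float], tool: List[float], tool_weight: float = 0.5
) -> List[float]:
    """Assumed mixing rule: final-answer reward plus a weighted tool-use reward."""
    if len(outcome) != len(tool):
        raise ValueError("outcome and tool rewards must have the same length")
    return [o + tool_weight * t for o, t in zip(outcome, tool)]


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one prompt: two correct answers, two incorrect.
    outcome = [1.0, 1.0, 0.0, 0.0]
    # Hypothetical tool reward: 1.0 if the crop contained the evidence.
    tool = [1.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(outcome))
    print(group_relative_advantages(combined_rewards(outcome, tool)))
```

With the outcome reward alone, the two correct rollouts receive identical advantages, so a lucky guess is reinforced as strongly as a grounded answer. Once a tool-level term is mixed in, the rollout whose crop contained the evidence is ranked above the one that did not.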
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
