TL;DR
OpenThinkIMG introduces an open-source framework and a novel reinforcement learning method, V-ToolRL, enabling vision-language models to learn adaptive visual tool usage for complex reasoning tasks.
Contribution
OpenThinkIMG provides the first comprehensive open-source infrastructure for tool-augmented LVLMs and proposes V-ToolRL, a reinforcement learning approach for dynamic tool invocation.
Findings
V-ToolRL improves task success rates by 28.83 points over supervised fine-tuning.
The RL agent outperforms baseline models such as Taco and CogCom by 12.7 points.
The approach surpasses GPT-4.1 in accuracy by 8.68 points.
Abstract
While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, considering supervised fine-tuning (SFT) on static demonstrations offers limited policy generalization for dynamic tool invocation, we propose a novel reinforcement learning (RL) framework V-ToolRL to…
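As a rough illustration of the ideas the abstract names (a standardized tool interface, trajectory generation, and a reward signal for tool-using rollouts), here is a minimal Python sketch. Every name in it (`ToolRegistry`, `rollout`, `read_bar`, `scripted_policy`) is hypothetical and is not taken from the OpenThinkIMG codebase; the exact-match reward is an assumption made for the example, not a description of V-ToolRL.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical illustrations, not OpenThinkIMG APIs.
ToolFn = Callable[[dict], str]   # a tool maps keyword arguments to a text observation
Action = Tuple[str, dict]        # (tool name, or "answer", plus its arguments)
Step = Tuple[str, str]           # (tool name, observation returned by the tool)


@dataclass
class ToolRegistry:
    """One uniform calling convention for every registered tool."""
    tools: Dict[str, ToolFn] = field(default_factory=dict)

    def register(self, name: str, fn: ToolFn) -> None:
        self.tools[name] = fn

    def call(self, name: str, args: dict) -> str:
        if name not in self.tools:
            return f"error: unknown tool {name!r}"
        return self.tools[name](args)


def rollout(policy: Callable[[str, List[Step]], Action],
            registry: ToolRegistry,
            question: str,
            gold: str,
            max_steps: int = 4) -> Tuple[List[Step], float]:
    """Run one episode; return the tool-call trajectory and a scalar reward."""
    history: List[Step] = []
    for _ in range(max_steps):
        name, args = policy(question, history)
        if name == "answer":
            # Assumed reward: 1.0 for an exact-match final answer, else 0.0.
            return history, 1.0 if args.get("text") == gold else 0.0
        history.append((name, registry.call(name, args)))
    return history, 0.0  # step budget exhausted without an answer


# Toy tool standing in for a real vision tool: looks up a bar height.
def read_bar(args: dict) -> str:
    return str({"A": 3, "B": 7}[args["label"]])


# Scripted stand-in for a learned policy: call the tool once, then answer.
def scripted_policy(question: str, history: List[Step]) -> Action:
    if not history:
        return "read_bar", {"label": "B"}
    return "answer", {"text": history[-1][1]}


registry = ToolRegistry()
registry.register("read_bar", read_bar)
trajectory, reward = rollout(scripted_policy, registry, "How tall is bar B?", gold="7")
```

In an RL setting, the reward returned by a rollout like this would drive a policy update, whereas supervised fine-tuning would instead imitate stored trajectories; the sketch deliberately leaves the update step out.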
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Softmax · Absolute Position Encodings · Residual Connection
