Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Yifan Zhang; Liang Hu; Haofeng Sun; Peiyu Wang; Yichen Wei; Shukang Yin; Jiangbo Pei; Wei Shen; Peng Xia; Yi Peng; Tianyidan Xie; Eric Li; Yang Liu; Xuchen Song; Yahui Zhou

arXiv:2512.02395 · cs.CV · December 9, 2025

TL;DR

Skywork-R1V4 is a multimodal agentic model that integrates planning, image manipulation, and knowledge retrieval through interleaved reasoning, achieving state-of-the-art results without reinforcement learning.

Contribution

It introduces a unified multimodal agentic framework trained solely with supervised fine-tuning, enabling complex reasoning and tool use without reinforcement learning.

Findings

01. Achieves top scores on the MMSearch (66.1) and FVQA benchmarks.

02. Demonstrates emergent long-horizon reasoning with multiple tool calls.

03. Reaches these results with supervised fine-tuning alone, on fewer than 30,000 high-quality trajectories.

Abstract

Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and…
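The interleaved reasoning described in the abstract, where the model alternates between visual operations and external retrieval, can be pictured as a simple tool loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the tool names (`crop`, `search`, `answer`), the `Step` and `Trajectory` types, and the scripted `toy_policy` are all hypothetical stand-ins, since the abstract does not specify the model's actual tool interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    tool: str      # hypothetical tool name: "crop", "search", or "answer"
    argument: str  # tool input, e.g. an image region or a search query


@dataclass
class Trajectory:
    # Each entry pairs the step taken with the observation the tool returned.
    steps: List[Tuple[Step, str]] = field(default_factory=list)
    answer: str = ""


def run_interleaved(
    policy: Callable[[List[Tuple[Step, str]]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> Trajectory:
    """Alternate between choosing a tool and observing its result."""
    traj = Trajectory()
    for _ in range(max_steps):
        step = policy(traj.steps)      # decide the next action from history
        if step.tool == "answer":      # terminal action: stop and report
            traj.answer = step.argument
            break
        observation = tools[step.tool](step.argument)
        traj.steps.append((step, observation))
    return traj


def toy_policy(history: List[Tuple[Step, str]]) -> Step:
    # Scripted stand-in for the model: one image operation, one search, answer.
    script = [
        Step("crop", "top-left"),
        Step("search", "landmark in crop"),
        Step("answer", "done"),
    ]
    return script[min(len(history), len(script) - 1)]


# Stub tools (assumptions): a real system would edit pixels and query the web.
toy_tools: Dict[str, Callable[[str], str]] = {
    "crop": lambda region: f"cropped:{region}",
    "search": lambda query: f"results:{query}",
}

result = run_interleaved(toy_policy, toy_tools)
```

In this toy run the trajectory records a crop followed by a search before the answer is emitted. A trained model would take the place of `toy_policy`, choosing each step from the accumulated observations.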

Code & Models

Models

🤗 Skywork/R1V4 (model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Robot Manipulation and Learning