OneThinker: All-in-one Reasoning Model for Image and Video

Kaituo Feng; Manyuan Zhang; Hongyu Li; Kaixuan Fan; Shuang Chen; Yilei Jiang; Dian Zheng; Peiwen Sun; Yiyuan Zhang; Haoze Sun; Yan Feng; Peng Pei; Xunliang Cai; Xiangyu Yue

arXiv:2512.03043·cs.CV·April 29, 2026

TL;DR

OneThinker is a unified multimodal reasoning model that handles image and video understanding across multiple tasks in a single model, with strong benchmark results and preliminary zero-shot generalization.

Contribution

The paper introduces the OneThinker-600k training corpus and an RL optimization method to unify diverse visual reasoning tasks in a single model.

Findings

01 Achieves strong results on 31 benchmarks across 10 tasks.

02 Demonstrates effective knowledge transfer between tasks.

03 Shows preliminary zero-shot generalization ability.

Abstract

Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for…
