WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning

Jie Yang; Feipeng Ma; Zitian Wang; Dacheng Yin; Kang Rong; Fengyun Rao; Ruimao Zhang

arXiv:2506.07905 · cs.CV · June 10, 2025

TL;DR

This paper introduces WeThink, an open-source dataset of over 120K multimodal QA pairs with annotated reasoning paths, together with a reinforcement learning approach for general-purpose vision-language reasoning. The listing reports improved performance on 14 MLLM benchmarks.

Contribution

The paper presents a novel scalable QA synthesis pipeline, a large annotated multimodal dataset, and a hybrid RL training method to advance general-purpose vision-language reasoning.

Findings

01 Enhanced performance on 14 MLLM benchmarks

02 Automated data pipeline increases data diversity and model accuracy

03 Effective hybrid reward mechanism improves RL training efficiency
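
This listing does not say how the hybrid reward is composed. As a purely illustrative sketch, assume "hybrid" means a rule-based check for answers that can be verified by string matching, with a model-based judge as the fallback for open-ended answers. Everything below is hypothetical and is not taken from the WeThink code: the `<answer>` tag format, the `verifiable` flag, the function names, and `toy_judge` (a token-overlap stand-in for a real judge model).

```python
import re
from typing import Callable, Optional


def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> pair, or None.

    The tag format is an assumption borrowed from R1-style prompting.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def rule_based_reward(predicted: str, reference: str) -> float:
    """Exact match after normalization: 1.0 if equal, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def toy_judge(predicted: str, reference: str) -> float:
    """Hypothetical stand-in for a model-based judge: Jaccard token overlap."""
    p = set(normalize(predicted).split())
    r = set(normalize(reference).split())
    return len(p & r) / len(p | r) if (p | r) else 0.0


def hybrid_reward(
    response: str,
    reference: str,
    verifiable: bool,
    judge: Callable[[str, str], float],
) -> float:
    """Route to the rule-based check or the judge, depending on the sample."""
    predicted = extract_answer(response)
    if predicted is None:
        return 0.0  # malformed output earns no reward
    if verifiable:
        return rule_based_reward(predicted, reference)
    # Clamp the judge score to [0, 1] so both branches share one scale.
    return min(1.0, max(0.0, float(judge(predicted, reference))))


if __name__ == "__main__":
    print(hybrid_reward("<think>6*7</think><answer> 42 </answer>", "42", True, toy_judge))
    print(hybrid_reward("<answer>a red car</answer>", "a red truck", False, toy_judge))
```

In a real training loop the `judge` argument would wrap a call to a scoring model. Keeping it as a plain callable lets the routing logic be tested without one.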

Abstract

Building on the success of text-based reasoning models like DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. Recent works have attempted to adapt DeepSeek-R1-style reinforcement learning (RL) training paradigms to multimodal large language models (MLLMs), focusing on domain-specific tasks like math and visual perception. A critical question remains: how can we achieve general-purpose visual-language reasoning through RL? To address this challenge, we make three key efforts: (1) A novel Scalable Multimodal QA Synthesis pipeline that autonomously generates context-aware, reasoning-centric question-answer (QA) pairs directly from the given images. (2) The open-source WeThink dataset containing over 120K multimodal QA pairs with annotated reasoning paths, curated from 18 diverse dataset sources and covering various question domains. (3) A…

Code & Models

Repositories

yangjie-cv/wethink
PyTorch · Official

Models

yangjie-cv/WeThink-Qwen2.5VL-7B
model · 1.2k downloads · 5 likes

Datasets

yangjie-cv/WeThink_Multimodal_Reasoning_120K
dataset · 24 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques