CLGRPO: Reasoning Ability Enhancement for Small VLMs

Fanyi Wang; Binzhi Dong; Haotian Hu; Jinjin Xu; Zhiwang Zhang

arXiv:2506.18048·cs.CV·August 12, 2025

TL;DR

This paper introduces CLGRPO, a post-training optimization method that improves the reasoning ability of small vision-language models; the authors report that a 1B model reaches performance comparable to 8B models.

Contribution

The paper presents a four-stage Incremental Training Strategy and a Self-Supervised Chain-of-Thought (COT) Data Construction System for improving reasoning in small VLMs.

Findings

01 Significant accuracy improvement on the EMOSet-118K dataset

02 A 1B SVLM achieves performance comparable to 8B models

03 Reasoning ability is enhanced through staged training and CLGRPO

Abstract

Small Vision Language Models (SVLMs) generally refer to models with at most 2B parameters. Their low cost and low power consumption give them high commercial value. However, their reasoning abilities are limited by their parameter count. To address this issue, this paper proposes a post-training optimization paradigm, the Incremental Training Strategy, to enhance the reasoning ability of SVLMs. First, we construct a Self-Supervised Chain-of-Thought (COT) Data Construction System, which leverages multiple LVLMs with 7B parameters or more to transform original data into COT data in a self-supervised manner. The proposed Incremental Training Strategy consists of four stages. Stage 1 injects domain knowledge by performing Supervised Fine-Tuning (SFT) on the pretrained model using the COT data. Stage 2 aligns the COT data format by conducting a small…
