CLGRPO: Reasoning Ability Enhancement for Small VLMs
Fanyi Wang, Binzhi Dong, Haotian Hu, Jinjin Xu, Zhiwang Zhang

TL;DR
This paper introduces CLGRPO, a post-training optimization method that improves the reasoning ability of small vision-language models (SVLMs), bringing a 1B model to performance comparable to 8B models.
Contribution
The paper presents a four-stage incremental training strategy and a self-supervised chain-of-thought (COT) data construction system for improving reasoning in small VLMs.
Findings
Significant accuracy improvement on the EMOSet-118K dataset
A 1B SVLM achieved performance comparable to 8B models
Enhanced reasoning ability through staged training and CLGRPO
Abstract
Small Vision Language Models (SVLMs) generally refer to models with 2B parameters or fewer. Their low cost and low power consumption give them high commercial value. However, their reasoning abilities are limited by their parameter count. To address this issue, this paper proposes a post-training optimization paradigm called the Incremental Training Strategy to enhance the reasoning ability of SVLMs. First, we construct a Self-Supervised Chain-of-Thought (COT) Data Construction System, which leverages multiple LVLMs with 7B parameters or more to transform original data into COT data in a self-supervised manner. Our proposed Incremental Training Strategy consists of four stages. Stage 1 injects domain knowledge by performing Supervised Fine-Tuning (SFT) on the pretrained model with the COT data. Stage 2 aligns the COT data format by conducting a small…
