RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

Meilong Xu; Di Fu; Jiaxing Zhang; Gong Yu; Jiayu Zheng; Xiaoling Hu; Dongdi Zhao; Feiyang Li; Chao Chen; Yong Cao

arXiv:2511.15923·cs.CV·November 21, 2025

Open Access

TL;DR

This paper introduces RB-FT, a two-stage self-improvement method that enhances vision language models for domain-specific video classification by generating and fine-tuning on self-created rationales, reducing the need for new annotations.

Contribution

The paper presents a novel rationale-based fine-tuning approach that improves domain adaptation of VLMs for video classification without additional annotations.

Findings

01. Significant performance improvements over direct supervised fine-tuning.
02. Effective use of self-generated rationales as intermediate supervision.
03. Validated across diverse video datasets.

Abstract

Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical "rationale gap", where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels,…
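The two-stage data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `VideoSample`, `TrainingExample`, `build_stage1_examples`, `build_stage2_examples`, and `fake_vlm`, as well as both prompt strings, are hypothetical, and the sketch assumes the rationale prompt does not include the class label (the abstract does not say either way). It only builds the training targets for each stage; the actual fine-tuning loop is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VideoSample:
    video_id: str   # hypothetical handle to a video clip
    label: str      # existing task label; no new annotation is added


@dataclass
class TrainingExample:
    video_id: str
    prompt: str
    target: str     # text the model is fine-tuned to produce


# Both prompts are illustrative assumptions, not taken from the paper.
RATIONALE_PROMPT = (
    "Describe in detail the visual and temporal evidence in this video "
    "that is relevant to classifying it."
)
LABEL_PROMPT = "Classify this video. Answer with the class label only."


def build_stage1_examples(
    samples: List[VideoSample],
    generate_rationale: Callable[[str, str], str],
) -> List[TrainingExample]:
    """Stage 1: the model's own rationales become fine-tuning targets."""
    return [
        TrainingExample(
            s.video_id,
            RATIONALE_PROMPT,
            generate_rationale(s.video_id, RATIONALE_PROMPT),
        )
        for s in samples
    ]


def build_stage2_examples(samples: List[VideoSample]) -> List[TrainingExample]:
    """Stage 2: conventional SFT targets built from the task labels."""
    return [TrainingExample(s.video_id, LABEL_PROMPT, s.label) for s in samples]


def fake_vlm(video_id: str, prompt: str) -> str:
    """Stand-in for a real VLM call so the sketch runs without a model."""
    return f"[rationale for {video_id}]"


if __name__ == "__main__":
    data = [VideoSample("v1", "defect"), VideoSample("v2", "normal")]
    for example in build_stage1_examples(data, fake_vlm) + build_stage2_examples(data):
        print(example)
```

In this sketch the same videos feed both stages; only the prompt and the target text change between the rationale stage and the label stage.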

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition