Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding

Songlin Li; Xin Zhu; Zechao Guan; Peipeng Chen; Jian Yao

arXiv:2603.11423 · cs.CV · March 13, 2026

Open Access

TL;DR

This paper introduces R-MSD, a multi-sample distillation framework for video understanding that reduces variance and improves knowledge transfer by leveraging a teacher pool and quality-aware filtering, outperforming single-sample methods.

Contribution

The paper presents R-MSD, a novel multi-sample distillation approach that models teacher sampling variance and uses a task-adaptive teacher pool for more reliable supervision in video understanding.

Findings

01

R-MSD outperforms single-sample distillation across multiple benchmarks.

02

Significant improvements on VideoMME, Video-MMMU, and MathVerse datasets.

03

Marginal gains from the baseline SFT+RL 4B model under the same training budget.

Abstract

Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance responses and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to enhance distillation stability. Rather than relying on a single teacher response, our approach leverages a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, our approach effectively filters teacher noise while maximizing knowledge transfer. Extensive evaluations across comprehensive video understanding benchmarks demonstrate that R-MSD…
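
The idea of drawing several teacher responses per input and filtering them before distillation can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `select_teacher_targets`, the `min_agreement` threshold, and the use of majority voting as a quality signal are all assumptions chosen for illustration, and the sketch covers only the closed-ended case where each sampled response ends in a discrete answer.

```python
from collections import Counter


def select_teacher_targets(responses, min_agreement=0.5):
    """Filter K sampled teacher responses for one closed-ended question.

    responses: list of (answer, rationale) pairs, one per teacher sample.
    min_agreement: hypothetical threshold on the fraction of samples that
        must share the majority answer for the input to be kept.

    Returns (majority_answer, kept_rationales). If the samples disagree
    too much, returns (None, []) so the input can be skipped.
    """
    if not responses:
        return None, []

    # Count how often each final answer appears across the samples.
    counts = Counter(answer for answer, _ in responses)
    majority, votes = counts.most_common(1)[0]

    # Treat low agreement as a sign of unreliable supervision.
    if votes / len(responses) < min_agreement:
        return None, []

    # Keep only rationales that reach the majority answer.
    kept = [rationale for answer, rationale in responses if answer == majority]
    return majority, kept


# Four hypothetical teacher samples for a multiple-choice video question.
samples = [("B", "r1"), ("B", "r2"), ("C", "r3"), ("B", "r4")]
answer, rationales = select_teacher_targets(samples)
```

With a single sample, the student would have had a one-in-four chance of training on the outlier answer `"C"` here; with several samples, the disagreement is visible and can be filtered. Open-ended tasks would need a different quality signal, which this sketch does not attempt.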

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning