CyberV: Cybernetics for Test-time Scaling in Video Understanding
Jiahao Meng, Shuyang Sun, Yue Tan, Lu Qi, Yunhai Tong, Xiangtai Li, Longyin Wen

TL;DR
CyberV introduces a cybernetic framework for adaptive, test-time scaling of multimodal large language models, significantly improving their robustness and accuracy in understanding complex videos without retraining.
Contribution
The paper presents CyberV, a novel cybernetic-inspired framework that enables self-monitoring and dynamic resource allocation in video MLLMs during inference, enhancing performance without retraining.
Findings
Boosts Qwen2.5-VL-7B by 8.3% on VideoMMMU
Improves InternVL3-8B by 5.5% on VideoMMMU
Improves Qwen2.5-VL-72B by 10.0% on VideoMMMU, reaching performance comparable to human experts
Abstract
Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations could be more severe for models with fewer parameters. To address these limitations, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the sensor monitors forward processes of the MLLM and collects intermediate interpretations, such as attention drift, then the controller determines when and how to trigger self-correction and generate…
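The loop described in the abstract (inference system, sensor, controller) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here (`Signal`, `sensor`, `controller`, `cybernetic_loop`, `toy_infer`), the thresholds, and the choice to double the frame count as the feedback action are assumptions made purely for illustration. The real system monitors an actual MLLM's forward pass.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Signal:
    """Hypothetical summary of what the sensor observed in one forward pass."""
    confidence: float        # assumed: probability of the chosen answer
    attention_drift: float   # assumed: drop in attention on video tokens vs. round 0


# Hypothetical inference interface: (question, num_frames) -> (answer, confidence, video_attention)
InferenceFn = Callable[[str, int], Tuple[str, float, float]]


def sensor(confidence: float, attention: float, baseline_attention: float) -> Signal:
    """Collect intermediate signals; drift is measured against the first round."""
    return Signal(confidence, baseline_attention - attention)


def controller(signal: Signal, conf_threshold: float = 0.7, drift_threshold: float = 0.2) -> bool:
    """Return True when another round of inference should be triggered."""
    return signal.confidence < conf_threshold or signal.attention_drift > drift_threshold


def cybernetic_loop(infer: InferenceFn, question: str, max_rounds: int = 3) -> Tuple[str, int]:
    """Run inference, monitor it, and re-run with more frames until the controller is satisfied."""
    history: List[Tuple[str, float]] = []
    num_frames = 8           # assumed starting budget
    baseline = None
    for _ in range(max_rounds):
        answer, conf, attn = infer(question, num_frames)
        if baseline is None:
            baseline = attn
        history.append((answer, conf))
        if not controller(sensor(conf, attn, baseline)):
            break
        num_frames *= 2      # assumed feedback action: allocate more visual evidence
    # Keep the most confident answer seen across rounds.
    best_answer, _ = max(history, key=lambda item: item[1])
    return best_answer, len(history)


def toy_infer(question: str, num_frames: int) -> Tuple[str, float, float]:
    """Stand-in for an MLLM: confidence grows with the number of frames."""
    conf = min(0.4 + 0.02 * num_frames, 0.95)
    answer = "B" if num_frames >= 16 else "A"
    return answer, conf, 0.5


if __name__ == "__main__":
    print(cybernetic_loop(toy_infer, "What happens after the demo?"))
```

With the toy model, the first round is below the confidence threshold, so the controller requests a second round with more frames and the loop stops there. The point of the sketch is only the control flow: extra computation is spent when the monitored signals suggest the first answer is unreliable.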
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
Methods: Softmax · Attention Is All You Need
