CyberV: Cybernetics for Test-time Scaling in Video Understanding

Jiahao Meng; Shuyang Sun; Yue Tan; Lu Qi; Yunhai Tong; Xiangtai Li; Longyin Wen

arXiv:2506.07971·cs.CV·June 10, 2025

TL;DR

CyberV introduces a cybernetic framework for adaptive, test-time scaling of multimodal large language models, significantly improving their robustness and accuracy in understanding complex videos without retraining.

Contribution

The paper presents CyberV, a cybernetics-inspired framework that adds self-monitoring and dynamic resource allocation to video MLLMs during inference, improving performance without retraining.

Findings

01 Boosts Qwen2.5-VL-7B by 8.3% on VideoMMMU

02 Improves InternVL3-8B by 5.5% on VideoMMMU

03 Improves Qwen2.5-VL-72B by 10.0%, reaching performance comparable to human experts

Abstract

Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations could be more severe for models with fewer parameters. To address these limitations, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the sensor monitors forward processes of the MLLM and collects intermediate interpretations, such as attention drift, then the controller determines when and how to trigger self-correction and generate…
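The loop described in the abstract (inference system, sensor, controller) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated here, so the confidence signal, the threshold, the feedback string, and every function name below (`sensor`, `controller`, `cybernetic_loop`, `toy_infer`) are assumptions made purely for illustration. The sketch uses the top answer probability as a stand-in for the monitored signals, and it assumes the model is re-run with textual feedback when that signal is low.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical interface: (question, feedback) -> probability per answer option.
InferFn = Callable[[str, Optional[str]], Dict[str, float]]


def sensor(probs: Dict[str, float]) -> Tuple[str, float]:
    """Summarize one forward pass as (top answer, confidence).

    Confidence is simply the top probability, an assumed stand-in for the
    intermediate signals (such as attention drift) named in the abstract.
    """
    answer = max(probs, key=probs.get)
    return answer, probs[answer]


def controller(confidence: float, threshold: float) -> Optional[str]:
    """Return feedback that triggers another pass, or None to stop."""
    if confidence >= threshold:
        return None
    return "Low confidence: re-examine the key frames before answering."


def cybernetic_loop(
    infer: InferFn,
    question: str,
    threshold: float = 0.7,  # assumed value, not from the paper
    max_rounds: int = 3,     # assumed compute budget, not from the paper
) -> Tuple[str, float]:
    """Run inference, monitor it, and retry with feedback while unsure."""
    feedback: Optional[str] = None
    best_answer, best_conf = "", -1.0
    for _ in range(max_rounds):
        answer, conf = sensor(infer(question, feedback))
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        feedback = controller(conf, threshold)
        if feedback is None:  # confident enough: spend no more compute
            break
    return best_answer, best_conf


def toy_infer(question: str, feedback: Optional[str]) -> Dict[str, float]:
    """Stub model: unsure on the first pass, confident once given feedback."""
    if feedback is None:
        return {"A": 0.4, "B": 0.35, "C": 0.25}
    return {"A": 0.1, "B": 0.8, "C": 0.1}


answer, conf = cybernetic_loop(toy_infer, "Which option is correct?")
print(answer, conf)
```

With the stub model, the first pass falls below the threshold, so the controller requests a second pass, which is then accepted. The point of the sketch is only the control flow: extra inference is spent on inputs the sensor flags as uncertain, and confident inputs exit after a single pass.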

Code & Models

Repositories

marinero4972/cyberv (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need