Continuous Perception Matters: Diagnosing Temporal Integration Failures in Multimodal Models
Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy

TL;DR
This paper introduces CP-Bench, a benchmark to test continuous perception in multimodal models, revealing significant failures in temporal integration across state-of-the-art systems.
Contribution
The paper presents CP-Bench, a controlled benchmark for diagnosing temporal integration failures in multimodal models, highlighting their inability to accumulate evidence over time.
Findings
State-of-the-art models fail dramatically on CP-Bench.
Increasing sampling FPS or finetuning does not improve temporal integration.
Modern architectures have a fundamental limitation in continuous perception.
Abstract
Continuous perception, the ability to integrate visual observations over time in a continuous stream fashion, is essential for robust real-world understanding, yet remains largely untested in current multimodal models. We introduce CP-Bench, a minimal and fully controlled benchmark designed to isolate this capability using an extremely simple task: counting identical cubes in a synthetic scene while the camera moves and only reveals subsets of objects at any moment. Despite the simplicity of the setting, we find that state-of-the-art open-source and commercial models, including Qwen-3-VL, InternVL3, GPT-5, and Gemini-3-Pro, fail dramatically. A static-camera control variant confirms that the failure arises not from object recognition but from an inability to accumulate evidence across time. Further experiments show that neither higher sampling FPS, perception- or spatial-enhanced…
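The counting task described in the abstract can be illustrated with a toy one-dimensional simulation. This is a minimal sketch under stated assumptions, not the CP-Bench implementation: the cube positions, the camera path, the field-of-view width, and the names `visible_ids` and `simulate` are all hypothetical. The sketch also uses ground-truth cube identities, which a model watching identical cubes does not have. It shows only why no single frame, and no naive sum over frames, gives the right count.

```python
from typing import List, Set, Tuple


def visible_ids(cube_xs: List[float], cam_x: float, half_fov: float) -> Set[int]:
    """Indices of cubes whose x-position falls inside the camera's view."""
    return {i for i, x in enumerate(cube_xs) if abs(x - cam_x) <= half_fov}


def simulate(cube_xs: List[float], cam_path: List[float],
             half_fov: float) -> Tuple[int, int, int]:
    """Compare three counting strategies over a moving-camera sequence."""
    frames = [visible_ids(cube_xs, c, half_fov) for c in cam_path]
    per_frame_max = max(len(f) for f in frames)   # best single-frame count
    naive_sum = sum(len(f) for f in frames)       # double-counts revisited cubes
    integrated = len(set().union(*frames))        # evidence accumulated over time
    return per_frame_max, naive_sum, integrated


# Hypothetical scene: six cubes in a row, camera sweeping left to right.
cube_xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
cam_path = [0.5, 1.5, 2.5, 3.5, 4.5]
half_fov = 1.0

print(simulate(cube_xs, cam_path, half_fov))  # (2, 10, 6)
```

In this toy scene each frame shows only two cubes, summing the per-frame counts gives ten, and only the accumulated set recovers the true total of six. A static-camera control in the same spirit would use a field of view wide enough to show all cubes in every frame, so that the single-frame count is already correct.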
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Advanced Vision and Imaging · Hand Gesture Recognition Systems
