Dynamic Reflections: Probing Video Representations with Text Alignment
Tyler Zhu, Tengda Han, Leonidas Guibas, Viorica Pătrăucean, Maks Ovsjanikov

TL;DR
This paper explores the alignment between video and text representations, revealing how data richness affects cross-modal alignment and proposing predictive scaling laws, thereby offering a new zero-shot probing method for video understanding.
Contribution
It is the first comprehensive study of video-text alignment, introducing test-time scaling laws and linking alignment quality to downstream task performance.
Findings
Alignment depends on visual and textual data richness.
Proposed scaling laws predict alignment behavior.
Strong alignment correlates with better downstream performance.
Abstract
The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and…
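The reviews below refer to an "MkNN" metric and to a frame-averaging baseline, but this page does not define either. The following is a minimal sketch, assuming MkNN means the mutual k-nearest-neighbor alignment score common in representation-alignment work, and assuming that multiple frames or captions are pooled by simple averaging. All function names are hypothetical and the embeddings are random stand-ins, not outputs of any real encoder.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def pool_items(per_item_feats):
    """Average a (n_samples, n_items, dim) array over items.

    Items are frames of a video or captions of a sample (assumed pooling).
    """
    return l2_normalize(per_item_feats.mean(axis=1))

def knn_indices(feats, k):
    """Indices of each row's k nearest neighbors by cosine similarity."""
    feats = l2_normalize(feats)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Mean fraction of k-nearest neighbors shared between two feature spaces.

    Row i of feats_a and row i of feats_b must describe the same sample.
    Returns a value in [0, 1]; 1 means identical neighborhood structure.
    """
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

# Toy usage with random stand-in embeddings (not real encoder outputs).
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(100, 8, 32))    # 100 videos, 8 frames, dim 32
caption_feats = rng.normal(size=(100, 4, 64))  # 100 videos, 4 captions, dim 64
video_emb = pool_items(frame_feats)
text_emb = pool_items(caption_feats)
print(mutual_knn_alignment(video_emb, text_emb, k=10))
```

Because the score only compares neighbor sets within each space, the two encoders may have different embedding dimensions. Varying the number of frames or captions passed to `pool_items` is one way to observe how the score changes with test-time data richness under these assumptions.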
Peer Reviews
Decision · ICLR 2026 Poster
1. The study of video-text alignment is meaningful and important. 2. The paper provides comprehensive experiments, which require a lot of hard work. 3. The idea of test-time scaling seems sound. 4. The observations in L182-186 and L484-485 are informative.
1. Limited methodological novelty. The paper reads as an empirical report and might not meet ICLR's innovation threshold. 2. The abstract differs from the paper, which could be misleading. 3. The improvement over previous image-based methods seems limited.
**Comprehensive Approach**: The paper provides the first comprehensive study of video-text representation alignment, extending the Platonic Representation Hypothesis to the temporal domain, making it a significant contribution. **Correlation with Downstream Tasks**: The correlation between alignment scores and performance on semantic and non-semantic tasks demonstrates the practical value of alignment as a metric.
The idea of probing visual representations with video-text alignment is not very convincing. This evaluation is fair for models proposed for cross-modal tasks, but visual ability is more than cross-modal alignment. For example, in tasks such as video object detection and video object tracking, the vision model only needs to detect pixel-level differences in the picture, without needing to be aware of textual semantics. The DINO series [1], SAM series [2], I-JEPA [3] and V-JEPA [4] are some evid
1. Large-Scale Empirical Study: The primary strength of this work lies in its experimental rigor. The authors conduct a comprehensive analysis across a vast suite of 63 vision models and 30 language models on multiple datasets. 2. Critical Baseline: The paper's systematic use of a powerful image-encoder-plus-frame-averaging baseline is a significant contribution. The fact that this simple baseline outperforms many purpose-built video models is a critical finding for the community. 3. Novel Phe
1. The Scaling Law: I am concerned about the scaling law proposed in the paper. What is the true significance of scaling data to boost alignment scores? Isn't the high alignment achieved this way just an artificial way to minimize error? Fundamentally, isn't this just about providing more information to reduce the randomness of the MkNN metric and make the metric itself more robust? But if a model is incapable of producing a robust, comprehensive, and unambiguous representation from a single pi
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
