Dynamic Reflections: Probing Video Representations with Text Alignment

Tyler Zhu; Tengda Han; Leonidas Guibas; Viorica Pătrăucean; Maks Ovsjanikov

arXiv:2511.02767·cs.CV·February 2, 2026

TL;DR

This paper explores the alignment between video and text representations, revealing how data richness affects cross-modal alignment and proposing predictive scaling laws, thereby offering a new zero-shot probing method for video understanding.

Contribution

The paper presents the first comprehensive study of video-text alignment, introduces test-time scaling laws, and links alignment quality to downstream task performance.

Findings

1. Alignment depends on visual and textual data richness.
2. Proposed scaling laws predict alignment behavior.
3. Strong alignment correlates with better downstream performance.

Abstract

The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and…
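The abstract does not say how alignment is scored. The reviews further down refer to an "MkNN" metric and to the Platonic Representation Hypothesis, so the sketch below assumes a mutual k-nearest-neighbor overlap between paired video and text embeddings. The function names, the cosine similarity, and the default `k` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(feats):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.clip(norms, 1e-12, None)


def knn_indices(feats, k):
    """Return the indices of the k nearest neighbors of each row (self excluded)."""
    feats = l2_normalize(np.asarray(feats, dtype=np.float64))
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Average fraction of shared k-nearest neighbors across two embedding spaces.

    feats_a[i] and feats_b[i] must describe the same sample, for example a
    video embedding and the embedding of its caption (assumed pairing).
    """
    if feats_a.shape[0] != feats_b.shape[0]:
        raise ValueError("both embedding sets must contain the same samples")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap)) / k
```

A score of 1 means the two encoders induce identical neighborhoods over the samples, and a score near 0 means the neighborhoods are unrelated. Under this assumed metric, the richer test-time inputs discussed in the abstract (more frames per video, more captions per sample) would enter only through how `feats_a` and `feats_b` are computed.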

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The study of video-text alignment is meaningful and important. 2. The paper provides comprehensive experiments, which require a lot of hard work. 3. The idea of test-time scaling seems sound. 4. The observations in L182-186 and L484-485 are informative.

Weaknesses

1. Limited novelty in methodology. The approach reads as an empirical report and might not meet ICLR's innovation threshold. 2. The abstract differs from the paper, which could be misleading. 3. The improvement over previous image-based methods seems limited.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

**Comprehensive Approach**: The paper provides the first comprehensive study of video-text representation alignment, extending the Platonic Representation Hypothesis to the temporal domain, making it a significant contribution. **Correlation with Downstream Tasks**: The correlation between alignment scores and performance on semantic and non-semantic tasks demonstrates the practical value of alignment as a metric.

Weaknesses

The idea of probing visual representations with video-text alignment is not so convincing. This evaluation is fair for models proposed for cross-modal tasks, but visual ability is more than cross-modal alignment. For example, in tasks such as video object detection and video object tracking, the vision model only needs to detect pixel-level differences in the picture, without needing to be aware of textual semantics. The DINO-series [1], SAM-series [2], I-JEPA [3] and V-JEPA [4] are some evid

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Large-Scale Empirical Study: The primary strength of this work lies in its experimental rigor. The authors conduct a comprehensive analysis across a vast suite of 63 vision models and 30 language models on multiple datasets. 2. Critical Baseline: The paper's systematic use of a powerful image-encoder-plus-frame-averaging baseline is a significant contribution. The fact that this simple baseline outperforms many purpose-built video models is a critical finding for the community. 3. Novel Phe

Weaknesses

1. The Scaling Law: I am concerned about the scaling law proposed in the paper. What is the true significance of scaling data to boost alignment scores? Isn't the high alignment achieved this way just an artificial way to minimize error? Fundamentally, isn't this just about providing more information to reduce the randomness of the MkNN metric and make the metric itself more robust? But if a model is incapable of producing a robust, comprehensive, and unambiguous representation from a single pi

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling