T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval

Yili Li; Gang Xiong; Gaopeng Gou; Xiangyan Qu; Jiamin Zhuang; Zhen Li; Junzheng Shi

arXiv:2507.20518 · cs.CV · July 29, 2025

TL;DR

T2VParser introduces adaptive decomposition tokens for partial alignment in text-to-video retrieval. It targets the mismatch that arises because videos carry richer information than images, while a caption describes only part of the video.

Contribution

The paper proposes Adaptive Decomposition Tokens for extracting multiview semantic representations from both text and video, so that alignment is computed adaptively over semantic parts rather than over the entire representation in text-to-video retrieval.

Findings

01 Achieves accurate partial alignment in experiments

02 Effectively decomposes cross-modal content

03 Retains pretrained model knowledge

Abstract

Text-to-video retrieval essentially aims to train models to align visual content with textual descriptions accurately. Due to the impressive general multimodal knowledge demonstrated by image-text pretrained models such as CLIP, existing work has primarily focused on extending CLIP knowledge for video-text tasks. However, videos typically contain richer information than images. In current video-text datasets, textual descriptions can only reflect a portion of the video content, leading to partial misalignment in video-text matching. Therefore, directly aligning text representations with video representations can result in incorrect supervision, ignoring the inequivalence of information. In this work, we propose T2VParser to extract multiview semantic representations from text and video, achieving adaptive semantic alignment rather than aligning the entire representation. To extract…
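
To make the idea of aligning decomposed views rather than whole representations concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a small set of shared tokens that each attend over text features and over video features to pool one "view" per token, and it scores a text-video pair by a softmax-weighted sum of per-view cosine similarities so that poorly matched views contribute less. The function names (decompose, partial_similarity), the attention pooling, the weighting scheme, and the temperature value are all hypothetical choices made for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)


def decompose(tokens, features):
    """Hypothetical decomposition: each token attends over the features
    (scaled dot-product attention) and pools them into one view vector."""
    dim = len(tokens[0])
    views = []
    for tok in tokens:
        weights = softmax([dot(tok, f) / math.sqrt(dim) for f in features])
        views.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return views


def partial_similarity(text_views, video_views, temperature=0.1):
    """Assumed scoring rule: compare views token by token, then weight the
    per-view similarities with a softmax so well-matched views dominate."""
    per_view = [cosine(t, v) for t, v in zip(text_views, video_views)]
    weights = softmax([s / temperature for s in per_view])
    score = sum(w * s for w, s in zip(weights, per_view))
    return per_view, score


if __name__ == "__main__":
    random.seed(0)
    dim, num_tokens = 8, 4
    rand_vec = lambda: [random.gauss(0.0, 1.0) for _ in range(dim)]
    tokens = [rand_vec() for _ in range(num_tokens)]   # shared across modalities
    text_feats = [rand_vec() for _ in range(5)]        # e.g. word features
    video_feats = [rand_vec() for _ in range(12)]      # e.g. frame features
    per_view, score = partial_similarity(
        decompose(tokens, text_feats), decompose(tokens, video_feats)
    )
    print("per-view similarities:", per_view)
    print("aggregated score:", score)
```

In this sketch the aggregated score always lies between the mean and the maximum of the per-view similarities, which is one simple way to keep video content that the caption does not mention from dominating the match. A trained model would learn the tokens and use encoder outputs (for example, from CLIP) in place of the random vectors used here.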
