GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning
Yicheng Wang, Zhikang Zhang, Jue Wang, David Fan, Zhenlin Xu, Linda Liu, Xiang Hao, Vimal Bhat, Xinyu Li

TL;DR
GEXIA introduces a scalable method for multi-grained video-language learning by expanding data granularity and iteratively approximating multi-grained representations, achieving state-of-the-art results across diverse video tasks.
Contribution
The paper presents GEXIA, a novel approach combining granularity expansion and iterative approximation to enable scalable multi-grained video-language pretraining.
Findings
Achieves state-of-the-art performance on multiple video tasks.
Excels in long-form video understanding despite training on short clips.
Scales to any number of video-text granularity levels.
Abstract
In various video-language learning tasks, the challenge of achieving cross-modality alignment with multi-grained data persists. We propose a method to tackle this challenge from two crucial perspectives: data and modeling. Given the absence of a multi-grained video-text pretraining dataset, we introduce a Granularity EXpansion (GEX) method with Integration and Compression operations to expand the granularity of a single-grained dataset. To better model multi-grained data, we introduce an Iterative Approximation Module (IAM), which embeds multi-grained videos and texts into a unified, low-dimensional semantic space while preserving essential information for cross-modal alignment. Furthermore, GEXIA is highly scalable with no restrictions on the number of video-text granularities for alignment. We evaluate our work on three categories of video tasks across seven benchmark datasets,…
Taxonomy
Topics: Advanced Data Compression Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization
