GAIS: Frame-Level Gated Audio-Visual Integration with Semantic Variance-Scaled Perturbation for Text-Video Retrieval

Bowen Yang; Yun Cao; Chen He; Xiaosu Su

arXiv:2508.01711 · cs.CV · November 19, 2025

TL;DR

GAIS is an audio-visual framework for text-video retrieval that combines frame-level gated fusion with semantics-aware regularization of the text embedding space, improving retrieval accuracy by aligning multimodal features more closely.

Contribution

The paper proposes GAIS, a framework that pairs a Frame-level Gated Fusion (FGF) module for fine-grained temporal fusion with Semantic Variance-Scaled Perturbation (SVSP) to improve multimodal alignment in text-video retrieval.

Findings

01. Outperforms strong baselines on multiple datasets.

02. Achieves better retrieval metrics with efficient computation.

03. Enhances multimodal representation quality.

Abstract

Text-to-video retrieval requires precise alignment between language and temporally rich audio-video signals. However, existing methods often emphasize visual cues while underutilizing audio semantics or relying on coarse fusion strategies, resulting in suboptimal multimodal representations. We introduce GAIS, a retrieval framework that strengthens multimodal alignment from both representation and regularization perspectives. First, a Frame-level Gated Fusion (FGF) module adaptively integrates audio-visual features under textual guidance, enabling fine-grained temporal selection of informative frames. Second, a Semantic Variance-Scaled Perturbation (SVSP) mechanism regularizes the text embedding space by controlling perturbation magnitude in a semantics-aware manner. These two modules are complementary: FGF minimizes modality gaps through selective fusion, while SVSP improves embedding…
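The abstract names the two modules but does not give their equations, so the following is only a minimal NumPy sketch of the general ideas, not the authors' implementation. It assumes that FGF can be approximated by a per-frame sigmoid gate computed from concatenated video, audio, and text features, and that SVSP can be approximated by Gaussian noise scaled by the per-dimension standard deviation of a batch of text embeddings. The function names, the gate parameters `w_gate` and `b_gate`, and `base_scale` are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_frame_fusion(video, audio, text, w_gate, b_gate=0.0):
    """Fuse frame-level video and audio features with a text-conditioned gate.

    Assumed shapes (illustrative only):
      video, audio: (T, D) per-frame features
      text:         (D,)   sentence embedding
      w_gate:       (3*D,) hypothetical gate weights
    Returns the fused features (T, D) and the per-frame gate (T,).
    """
    num_frames = video.shape[0]
    # Repeat the text embedding for every frame so the gate sees all three inputs.
    text_tiled = np.broadcast_to(text, (num_frames, text.shape[0]))
    gate_in = np.concatenate([video, audio, text_tiled], axis=1)  # (T, 3D)
    gate = sigmoid(gate_in @ w_gate + b_gate)                     # (T,)
    # Gate near 1 keeps the visual feature, gate near 0 keeps the audio feature.
    fused = gate[:, None] * video + (1.0 - gate[:, None]) * audio
    return fused, gate


def variance_scaled_perturbation(text_batch, rng, base_scale=0.1):
    """Add Gaussian noise whose size follows the batch's per-dimension spread.

    text_batch: (B, D) text embeddings. This scaling rule is an assumption,
    standing in for the paper's semantics-aware magnitude control.
    """
    std = text_batch.std(axis=0, keepdims=True)   # (1, D)
    noise = rng.standard_normal(text_batch.shape)
    return text_batch + base_scale * std * noise


# Example usage with random stand-in features.
rng = np.random.default_rng(0)
T, D, B = 4, 8, 5
video = rng.standard_normal((T, D))
audio = rng.standard_normal((T, D))
text = rng.standard_normal(D)
fused, gate = gated_frame_fusion(video, audio, text, rng.standard_normal(3 * D))
perturbed = variance_scaled_perturbation(rng.standard_normal((B, D)), rng)
```

In this sketch the gate is a single scalar per frame, which keeps the temporal selection easy to inspect; a per-dimension gate would be an equally plausible reading of the abstract.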
