StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval

Shaokun Wang; Weili Guan; Jizhou Han; Jianlong Wu; Yupeng Hu; Liqiang Nie

arXiv:2601.20597 · cs.CV · April 28, 2026

TL;DR

StructAlign uses a simplex ETF geometric prior and relation-preserving losses to reduce feature drift and modality misalignment in continual text-to-video retrieval.

Contribution

The paper introduces a structured cross-modal alignment framework, built on a simplex Equiangular Tight Frame (ETF) geometry and new loss functions, to address catastrophic forgetting in Continual Text-to-Video Retrieval (CTVR).

Findings

01. Outperforms state-of-the-art continual retrieval methods on benchmark datasets.

02. Effectively mitigates intra-modal and cross-modal feature drift.

03. Enhances stability and accuracy in incremental learning of text-video associations.

Abstract

Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text-video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features…
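
To make the geometric prior concrete, here is a minimal sketch of the standard simplex ETF construction from the neural-collapse literature. It is not taken from the paper: the function name `simplex_etf`, the random orthonormal basis, and the requirement `dim >= num_classes` are assumptions for illustration, and how StructAlign builds or uses its ETF may differ. The sketch only shows the defining property: K unit-norm anchor vectors whose pairwise cosine similarity is exactly -1/(K-1).

```python
import numpy as np


def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF.

    Hypothetical helper for illustration; not the paper's implementation.
    Uses M = sqrt(K / (K - 1)) * U @ (I - 11^T / K), where U has
    orthonormal columns. This particular construction needs dim >= K.
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    if dim < num_classes:
        raise ValueError("this construction assumes dim >= num_classes")

    rng = np.random.default_rng(seed)
    # Reduced QR of a random Gaussian matrix gives U with orthonormal columns.
    u, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))

    k = num_classes
    # Centering matrix: removes the mean direction shared by all columns.
    centering = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * (u @ centering)


# Usage: build 5 anchors in a 16-dimensional feature space and inspect
# their Gram matrix (ones on the diagonal, -1/(K-1) everywhere else).
etf = simplex_etf(num_classes=5, dim=16)
gram = etf.T @ etf
print(np.round(gram, 3))
```

Because the anchors are fixed and maximally separated, a shared set of them can in principle serve as common targets for both text and video features of the same category, which is the intuition behind using such a geometry as a unified prior.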
