Cross-Modal Dual-Causal Learning for Long-Term Action Recognition

Xu Shaowu; Jia Xibin; Gao Junyu; Sun Qianmei; Chang Jing; Fan Chao

arXiv:2507.06603 · cs.CV · August 4, 2025

TL;DR

This paper introduces CMDCL, a novel cross-modal dual-causal learning framework that improves long-term action recognition by modeling causal relationships between videos and texts, addressing biases and confounders.

Contribution

It proposes a structural causal model for cross-modal causal learning in LTAR, incorporating dual causal interventions to enhance robustness over existing methods.

Findings

01

Outperforms baselines on Charades, Breakfast, and COIN datasets.

02

Effectively removes cross-modal biases and visual confounders.

03

Demonstrates robustness in long-term action recognition tasks.

Abstract

Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modal-specific biases but lack cross-modal causal modeling, limiting their utility in VLM-based LTAR. This paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts. CMDCL addresses cross-modal biases in text embeddings via textual causal intervention and removes confounders inherent in the visual modality through visual causal intervention guided by the debiased text. These dual-causal interventions enable…
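
The abstract contrasts statistical correlation with causal intervention but does not spell out the mechanics in this excerpt. As a generic illustration only (not the CMDCL method), the sketch below applies the standard backdoor adjustment, P(Y | do(X)) = Σ_z P(Y | X, z) P(z), to a hypothetical toy distribution. The variables (a binary confounder Z standing in for scene context, a visual cue X, an action label Y) and all probabilities are assumptions invented for the example.

```python
from fractions import Fraction as F

# Hypothetical toy distribution (all numbers are illustrative assumptions).
# Z: binary confounder (e.g. scene context), X: visual cue, Y: action label.
p_z = {0: F(1, 2), 1: F(1, 2)}            # P(Z=z)
p_x1_given_z = {0: F(1, 5), 1: F(4, 5)}   # P(X=1 | Z=z)
p_y1_given_xz = {                         # P(Y=1 | X=x, Z=z), keyed by (x, z)
    (0, 0): F(1, 10), (1, 0): F(3, 10),
    (0, 1): F(5, 10), (1, 1): F(7, 10),
}

def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    p1 = p_x1_given_z[z]
    return p1 if x == 1 else 1 - p1

def observational(x):
    """P(Y=1 | X=x): weights each z by P(z | x), so Z's influence leaks in."""
    joint = {z: p_x_given_z(x, z) * p_z[z] for z in p_z}
    total = sum(joint.values())
    return sum(p_y1_given_xz[(x, z)] * joint[z] / total for z in p_z)

def interventional(x):
    """P(Y=1 | do(X=x)) via backdoor adjustment: weights each z by P(z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)

# Apparent effect of X on Y from correlation vs. the adjusted causal effect.
correlational_gap = observational(1) - observational(0)
causal_gap = interventional(1) - interventional(0)
print(correlational_gap, causal_gap)
```

In this toy setup the correlational gap is larger than the adjusted one because Z raises both the chance of seeing the cue and the chance of the label. That is the kind of spurious association an intervention is meant to remove.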

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Action Observation and Synchronization