Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment
Xin Wang, Peng-Jie Li, Yuan-Yuan Shen

TL;DR
This paper introduces LMAC-Net, a multimodal attention model for long-term action quality assessment that explicitly aligns visual and audio cues to capture cross-modal interactions and score performances more accurately.
Contribution
The paper proposes a multimodal attention consistency mechanism with a local query encoder and dual-level scoring, advancing multimodal fusion for long-term action quality assessment.
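To make the idea of attention consistency concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the shared query set, the scaled dot-product attention, and the symmetric KL penalty are all assumptions chosen for illustration, and every name (`query_attention`, `attention_consistency`) is hypothetical. The sketch lets a small set of queries attend separately over visual and audio feature sequences, then measures how much the two attention maps disagree over time.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_attention(queries, features):
    # queries: (Q, D), features: (T, D) -> attention weights of shape (Q, T).
    # Scaled dot-product attention; an assumed stand-in for the paper's encoder.
    d = queries.shape[-1]
    return softmax(queries @ features.T / np.sqrt(d), axis=-1)


def attention_consistency(attn_a, attn_b, eps=1e-8):
    # Symmetric KL divergence between two attention maps, averaged over queries.
    # Hypothetical penalty: zero when both modalities attend to the same time steps.
    log_a = np.log(attn_a + eps)
    log_b = np.log(attn_b + eps)
    kl_ab = np.sum(attn_a * (log_a - log_b), axis=-1)
    kl_ba = np.sum(attn_b * (log_b - log_a), axis=-1)
    return float(np.mean(0.5 * (kl_ab + kl_ba)))


# Toy data: T time steps, D feature dimensions, Q queries (all sizes are arbitrary).
rng = np.random.default_rng(0)
T, D, Q = 16, 8, 4
visual = rng.normal(size=(T, D))
audio = rng.normal(size=(T, D))
queries = rng.normal(size=(Q, D))

attn_visual = query_attention(queries, visual)
attn_audio = query_attention(queries, audio)
loss = attention_consistency(attn_visual, attn_audio)
print(f"consistency penalty: {loss:.4f}")
```

In a trained model, a penalty of this kind would be added to the score regression loss so that both modalities learn to focus on the same segments of the performance. Whether LMAC-Net uses this exact form is not stated in the summary above.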
Findings
LMAC-Net outperforms existing methods on the RG and Fis-V datasets.
Explicit multimodal alignment improves assessment accuracy.
The approach effectively models complex cross-modal interactions.
Abstract
Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos lasting up to several minutes. This task plays an important role in the automated evaluation of artistic sports such as rhythmic gymnastics and figure skating, where both accurate motion execution and temporal synchronization with background music are essential for performance assessment. However, existing methods predominantly fall into two categories: unimodal approaches that rely solely on visual features, which are inadequate for modeling multimodal cues like music; and multimodal approaches that typically employ simple feature-level contrastive fusion, overlooking deep cross-modal collaboration and temporal dynamics. As a result, they struggle to capture complex interactions between modalities and fail to accurately track critical performance changes throughout extended…
