# AFCLNet: An Attention and Feature-Consistency-Loss-Based Multi-Task Learning Network for Affective Matching Prediction in Music–Video Clips

**Authors:** Zhibin Su, Jinyu Liu, Luyue Zhang, Yiming Feng, Hui Ren

PMC · DOI: 10.3390/s26010123 · Sensors (Basel, Switzerland) · 2025-12-24

## TL;DR

This paper introduces AFCLNet, a multi-task learning network that uses attention and a feature-consistency loss to predict how well a music clip and a video clip match emotionally.

## Contribution

The main contribution is an attention-based multi-task learning framework, trained with a feature-consistency loss across dual semantic embeddings, for cross-modal affective matching between music and video.

## Key findings

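- AFCLNet achieves a mean absolute error of 0.109 in music–video matching score prediction on a self-collected benchmark dataset.
- On that dataset, the proposed method is reported to significantly outperform existing approaches to music–video affective matching prediction.
- A decoupled Deep Canonical Correlation Analysis (DCCA), which uses matched/mismatched sample-pairing criteria and separate modality-specific component extractors, dynamically refines the cross-modal feature projection space.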
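
As a reminder of what the reported metric measures, mean absolute error averages the absolute gap between predicted and annotated matching scores. The sketch below is illustrative only: the function name, the 0-to-1 score scale, and the sample values are assumptions and do not come from the paper.

```python
def mean_absolute_error(predicted, target):
    """Average of |prediction - target| over all music-video pairs."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical matching scores on an assumed 0-1 scale (not data from the paper).
predicted_scores = [0.80, 0.35, 0.60, 0.10]
annotated_scores = [0.75, 0.50, 0.60, 0.30]

mae = mean_absolute_error(predicted_scores, annotated_scores)
print(f"MAE = {mae:.3f}")
```

On this assumed scale, an MAE of 0.109 would mean predictions deviate from the annotated matching score by roughly a tenth of the range on average.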

## Abstract

Emotion matching prediction between music and video segments is essential for intelligent mobile sensing systems, where multimodal affective cues collected from smart devices must be jointly analyzed for context-aware media understanding. However, traditional approaches relying on single-modality feature extraction struggle to capture complex cross-modal dependencies, resulting in a gap between low-level audiovisual signals and high-level affective semantics. To address these challenges, a dual-driven framework that integrates perceptual characteristics with objective feature representations is proposed for audiovisual affective matching prediction. The framework incorporates fine-grained affective states of audiovisual data to better characterize cross-modal correlations from an emotional distribution perspective. Moreover, a decoupled Deep Canonical Correlation Analysis approach is developed, incorporating discriminative sample-pairing criteria (matched/mismatched data discrimination) and separate modality-specific component extractors, which dynamically refine the feature projection space. To further enhance multimodal feature interaction, an Attention and Feature-Consistency-Loss-Based Multi-Task Learning Network is proposed. In addition, a feature-consistency loss function is introduced to impose joint constraints across dual semantic embeddings, ensuring both affective consistency and matching accuracy. Experiments on a self-collected benchmark dataset demonstrate that the proposed method achieves a mean absolute error of 0.109 in music–video matching score prediction, significantly outperforming existing approaches.
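
The abstract does not give the form of the feature-consistency loss, so the following is only a minimal sketch of how a matching-score objective could be combined with a consistency constraint on paired embeddings. Everything here is an assumption made for illustration: the function names, the L1 matching term, the score-weighted squared distance, and the `weight` parameter are hypothetical and are not the authors' formulation.

```python
def l1_loss(pred, target):
    """Mean absolute difference between predicted and annotated scores."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def joint_loss(pred_scores, true_scores, music_emb, video_emb, weight=0.5):
    """Hypothetical joint objective: matching error plus a consistency penalty.

    The consistency term pulls the music and video embeddings of a pair
    together in proportion to the pair's annotated matching score, so
    well-matched pairs are penalized more for being far apart.
    """
    matching_term = l1_loss(pred_scores, true_scores)
    consistency_term = sum(
        s * squared_distance(m, v)
        for s, m, v in zip(true_scores, music_emb, video_emb)
    ) / len(true_scores)
    return matching_term + weight * consistency_term


# Two hypothetical pairs: the first fully matched, the second fully mismatched.
true_scores = [1.0, 0.0]
pred_scores = [0.5, 0.5]
music_emb = [[1.0, 0.0], [0.0, 1.0]]
video_emb = [[0.0, 0.0], [1.0, 0.0]]

loss = joint_loss(pred_scores, true_scores, music_emb, video_emb)
print(loss)
```

In this toy setup only the matched pair contributes to the consistency term, which is one simple way to tie embedding agreement to the matching label; a real multi-task network would compute both terms on learned features and backpropagate through them.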

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12788018/full.md

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12788018/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/PMC12788018/full.md

---
Source: https://tomesphere.com/paper/PMC12788018