Dynamic Inter-Class Confusion-Aware Encoder for Audio-Visual Fusion in Human Activity Recognition

Kaixuan Cong; Yifan Wang; Rongkun Xue; Yuyang Jiang; Yiming Feng; Jing Yang

arXiv:2507.09323·cs.CV·July 15, 2025

TL;DR

This paper introduces DICCAE, an encoder that improves audio-visual human activity recognition by dynamically adjusting a confusion loss according to inter-class confusion and by using a self-supervised pre-training strategy. It reports near state-of-the-art results, with 65.5% top-1 accuracy on VGGSound.

Contribution

The paper proposes a category-level alignment encoder with a dynamic confusion loss, together with a self-supervised pre-training framework for audio-visual human activity recognition.

Findings

01. Achieves 65.5% top-1 accuracy on the VGGSound dataset.

02. Distinguishes similar activities more effectively through dynamic confusion adjustment.

03. Validates each module's contribution via ablation studies.

Abstract

Humans do not understand individual events in isolation; rather, they generalize concepts within classes and compare them to others. Existing audio-video pre-training paradigms focus only on aligning the audio and video modalities as a whole, without using cognitive induction and contrast during training to reinforce the distinction between easily confused classes. This paper proposes the Dynamic Inter-Class Confusion-Aware Encoder (DICCAE), an encoder that aligns audio-video representations at a fine-grained category level. DICCAE addresses category confusion by dynamically adjusting the confusion loss based on inter-class confusion degrees, thereby enhancing the model's ability to distinguish between similar activities. To further extend the application of DICCAE, we also introduce a novel training framework that incorporates both audio and video modalities, as well as…
