DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation

Jia Li; Yangchen Yu; Yin Chen; Yu Zhang; Peng Jia; Yunbo Xu; Ziqiang Li; Meng Wang; Richang Hong

arXiv:2410.08470 · cs.HC · October 14, 2024

TL;DR

This paper introduces DAT, a dialogue-aware transformer with modality-group fusion, for human engagement estimation from audio-visual data, achieving state-of-the-art results in multi-domain benchmarks.

Contribution

The paper presents a novel modality-group fusion strategy within a dialogue-aware transformer framework, improving robustness and performance in engagement estimation from audio-visual inputs.

Findings

01

Achieved a concordance correlation coefficient (CCC) of 0.76 on the NoXi Base test set.

02

Improved engagement-level regression accuracy over baseline models.

03

Demonstrated robustness across multiple datasets in the MultiMediate'24 challenge.
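
The CCC reported in the findings above is the concordance correlation coefficient, which penalizes both low correlation and any shift in mean or scale between predictions and labels. A minimal sketch using the standard population formula follows; the function name and the sample values are illustrative assumptions, not taken from the paper or its repository.

```python
from statistics import fmean


def concordance_correlation_coefficient(y_true, y_pred):
    """Standard CCC: 2*cov / (var_true + var_pred + (mean_true - mean_pred)^2).

    Uses population (divide-by-N) variance and covariance.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_true = fmean(y_true)
    mean_pred = fmean(y_pred)

    var_true = fmean([(t - mean_true) ** 2 for t in y_true])
    var_pred = fmean([(p - mean_pred) ** 2 for p in y_pred])
    covariance = fmean(
        [(t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred)]
    )

    denominator = var_true + var_pred + (mean_true - mean_pred) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * covariance / denominator


# Illustrative values only (not data from the paper).
labels = [1.0, 2.0, 3.0]
perfect = concordance_correlation_coefficient(labels, [1.0, 2.0, 3.0])  # 1.0
shifted = concordance_correlation_coefficient(labels, [2.0, 3.0, 4.0])  # 4/7
```

The shifted case shows why CCC is stricter than Pearson correlation: the predictions are perfectly correlated with the labels, yet the constant offset of 1 lowers the score to 4/7.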

Abstract

Engagement estimation plays a crucial role in understanding human social behaviors, attracting increasing research interest in fields such as affective computing and human-computer interaction. In this paper, we propose a Dialogue-Aware Transformer framework (DAT) with Modality-Group Fusion (MGF), which relies solely on audio-visual input and is language-independent, for estimating human engagement in conversations. Specifically, our method employs a modality-group fusion strategy that independently fuses audio and visual features within each modality for each person before inferring the entire audio-visual content. This strategy significantly enhances the model's performance and robustness. Additionally, to better estimate the target participant's engagement levels, the introduced Dialogue-Aware Transformer considers both the participant's behavior and cues from their conversational…
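
The ordering the abstract describes, fusing features within each modality for each person before combining audio and visual content, can be illustrated with a small sketch. This is not the authors' implementation (the official code is in the msa-lmc/dat repository): the feature dimensions, the hidden size, the helper names, and the use of concatenation followed by a linear projection are all assumptions made for illustration, and the random weights stand in for layers that would be learned.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random (untrained) weight matrix with out_dim rows and in_dim columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vector):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def fuse_group(feature_vectors, weights):
    """Concatenate every feature vector of one modality, then project it."""
    concatenated = [x for vec in feature_vectors for x in vec]
    return apply_linear(weights, concatenated)


rng = random.Random(0)
HIDDEN = 4  # hypothetical hidden size

# Hypothetical per-frame features for ONE person.
audio_features = [[0.1, 0.2, 0.3], [0.5, 0.4]]        # two audio descriptors
visual_features = [[0.9, 0.8], [0.7, 0.6, 0.5, 0.4]]  # two visual descriptors

# Step 1: fuse within each modality group, independently.
w_audio = make_linear(3 + 2, HIDDEN, rng)
w_visual = make_linear(2 + 4, HIDDEN, rng)
audio_token = fuse_group(audio_features, w_audio)
visual_token = fuse_group(visual_features, w_visual)

# Step 2: only then combine the two modality tokens for this person.
w_joint = make_linear(2 * HIDDEN, HIDDEN, rng)
person_token = apply_linear(w_joint, audio_token + visual_token)
```

In a dialogue setting, the same two steps would be repeated for each participant, and the resulting per-person tokens would be passed to a downstream sequence model.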

Code & Models

Repositories

msa-lmc/dat (PyTorch, official)

Taxonomy

Topics: Context-Aware Activity Recognition Systems · Speech and dialogue systems · Seismology and Earthquake Studies

Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Sparse Evolutionary Training