DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation
Vu Ngoc Tu, Van Thong Huynh, Hyung-Jeong Yang, M. Zaigham Zaheer, Shah Nawaz, Karthik Nandakumar, Soo-Hyung Kim

TL;DR
This paper introduces a dilated convolutional Transformer model for estimating conversational engagement in the MULTIMEDIATE 2023 competition. It improves on the baselines by 7% on the test set, and simple concatenation followed by self-attention works best for fusing the modalities.
Contribution
The study presents a novel dilated convolutional Transformer architecture and demonstrates its effectiveness in multimodal engagement estimation in conversations.
Findings
7% improvement on test set over baselines
Simple concatenation with self-attention fusion performs best
Effective multimodal fusion enhances engagement estimation
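The fusion finding can be illustrated with a minimal NumPy sketch. It shows one plausible reading of "concatenation with self-attention fusion": per-modality feature sequences are concatenated along the feature axis, then a single-head scaled dot-product self-attention is applied over time. The function name, the two example modalities, the feature sizes, and the random projection weights are all assumptions for illustration; the summary does not specify the paper's actual layer configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def concat_self_attention_fusion(modalities, w_q, w_k, w_v):
    """Hypothetical fusion sketch (not the authors' implementation).

    modalities: list of arrays, each of shape (T, d_m), aligned in time.
    w_q, w_k, w_v: projection matrices of shape (D, d_k), D = sum of d_m.
    Returns the fused sequence (T, d_k) and the attention weights (T, T).
    """
    x = np.concatenate(modalities, axis=-1)      # (T, D): simple concatenation
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # (T, d_k) each
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (T, T) scaled dot products
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights

# Toy inputs: two assumed modalities with 6 time steps each.
rng = np.random.default_rng(0)
T = 6
audio = rng.normal(size=(T, 4))   # assumed audio features
video = rng.normal(size=(T, 8))   # assumed visual features
D, d_k = 4 + 8, 5
w_q, w_k, w_v = (rng.normal(size=(D, d_k)) * 0.1 for _ in range(3))

fused, attn = concat_self_attention_fusion([audio, video], w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch every time step attends to every other time step of the concatenated features, so cross-modal interactions are learned by the projections rather than by a dedicated cross-attention module.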
Abstract
Conversational engagement estimation is posed as a regression problem: identifying the favorable attention and involvement of the participants in a conversation. The task is crucial for gaining insight into human interaction dynamics and behavior patterns within a conversation. In this research, we introduce a dilated convolutional Transformer for modeling and estimating human engagement in the MULTIMEDIATE 2023 competition. Our proposed system surpasses the baseline models, exhibiting a noteworthy 7% improvement on the test set and also improving on the validation set. Moreover, we employ different modality fusion mechanisms and show that, for this type of data, simple concatenation followed by self-attention fusion achieves the best performance.
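To make the "dilated convolutional" part of the model name concrete, here is a minimal NumPy sketch of a 1D dilated convolution and of how stacking dilated layers widens the receptive field over a sequence. The kernel, the dilation rates, and the helper names are assumptions chosen for illustration; the abstract does not state the settings the authors used.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """'Valid' 1D dilated convolution (cross-correlation, as in deep learning).

    x: sequence of shape (T,); kernel: filter taps of shape (k,).
    Taps are spaced `dilation` steps apart in the input.
    """
    k = len(kernel)
    span = (k - 1) * dilation + 1          # input steps covered by one output
    out_len = len(x) - span + 1
    if out_len <= 0:
        raise ValueError("sequence too short for this kernel and dilation")
    out = np.zeros(out_len)
    for i in range(out_len):
        taps = x[i : i + span : dilation]  # every `dilation`-th sample
        out[i] = np.dot(taps, kernel)
    return out

def receptive_field(kernel_size, dilations):
    # Receptive field of a stack of dilated layers with stride 1.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

x = np.arange(10, dtype=float)             # toy feature sequence 0..9
kernel = np.array([1.0, 0.0, -1.0])        # assumed difference-like filter
y = dilated_conv1d(x, kernel, dilation=2)  # each output is x[i] - x[i + 4]
print(y)
print(receptive_field(3, [1, 2, 4]))
```

With a kernel of size 3, three stacked layers with dilations 1, 2, and 4 cover 15 input steps, whereas three undilated layers would cover only 7. This is the usual motivation for dilation when modeling long conversational sequences without adding parameters.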
Taxonomy
Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Speech and dialogue systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Dense Connections · Label Smoothing · Dropout · Absolute Position Encodings · Byte Pair Encoding
