DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation

Vu Ngoc Tu; Van Thong Huynh; Hyung-Jeong Yang; M. Zaigham Zaheer; Shah Nawaz; Karthik Nandakumar; Soo-Hyung Kim

arXiv:2308.01966·cs.MM·August 7, 2023·1 cites

TL;DR

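This paper introduces a dilated convolutional Transformer model for estimating conversational engagement in the MULTIMEDIATE 2023 competition. It improves on the baseline by 7% on the test set and 4% on the validation set, and finds that simple concatenation with self-attention fusion works best for combining modalities.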

Contribution

The study presents a novel dilated convolutional Transformer architecture and demonstrates its effectiveness in multimodal engagement estimation in conversations.

Findings

01. 7% improvement on the test set over baselines

02. Simple concatenation with self-attention fusion performs best

03. Effective multimodal fusion enhances engagement estimation
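
As a rough illustration of finding 02, the sketch below concatenates per-frame features from two modalities and passes the result through a single scaled dot-product self-attention step. It is not the authors' implementation: the feature values, the dimensions, the function names, and the use of identity query/key/value projections (no learned weights, one head) are all assumptions made to keep the example small and runnable.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def concat_fusion(*modalities):
    """Concatenate the per-frame feature vectors of each modality."""
    return [sum((list(f) for f in frame_feats), [])
            for frame_feats in zip(*modalities)]


def self_attention(frames):
    """Scaled dot-product self-attention with identity Q/K/V projections
    (an assumption; a trained model would use learned projections)."""
    d = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, frames))
                    for j in range(d)])
    return out


# Hypothetical inputs: 3 frames, 2 audio dims and 1 video dim per frame.
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
video = [[1.0], [0.0], [1.0]]

fused = concat_fusion(audio, video)   # 3 frames x 3 dims
attended = self_attention(fused)      # same shape, context-mixed
```

Each output frame is a weighted average of all fused frames, so information from every modality and time step can influence every position. A regression head on top of `attended` would then produce the engagement score.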

Abstract

Conversational engagement estimation is posed as a regression problem, entailing the identification of the favorable attention and involvement of the participants in the conversation. This task is crucial for gaining insight into human interaction dynamics and behavior patterns within a conversation. In this research, we introduce a dilated convolutional Transformer for modeling and estimating human engagement in the MULTIMEDIATE 2023 competition. Our proposed system surpasses the baseline models, exhibiting a noteworthy 7% improvement on the test set and 4% on the validation set. Moreover, we employ different modality fusion mechanisms and show that, for this type of data, a simple concatenation method with self-attention fusion achieves the best performance.
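
To make the term "dilated convolution" concrete, here is a minimal pure-Python sketch of a 1D dilated convolution over a sequence. It only illustrates the general operation; the kernel values, the input sequence, and the function name `dilated_conv1d` are assumptions, and the abstract does not specify the layer configuration used in the paper.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D convolution (cross-correlation form) whose kernel taps
    are spaced `dilation` steps apart."""
    span = (len(kernel) - 1) * dilation  # distance from first to last tap
    return [
        sum(w * signal[t + k * dilation] for k, w in enumerate(kernel))
        for t in range(len(signal) - span)
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [1.0, 1.0, 1.0]

y1 = dilated_conv1d(x, kernel, dilation=1)  # each output covers 3 adjacent steps
y2 = dilated_conv1d(x, kernel, dilation=2)  # each output covers a 5-step window
```

With the same three weights, a dilation of 2 lets each output see a window of five time steps instead of three, which is why dilation is a common way to widen temporal context without adding parameters.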

Taxonomy

Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Speech and dialogue systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Dense Connections · Label Smoothing · Dropout · Absolute Position Encodings · Byte Pair Encoding