The Right to Talk: An Audio-Visual Transformer Approach

Thanh-Dat Truong; Chi Nhan Duong; The De Vu; Hoang Anh Pham; Bhiksha Raj; Ngan Le; Khoa Luu

arXiv:2108.03256 · cs.CV · August 29, 2021

TL;DR

This paper presents a novel Audio-Visual Transformer method for localizing and highlighting the main speaker in multi-speaker videos, effectively exploiting cross-modal and temporal relationships to improve speaker detection.

Contribution

It introduces a new Transformer-based approach that models audio-visual correlations and temporal dynamics for main speaker localization, along with a newly collected dataset.
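To make "audio-visual correlations" concrete, here is a minimal sketch of cross-modal scaled dot-product attention, in which each visual frame feature attends over all audio segment features. It is an illustration under stated assumptions, not the paper's architecture: the function names, the toy feature values, and the choice of visual features as queries are all hypothetical, and learned projections, multiple heads, and position encodings are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(visual, audio):
    """Let each visual feature (query) attend over audio features (keys/values).

    visual: list of T_v vectors of dimension d
    audio:  list of T_a vectors of dimension d
    Returns (attended, weights): T_v output vectors of dimension d,
    and T_v rows of T_a attention weights.
    """
    d = len(visual[0])
    scale = 1.0 / math.sqrt(d)
    attended, weights = [], []
    for q in visual:
        # Scaled dot product between this visual query and every audio key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in audio]
        w = softmax(scores)
        # Weighted sum of the audio vectors, used here as values.
        out = [sum(w[j] * audio[j][i] for j in range(len(audio))) for i in range(d)]
        weights.append(w)
        attended.append(out)
    return attended, weights


# Made-up toy features: 2 visual frames, 3 audio segments, dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
attended, weights = cross_modal_attention(visual, audio)
```

Each row of `weights` sums to 1 and indicates which audio segments a given visual frame correlates with most strongly. A full Transformer layer would add learned query, key, and value projections and run several such heads in parallel.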

Findings

01 Effective localization of main speakers in multi-speaker videos.

02 Improved accuracy over previous methods in audio-visual speaker detection.

03 First study to automatically localize main speakers in both audio and visual channels.

Abstract

Turn-taking plays an essential role in structuring and regulating a conversation. Identifying the main speaker (who is properly taking his/her turn to speak) and the interrupters (who are interrupting or reacting to the main speaker's utterances) remains a challenging task. Although prior methods have partially addressed this task, several limitations remain. Firstly, directly associating audio and visual features may limit the correlations that can be extracted, because the two modalities differ. Secondly, the relationships across temporal segments, which help maintain the consistency of localization, separation, and conversation contexts, are not effectively exploited. Finally, the interactions between speakers, which usually contain the tracking and anticipatory decisions about the transition to a new speaker, are often ignored. Therefore, this work introduces a…

Code & Models

Repositories

uark-cviu/right2talk (Official)

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Dropout · Label Smoothing · Residual Connection · Byte Pair Encoding