Multimodal Transformer Distillation for Audio-Visual Synchronization

Xuanjun Chen; Haibin Wu; Chung-Che Wang; Hung-yi Lee; Jyh-Shing Roger Jang

arXiv:2210.15563·cs.CV·March 19, 2024

TL;DR

This paper introduces MTDVocaLiST, a distilled multimodal Transformer model for audio-visual synchronization that achieves high accuracy with significantly reduced computational resources by mimicking the original model's attention mechanisms.

Contribution

The paper proposes a novel multimodal Transformer distillation (MTD) loss and uncertainty weighting to effectively compress the VocaLiST model while maintaining its performance.

Findings

01

MTD loss outperforms other distillation methods.

02

MTDVocaLiST surpasses similar-size state-of-the-art models by up to 15.65%.

03

Model size is reduced by 83.52% with similar accuracy.

Abstract

Audio-visual synchronization aims to determine whether the mouth movements and speech in a video are synchronized. VocaLiST reaches state-of-the-art performance by incorporating multimodal Transformers to model audio-visual interaction information. However, it requires high computing resources, making it impractical for real-world applications. This paper proposes MTDVocaLiST, a model trained with our proposed multimodal Transformer distillation (MTD) loss. The MTD loss enables MTDVocaLiST to deeply mimic the cross-attention distribution and value-relation in the Transformer of VocaLiST. Additionally, we harness uncertainty weighting to fully exploit the interaction information across all layers. Our proposed method is effective in two aspects: from the distillation method perspective, the MTD loss outperforms other strong distillation baselines. From the distilled model's…
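The abstract names three ingredients: matching the teacher's cross-attention distribution, matching its value-relation, and combining per-layer losses with uncertainty weighting. The sketch below illustrates one plausible reading of those ideas in NumPy. It is not the paper's implementation: the KL direction, the scaled dot-product form of the value-relation (borrowed from MiniLM-style distillation), the log-variance form of the uncertainty weighting (borrowed from Kendall et al.-style multi-task weighting), and all function names are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl_rows(p, q, eps=1e-12):
    # Mean KL(p || q) over every row of two row-stochastic arrays.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def cross_attention(q, k):
    # Attention distribution; q: (heads, Tq, d), k: (heads, Tk, d) -> (heads, Tq, Tk).
    d = q.shape[-1]
    return softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))

def value_relation(v):
    # Relation among value vectors; v: (heads, T, d) -> (heads, T, T).
    d = v.shape[-1]
    return softmax(v @ v.transpose(0, 2, 1) / np.sqrt(d))

def layer_distill_loss(teacher, student):
    # teacher / student: dicts holding 'q', 'k', 'v' for one cross-attention layer.
    # Assumed: same head count and sequence lengths; hidden sizes may differ,
    # because only (T x T) maps are compared.
    att = kl_rows(cross_attention(teacher["q"], teacher["k"]),
                  cross_attention(student["q"], student["k"]))
    val = kl_rows(value_relation(teacher["v"]), value_relation(student["v"]))
    return att + val

def uncertainty_weighted(losses, log_vars):
    # Assumed weighting: sum_i exp(-s_i) * L_i + s_i, with s_i a learnable
    # log-variance per layer (trained jointly with the student in practice).
    losses = np.asarray(losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * losses + log_vars))
```

Because both terms compare sequence-by-sequence maps rather than hidden states, a narrower student can be matched against a wider teacher without a projection layer, which is the usual motivation for this family of losses. In a training loop one would compute `layer_distill_loss` for each Transformer layer and pass the resulting list to `uncertainty_weighted`.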

Taxonomy

Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Speech and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Label Smoothing · Position-Wise Feed-Forward Layer · Dense Connections · Absolute Position Encodings · Layer Normalization