Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities

Yidi Li; Yihan Li; Yixin Guo; Bin Ren; Zhenhuan Xu; Hao Guo; Hong Liu; Nicu Sebe

arXiv:2408.14585·cs.CV·February 18, 2025

Open Access

TL;DR

This paper introduces GLDTracker, a novel audio-visual speaker tracking system that effectively handles incomplete modalities using a teacher-student distillation framework and global-local feature reconstruction, improving robustness in complex scenes.

Contribution

The paper proposes a global-local distillation network that fuses incomplete multi-modal data for robust speaker tracking, a novel approach to handling missing modalities.

Findings

01 Outperforms existing state-of-the-art trackers on the AV16.3 dataset.

02 Demonstrates robustness with incomplete audio-visual data.

03 Achieves superior accuracy in complex dynamic scenes.

Abstract

In speaker tracking research, integrating and complementing multi-modal data is a crucial strategy for improving the accuracy and robustness of tracking systems. However, tracking with incomplete modalities remains a challenging issue due to noisy observations caused by occlusion, acoustic noise, and sensor failures. Especially when there is missing data in multiple modalities, the performance of existing multi-modal fusion methods tends to decrease. To this end, we propose a Global-Local Distillation-based Tracker (GLDTracker) for robust audio-visual speaker tracking. GLDTracker is driven by a teacher-student distillation model, enabling the flexible fusion of incomplete information from each modality. The teacher network processes global signals captured by camera and microphone arrays, and the student network handles local information subject to visual occlusion and missing audio…
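The teacher-student idea in the abstract can be illustrated with a minimal sketch. This is not GLDTracker's architecture: the single linear layers, the mean-squared-error objective, the feature values, and every name below (`mask_features`, `linear`, `distillation_loss`, `train_student`) are hypothetical stand-ins chosen only to show a student that sees incomplete input being trained to match a teacher that sees the complete input.

```python
# Toy teacher-student distillation with incomplete inputs.
# Assumption: linear layers and an MSE loss stand in for the real networks.

def mask_features(features, missing):
    """Zero the entries whose indices are in `missing` (mimics occlusion or lost audio)."""
    return [0.0 if i in missing else x for i, x in enumerate(features)]

def linear(weights, inputs):
    """Dense layer without bias: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def distillation_loss(teacher_out, student_out):
    """Mean squared error between teacher and student outputs."""
    n = len(teacher_out)
    return sum((t - s) ** 2 for t, s in zip(teacher_out, student_out)) / n

def train_student(teacher_w, student_w, full_input, missing, lr=0.1, steps=50):
    """Fit the student on the masked input so its output approaches the teacher's."""
    target = linear(teacher_w, full_input)          # teacher sees the complete input
    local_input = mask_features(full_input, missing)  # student sees the incomplete input
    n = len(target)
    initial = distillation_loss(target, linear(student_w, local_input))
    for _ in range(steps):
        out = linear(student_w, local_input)
        for i, row in enumerate(student_w):
            # Gradient of the MSE with respect to student_w[i][j].
            grad_scale = 2.0 * (out[i] - target[i]) / n
            for j in range(len(row)):
                row[j] -= lr * grad_scale * local_input[j]
    final = distillation_loss(target, linear(student_w, local_input))
    return initial, final

# Hypothetical 6-dimensional feature: first three "visual", last three "audio".
global_features = [0.5, -0.2, 0.8, 0.1, -0.6, 0.3]
teacher_weights = [[0.2, -0.1, 0.4, 0.0, 0.3, -0.5],
                   [-0.3, 0.6, 0.1, 0.2, -0.4, 0.1]]
student_weights = [[0.0] * 6 for _ in range(2)]

initial_loss, final_loss = train_student(
    teacher_weights, student_weights, global_features, missing={1, 4})
print(initial_loss, final_loss)
```

In this toy setting the student never observes the masked entries, yet its output is pulled toward the teacher's, which is the general intuition behind distilling from complete to incomplete observations.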

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing

Methods: Softmax · Attention Is All You Need