Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities
Yidi Li, Yihan Li, Yixin Guo, Bin Ren, Zhenhuan Xu, Hao Guo, Hong Liu, Nicu Sebe

TL;DR
This paper introduces GLDTracker, a novel audio-visual speaker tracking system that effectively handles incomplete modalities using a teacher-student distillation framework and global-local feature reconstruction, improving robustness in complex scenes.
Contribution
The paper proposes a global-local distillation network that fuses incomplete multi-modal data for robust speaker tracking, offering a new way to handle missing modalities.
Findings
Outperforms existing state-of-the-art trackers on the AV16.3 dataset.
Demonstrates robustness with incomplete audio-visual data.
Achieves superior accuracy in complex dynamic scenes.
Abstract
In speaker tracking research, integrating and complementing multi-modal data is a crucial strategy for improving the accuracy and robustness of tracking systems. However, tracking with incomplete modalities remains a challenging issue due to noisy observations caused by occlusion, acoustic noise, and sensor failures. Especially when there is missing data in multiple modalities, the performance of existing multi-modal fusion methods tends to decrease. To this end, we propose a Global-Local Distillation-based Tracker (GLDTracker) for robust audio-visual speaker tracking. GLDTracker is driven by a teacher-student distillation model, enabling the flexible fusion of incomplete information from each modality. The teacher network processes global signals captured by camera and microphone arrays, and the student network handles local information subject to visual occlusion and missing audio…
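The abstract excerpt ends before it describes the training objective, so the following is only a generic sketch of teacher-student distillation with a missing modality, not GLDTracker's actual loss. It assumes a temperature-softened KL divergence between teacher and student outputs, and it models a missing modality by zeroing part of the student's input. The names `distillation_loss`, `mask_modalities`, and the shared linear head `W` are hypothetical and chosen for illustration.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    z = x - np.max(x, axis=axis, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=axis, keepdims=True))


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions.

    Assumption: a standard knowledge-distillation loss, used here as a
    stand-in because the source does not state the paper's objective.
    """
    log_p = log_softmax(teacher_logits / temperature)
    log_q = log_softmax(student_logits / temperature)
    kl = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    return float(np.mean(kl) * temperature ** 2)


def mask_modalities(features, keep_mask):
    """Zero out feature dimensions of a missing modality.

    Hypothetical stand-in for visual occlusion or missing audio.
    """
    return features * keep_mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy fused features: first 4 dims "visual", last 4 dims "audio".
    features = rng.normal(size=(5, 8))
    W = rng.normal(size=(8, 3))  # hypothetical shared output head

    # Teacher sees complete inputs; student sees inputs with audio missing.
    audio_missing = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
    teacher_logits = features @ W
    student_logits = mask_modalities(features, audio_missing) @ W

    print("distillation loss:", distillation_loss(teacher_logits, student_logits))
```

In a real training loop the student's parameters would be updated to reduce this loss, which pushes its predictions on incomplete inputs toward the teacher's predictions on complete inputs.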
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing
Methods: Softmax · Attention Is All You Need
