There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge
Francisco Rivera Valverde, Juana Valeria Hurtado, Abhinav Valada

TL;DR
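This paper introduces MM-DistillNet, a self-supervised framework in which teacher networks operating on RGB, depth, and thermal images distill knowledge into a single audio student network. The student detects and tracks multiple objects using sound, including in scenes with a moving camera, and is reported to outperform existing methods.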
Contribution
The work presents a self-supervised knowledge distillation framework, together with a new MTA loss and a pretext task, that enables robust multi-object detection and tracking from sound by transferring knowledge from multiple visual modalities.
Findings
Outperforms state-of-the-art methods in multimodal object detection and tracking.
Effective in dynamic scenes with moving cameras and multiple objects.
Utilizes a large-scale multimodal dataset for training and evaluation.
Abstract
Attributes of sound inherent to objects can provide valuable cues to learn rich representations for object detection and tracking. Furthermore, the co-occurrence of audiovisual events in videos can be exploited to localize objects over the image field by solely monitoring the sound in the environment. Thus far, this has only been feasible in scenarios where the camera is static and for single object detection. Moreover, the robustness of these methods has been limited as they primarily rely on RGB images which are highly susceptible to illumination and weather changes. In this work, we present the novel self-supervised MM-DistillNet framework consisting of multiple teachers that leverage diverse modalities including RGB, depth and thermal images, to simultaneously exploit complementary cues and distill knowledge into a single audio student network. We propose the new MTA loss function…
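The multi-teacher setup described in the abstract can be illustrated with a minimal sketch. This is not the MTA loss, whose definition is not given in the text above; it swaps in a plain mean-squared-error distillation term averaged over teachers, purely to show the data flow from several modality-specific teachers into one student. The function names, the flat feature vectors, and the example values are all hypothetical.

```python
from typing import Dict, List


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_teacher_distillation_loss(
    student: List[float], teachers: Dict[str, List[float]]
) -> float:
    """Average the student-to-teacher MSE over all teachers.

    Assumption: a generic stand-in for illustration only. The paper's
    MTA loss is not reproduced here.
    """
    if not teachers:
        raise ValueError("at least one teacher output is required")
    return sum(mse(student, t) for t in teachers.values()) / len(teachers)


# Hypothetical feature vectors from three modality-specific teachers.
teacher_outputs = {
    "rgb": [1.0, 0.0, 0.5],
    "depth": [0.8, 0.2, 0.5],
    "thermal": [0.9, 0.1, 0.4],
}
# Hypothetical output of the audio student for the same time window.
student_output = [0.9, 0.1, 0.5]

loss = multi_teacher_distillation_loss(student_output, teacher_outputs)
print(f"distillation loss: {loss:.6f}")
```

In a real training loop the vectors would be intermediate feature maps or detection outputs, and only the student's parameters would be updated, since the teachers supply the supervisory signal in place of manual labels.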
