There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge

Francisco Rivera Valverde; Juana Valeria Hurtado; Abhinav Valada

arXiv:2103.01353·cs.CV·November 5, 2021

TL;DR

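This paper introduces MM-DistillNet, a self-supervised framework in which multiple teacher networks operating on RGB, depth, and thermal images distill knowledge into a single audio student network for multi-object detection and tracking. The authors report that it outperforms existing methods, including in dynamic scenes with moving cameras.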

Contribution

The work presents a novel self-supervised distillation framework with a new MTA loss and a pretext task, enabling robust multi-object detection and tracking from sound alone by distilling knowledge from multiple visual modalities.

Findings

01. Outperforms state-of-the-art methods in multimodal object detection and tracking.

02. Effective in dynamic scenes with moving cameras and multiple objects.

03. Utilizes a large-scale multimodal dataset for training and evaluation.

Abstract

Attributes of sound inherent to objects can provide valuable cues to learn rich representations for object detection and tracking. Furthermore, the co-occurrence of audiovisual events in videos can be exploited to localize objects over the image field by solely monitoring the sound in the environment. Thus far, this has only been feasible in scenarios where the camera is static and for single object detection. Moreover, the robustness of these methods has been limited as they primarily rely on RGB images which are highly susceptible to illumination and weather changes. In this work, we present the novel self-supervised MM-DistillNet framework consisting of multiple teachers that leverage diverse modalities including RGB, depth and thermal images, to simultaneously exploit complementary cues and distill knowledge into a single audio student network. We propose the new MTA loss function…
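
The teacher-student arrangement described in the abstract can be illustrated with a generic multi-teacher distillation sketch. This is not the paper's MTA loss, whose definition is not given here; it substitutes a plain mean-squared error between the audio student's features and the element-wise average of the teachers' features. The function names, the feature vectors, and the averaging scheme are all assumptions made for illustration.

```python
from typing import Dict, List

Vector = List[float]


def mean_teacher_target(teacher_features: Dict[str, Vector]) -> Vector:
    """Average the teachers' feature vectors element-wise.

    Assumption: every teacher emits a feature vector of the same length.
    """
    vectors = list(teacher_features.values())
    if not vectors:
        raise ValueError("at least one teacher is required")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all teacher features must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(length)]


def distillation_loss(student: Vector, teacher_features: Dict[str, Vector]) -> float:
    """Mean-squared error between student features and the averaged teacher target.

    A generic stand-in for a distillation objective, not the MTA loss.
    """
    target = mean_teacher_target(teacher_features)
    if len(student) != len(target):
        raise ValueError("student and teacher features must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, target)) / len(target)


# Hypothetical toy features from three modality-specific teachers.
teachers = {
    "rgb": [1.0, 2.0],
    "depth": [3.0, 4.0],
    "thermal": [5.0, 6.0],
}
audio_student = [3.0, 3.0]  # hypothetical features from the audio student

loss = distillation_loss(audio_student, teachers)
print(loss)
```

In a real training loop the teachers would be frozen networks and the loss would be minimized with respect to the student's parameters only; the averaging step is one simple way to combine teachers and may differ from what the authors do.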
