Object Segmentation with Audio Context

Kaihui Zheng; Yuqing Ren; Zixin Shen; Tianxu Qin

arXiv:2301.10295·cs.CV·January 26, 2023

TL;DR

This paper introduces an audio-visual approach to video instance segmentation that fuses audio features with video features, reports a slight improvement over the video-only base model, and contributes a new dataset of 20 vocal classes.

Contribution

First exploration of audio-visual integration in video instance segmentation, including a new dataset and a combined decoder for feature fusion.

Findings

01 Slight performance improvements over the base model

02 Effective multimodal feature fusion demonstrated

03 New dataset with 20 vocal classes created

Abstract

Visual objects often have acoustic signatures that are naturally synchronized with them in audio-bearing video recordings. In this project, we explore multimodal feature aggregation for the video instance segmentation task: we integrate audio features into our video segmentation model to form an audio-visual learning scheme. Our method builds on an existing video instance segmentation method that leverages rich contextual information across video frames. Since this is the first attempt to investigate audio-visual instance segmentation, we collect a novel dataset comprising 20 vocal classes with synchronized video and audio recordings. By using a combined decoder to fuse video and audio features, our model shows a slight improvement over the base model. Additionally, we show the effectiveness of different modules by conducting extensive…
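
The abstract does not say how the combined decoder fuses the two modalities, so the following is only an illustrative sketch, not the authors' method. It assumes one common fusion pattern: each video token attends over the audio tokens with scaled dot-product attention, and the attended audio context is added back to the video token. The function name `fuse_audio_into_video` and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_audio_into_video(video_tokens, audio_tokens):
    """Hypothetical fusion step (an assumption, not the paper's decoder).

    Each video token queries the audio tokens with scaled dot-product
    attention; the weighted audio context is added residually.
    Both inputs are lists of vectors with the same feature dimension.
    """
    dim = len(video_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for v in video_tokens:
        # Attention weights of this video token over all audio tokens.
        weights = softmax([dot(v, a) * scale for a in audio_tokens])
        # Weighted sum of audio tokens, one value per feature dimension.
        context = [
            sum(w * a[i] for w, a in zip(weights, audio_tokens))
            for i in range(dim)
        ]
        # Residual connection keeps the original visual information.
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused


# Toy features: 2 video tokens and 3 audio tokens, each 2-dimensional.
video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = fuse_audio_into_video(video, audio)
```

The output keeps one vector per video token, so a downstream segmentation head could consume it in place of the video-only features. The residual form means the model falls back to the visual features when the audio context is uninformative, which is one plausible reading of a slight rather than large gain over the base model.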

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization

Methods: Balanced Selection