TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation

Tanzila Rahman; Mengyu Yang; Leonid Sigal

arXiv:2110.13412 · cs.CV · October 27, 2021

TL;DR

TriBERT is a transformer-based model that learns comprehensive audio-visual representations including vision, pose, and audio, significantly improving sound source separation and cross-modal retrieval tasks.

Contribution

The paper introduces TriBERT, a novel multi-modal transformer architecture with pose integration and weak supervision for granular audio-visual tasks.

Findings

01. Enhanced sound source separation performance on the MUSIC21 dataset.

02. Significant improvement (up to 66.7%) in cross-modal pose retrieval accuracy.

03. Effective learning of multi-modal features with weak supervision.

Abstract

The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks. However, most multi-modal variants (e.g., ViLBERT) have limited themselves to visual-linguistic data. Relatively few have explored their use in audio-visual modalities, and none, to our knowledge, illustrate them in the context of granular audio-visual detection or segmentation tasks such as sound source separation and localization. In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, with the use of flexible co-attention. The use of pose keypoints is inspired by recent works that illustrate that such representations can significantly boost performance in many audio-visual scenarios where often one or more…
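
To make the idea of co-attention across three modalities concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`co_attention`, `tri_modal_step`), the token counts, the feature width, and the choice to let each stream attend to the concatenated tokens of the other two streams with one shared set of projection weights are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def co_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Scaled dot-product attention with queries from one stream and
    keys/values from another stream (hypothetical helper)."""
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ v, weights


def tri_modal_step(streams, w_q, w_k, w_v):
    """One co-attention step: every stream attends to the other two.

    Assumption: the other two streams are simply concatenated, and the
    projection weights are shared across streams to keep the sketch short.
    """
    outputs = {}
    for name, tokens in streams.items():
        others = [t for n, t in streams.items() if n != name]
        context = np.concatenate(others, axis=0)
        outputs[name], _ = co_attention(tokens, context, w_q, w_k, w_v)
    return outputs


rng = np.random.default_rng(0)
d = 8  # illustrative feature width
streams = {
    "vision": rng.normal(size=(4, d)),  # 4 visual tokens
    "pose": rng.normal(size=(3, d)),    # 3 pose-keypoint tokens
    "audio": rng.normal(size=(5, d)),   # 5 audio tokens
}
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

outputs = tri_modal_step(streams, w_q, w_k, w_v)
for name, out in outputs.items():
    print(name, out.shape)
```

Each stream keeps its own token count after the step, while every output token is a weighted mix of features from the other two modalities. A real model would typically use separate learned projections per stream, multiple heads, and residual connections with layer normalization.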

Code & Models

Repositories

ubc-vision/tribert (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization

Methods: Linear Layer · Layer Normalization · Attention Is All You Need · Softmax · Dense Connections · Residual Connection · WordPiece · Linear Warmup With Linear Decay · Weight Decay · Attention Dropout