TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
Tanzila Rahman, Mengyu Yang, Leonid Sigal

TL;DR
TriBERT is a transformer-based model that learns joint representations across three modalities (vision, pose, and audio), improving sound source separation and cross-modal retrieval.
Contribution
The paper introduces TriBERT, a novel multi-modal transformer architecture with pose integration and weak supervision for granular audio-visual tasks.
Findings
Improved sound source separation performance on the MUSIC21 dataset.
Significant improvement (up to 66.7%) in cross-modal pose retrieval accuracy.
Effective learning of multi-modal features with weak supervision.
Abstract
The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks. However, most multi-modal variants (e.g., ViLBERT) have limited themselves to visual-linguistic data. Relatively few have explored its use in audio-visual modalities, and none, to our knowledge, illustrate them in the context of granular audio-visual detection or segmentation tasks such as sound source separation and localization. In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, with the use of flexible co-attention. The use of pose keypoints is inspired by recent works that illustrate that such representations can significantly boost performance in many audio-visual scenarios where often one or more…
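The co-attention idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each modality's tokens act as queries over the concatenated tokens of the other two modalities, omits the learned query/key/value projections and multi-head structure a real transformer would use, and all names (`co_attend`, `tri_modal_co_attention`) and token counts are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def co_attend(query_tokens, context_tokens):
    # Scaled dot-product attention where queries come from one stream
    # and keys/values come from another stream (no learned projections here).
    d = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context_tokens, weights

def tri_modal_co_attention(vision, pose, audio):
    # ASSUMPTION: each stream attends over the concatenation of the other two.
    streams = {"vision": vision, "pose": pose, "audio": audio}
    fused = {}
    for name, tokens in streams.items():
        others = np.concatenate(
            [t for other, t in streams.items() if other != name], axis=0
        )
        attended, _ = co_attend(tokens, others)
        fused[name] = tokens + attended  # residual connection
    return fused

# Toy inputs: (number of tokens, feature dimension); sizes are arbitrary.
rng = np.random.default_rng(0)
vision = rng.normal(size=(4, 8))
pose = rng.normal(size=(3, 8))
audio = rng.normal(size=(5, 8))

fused = tri_modal_co_attention(vision, pose, audio)
_, example_weights = co_attend(vision, np.concatenate([pose, audio], axis=0))
```

Each output stream keeps its own token count and feature size, so blocks of this kind could be stacked; the attention weights for a query token sum to one across the tokens of the other two modalities.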
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization
Methods: Linear Layer · Layer Normalization · Attention Is All You Need · Softmax · Dense Connections · Residual Connection · WordPiece · Linear Warmup With Linear Decay · Weight Decay · Attention Dropout
