SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification

Gnana Praveen Rajasekhar; Jahangir Alam

arXiv:2506.17694 · cs.CV · June 25, 2025

TL;DR

This paper introduces a self-supervised audiovisual speaker verification framework using a shared vision transformer backbone, enabling efficient, modality-agnostic verification without labeled data, and demonstrating competitive performance.

Contribution

It presents a unified self-supervised learning approach with a single backbone for audio and visual inputs, reducing computational costs and handling missing modalities.

Findings

01 Achieves competitive verification performance without labeled data.

02 Reduces computational cost compared to methods that use separate modality-specific architectures.

03 Handles missing modalities: the same backbone accepts audio, visual, or audiovisual inputs.

Abstract

Conventional audio-visual methods for speaker verification rely on large amounts of labeled data and separate modality-specific architectures, which is computationally expensive and limits their scalability. To address these problems, we propose a self-supervised learning framework based on contrastive learning with asymmetric masking and masked data modeling to obtain robust audiovisual feature representations. In particular, we employ a unified framework for self-supervised audiovisual speaker verification using a single shared backbone for audio and visual inputs, leveraging the versatility of vision transformers. The proposed unified framework can handle audio, visual, or audiovisual inputs using a single shared vision transformer backbone during training and testing while being computationally efficient and robust to missing modalities. Extensive experiments demonstrate that our…
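
As a rough illustration of the contrastive objective mentioned in the abstract, the sketch below computes a symmetric InfoNCE-style loss between audio and visual embeddings of the same clips. It is a sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact loss, so the InfoNCE form, the function names, and the temperature of 0.07 are all hypothetical. The embeddings here are plain NumPy arrays standing in for outputs of the shared vision transformer backbone.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    Row i of audio_emb and row i of visual_emb are assumed to come from the
    same clip (positive pair); every other row in the batch is a negative.
    """
    a = l2_normalize(audio_emb)
    v = l2_normalize(visual_emb)
    logits = a @ v.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(a.shape[0])
    loss_a2v = -log_softmax(logits)[idx, idx].mean()    # audio -> visual
    loss_v2a = -log_softmax(logits.T)[idx, idx].mean()  # visual -> audio
    return 0.5 * (loss_a2v + loss_v2a)


# Toy data: visual embeddings are noisy copies of the audio embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
visual = audio + 0.01 * rng.normal(size=(8, 16))

loss_matched = symmetric_info_nce(audio, visual)
loss_mismatched = symmetric_info_nce(audio, visual[::-1])
```

On this toy data, `loss_matched` comes out lower than `loss_mismatched`, because reversing the visual batch pairs every audio row with an embedding from a different clip. In a real setup the two inputs would be embeddings of (masked) audio spectrogram patches and face-frame patches produced by the same backbone.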

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis

Methods: Contrastive Learning · Dense Connections · Layer Normalization · Vision Transformer