CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection

Andrea Appiani; Cigdem Beyan

arXiv:2410.14509 · cs.CV · October 21, 2024

TL;DR

This paper introduces CLIP-VAD, a voice activity detection method that applies a vision-language model, specifically CLIP, to video segments and textual prompts, and reports better results than existing visual VAD methods on three benchmarks without extensive audio-visual pre-training.

Contribution

The study presents a new VAD approach using CLIP's visual and text encoders with prompt engineering, outperforming existing methods without large-scale audio-visual pre-training.

Findings

01 Outperforms existing visual VAD methods on three benchmarks.

02 Achieves superior results compared to audio-visual approaches.

03 Does not require extensive audio-visual dataset pre-training.

Abstract

Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in vision-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments showing the upper body of an individual, while the text encoder handles textual descriptions automatically generated through prompt engineering. The embeddings from these two encoders are then fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks showcases the superior performance…
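
The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the fusion network, so the concatenation-plus-MLP design, the dimensions (`EMBED_DIM`, `HIDDEN_DIM`), the random stand-in embeddings, and every function name below are assumptions chosen for illustration. In a real pipeline the two input vectors would come from the CLIP visual and text encoders, and the weights would be trained rather than randomly initialized.

```python
import math
import random

# Hypothetical sizes for illustration only; real CLIP embeddings are much larger.
EMBED_DIM = 8
HIDDEN_DIM = 4


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Apply a dense layer: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse_and_classify(visual_emb, text_emb, hidden_layer, output_layer):
    """Assumed late fusion: concatenate both embeddings, then a two-layer MLP."""
    fused = list(visual_emb) + list(text_emb)
    hidden = relu(dense(fused, hidden_layer))
    logit = dense(hidden, output_layer)[0]
    return sigmoid(logit)  # interpreted here as P(person is speaking)


# Stand-ins for encoder outputs (random vectors, NOT real CLIP embeddings).
rng = random.Random(0)
visual_emb = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]
text_emb = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]

# Untrained weights; a real model would learn these from labeled VAD data.
hidden_layer = init_layer(2 * EMBED_DIM, HIDDEN_DIM, rng)
output_layer = init_layer(HIDDEN_DIM, 1, rng)

prob = fuse_and_classify(visual_emb, text_emb, hidden_layer, output_layer)
print(f"P(speaking) = {prob:.3f}")
```

Concatenation is only one plausible way to combine the two embeddings; the truncated abstract leaves the actual fusion design open, so this sketch should be read as a shape-level illustration of "two encoders, one fused classifier" and nothing more.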

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Contrastive Language-Image Pre-training