CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection