Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams