Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions