Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases