Textual Supervision for Visually Grounded Spoken Language Understanding