SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set