Learning to Generate Grounded Visual Captions without Localization Supervision