Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning