Oracle performance for visual captioning
Li Yao, Nicolas Ballas, Kyunghyun Cho, John R. Smith, Yoshua Bengio

TL;DR
This paper empirically estimates performance upper bounds on visual captioning datasets by decomposing the task into two steps, visual concept extraction and language modeling, and assuming the first step is perfect. Current models fall well short of these bounds, which points to limitations in both the datasets and the models.
Contribution
The paper introduces a method for empirically estimating performance upper bounds for visual captioning without additional labeling or human evaluation, giving future work a reference point against which to measure improvements.
Findings
Current models are significantly below the upper bounds.
The number of visual concepts captured affects captioning quality.
Dataset difficulty varies and impacts model performance.
Abstract
The task of associating images and videos with a natural language description has attracted a great amount of attention recently. Rapid progress has been made in terms of both developing novel algorithms and releasing new datasets. Indeed, the state-of-the-art results on some of the standard datasets have been pushed into the regime where it has become more and more difficult to make significant improvements. Instead of proposing new models, this work investigates the possibility of empirically establishing performance upper bounds on various visual captioning datasets without extra data labelling effort or human evaluation. In particular, it is assumed that visual captioning is decomposed into two steps: from visual inputs to visual concepts, and from visual concepts to natural language descriptions. One would be able to obtain an upper bound when assuming the first step is perfect and…
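The two-step decomposition described in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that the "perfect" visual concepts for an image are the k most frequent non-stop-word tokens in its reference captions; the abstract as shown here is truncated and does not confirm that choice. The names `oracle_concepts`, `concept_recall`, and `STOP_WORDS` are hypothetical, and the second step (a language model conditioned on the concepts) is not shown.

```python
from collections import Counter

# Hypothetical stop-word list, chosen for illustration only.
STOP_WORDS = frozenset(
    {"a", "an", "the", "is", "are", "on", "in", "of", "with", "and"}
)


def oracle_concepts(references, k):
    """Return up to k of the most frequent non-stop-word tokens.

    Assumption: concepts are read directly off the ground-truth captions,
    which stands in for a perfect visual-input-to-concept step.
    """
    counts = Counter(
        token
        for caption in references
        for token in caption.lower().split()
        if token not in STOP_WORDS
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:k]]


def concept_recall(candidate, concepts):
    """Fraction of the oracle concepts that appear in a candidate caption."""
    if not concepts:
        return 0.0
    tokens = set(candidate.lower().split())
    return sum(c in tokens for c in concepts) / len(concepts)


references = [
    "a man is riding a horse",
    "a man rides a brown horse on the beach",
    "the man is on a horse",
]
concepts = oracle_concepts(references, k=2)
print(concepts)
print(concept_recall("a man walks", concepts))
```

Varying `k` in a sketch like this is one way to probe how the number of concepts handed to the language step might relate to caption quality, which is the kind of relationship the Findings section mentions.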
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
