What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics
David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, Bryan Seybold, John F. Canny

TL;DR
This paper investigates how dataset-specific linguistic patterns and diversity influence the performance and generalization of visual description models, revealing that current metrics and models favor generic captions due to limited diversity.
Contribution
The study analyzes linguistic diversity in caption datasets, demonstrating its impact on model performance and proposing methods to maintain diversity for better generalization.
Findings
Caption diversity affects the informativeness of generated captions.
State-of-the-art models can outperform ground truth captions on metrics due to dataset biases.
Limited linguistic diversity leads to models generating generic, uninformative descriptions.
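To make "linguistic diversity" concrete at the token level, here is a minimal sketch of a distinct-n ratio (unique n-grams divided by total n-grams) over a set of captions. This is a common diversity measure chosen for illustration and is not necessarily the one used in the paper. The function names `ngrams` and `distinct_n`, the whitespace tokenization, and the example captions are all assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(captions, n=1):
    """Ratio of unique n-grams to total n-grams across a list of captions.

    Illustrative only: uses lowercase whitespace tokenization, which is an
    assumption and not the paper's preprocessing.
    """
    counts = Counter()
    for caption in captions:
        tokens = caption.lower().split()
        counts.update(ngrams(tokens, n))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Hypothetical caption sets: one repetitive, one varied.
generic = ["a man is talking", "a man is talking", "a man is talking"]
varied = [
    "a chef slices an onion",
    "two dogs chase a ball",
    "a woman tunes her guitar",
]

print(distinct_n(generic, n=1))
print(distinct_n(varied, n=1))
```

A lower ratio indicates more repetition. Under this measure the repetitive set scores well below the varied one, which is the kind of gap the findings above associate with generic, uninformative descriptions.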
Abstract
While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data leading to evaluation metric increases, but what are those patterns? In this work, we examine several popular visual description datasets, and capture, analyze, and understand the dataset-specific linguistic patterns that models exploit but do not generalize to new domains. At the token level, sample level, and dataset level, we find that caption diversity is a major driving factor behind the generation of generic and uninformative captions. We further show that state-of-the-art models even outperform held-out ground truth captions on modern metrics, and that this…
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Video Analysis and Summarization
