What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

David M. Chan; Austin Myers; Sudheendra Vijayanarasimhan; David A. Ross; Bryan Seybold; John F. Canny

arXiv:2205.06253·cs.CV·January 16, 2023

TL;DR

This paper investigates how dataset-specific linguistic patterns and diversity influence the performance and generalization of visual description models, revealing that current metrics and models favor generic captions due to limited diversity.

Contribution

The study analyzes linguistic diversity in caption datasets, demonstrating its impact on model performance and proposing methods to maintain diversity for better generalization.

Findings

01. Caption diversity affects the informativeness of generated captions.

02. State-of-the-art models can outscore held-out ground-truth captions on modern metrics due to dataset biases.

03. Limited linguistic diversity leads models to generate generic, uninformative descriptions.

Abstract

While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data leading to evaluation metric increases, but what are those patterns? In this work, we examine several popular visual description datasets, and capture, analyze, and understand the dataset-specific linguistic patterns that models exploit but do not generalize to new domains. At the token level, sample level, and dataset level, we find that caption diversity is a major driving factor behind the generation of generic and uninformative captions. We further show that state-of-the-art models even outperform held-out ground truth captions on modern metrics, and that this…
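As a rough illustration of what token-level caption diversity can mean, the sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams) over a set of captions. This is a common, generic diversity measure and is an assumption here, not necessarily the measure used in the paper or exposed by the vdtk repository. The function names and the toy captions are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(captions, n=1):
    """Ratio of unique n-grams to total n-grams across all captions.

    Whitespace tokenization and lowercasing are simplifying assumptions.
    Returns 0.0 when there are no n-grams to count.
    """
    counts = Counter()
    for caption in captions:
        counts.update(ngrams(caption.lower().split(), n))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Hypothetical toy caption sets: one repetitive, one varied.
generic = ["a man is talking", "a man is talking", "a man is singing"]
varied = ["a chef dices onions", "two dogs chase a ball", "rain falls on a tin roof"]

print(distinct_n(generic, 1), distinct_n(varied, 1))
```

A lower ratio means the captions reuse the same tokens more often. A dataset dominated by such captions gives a model little incentive to produce anything other than generic descriptions.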

Code & Models

Repositories

cannylab/vdtk (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Video Analysis and Summarization