Predicting Winning Captions for Weekly New Yorker Comics

Stanley Cao; Sonny Young

arXiv:2407.18949·cs.CV·July 30, 2024

Open Access

TL;DR

This paper develops vision transformer-based models to generate humorous captions for New Yorker cartoons, aiming to emulate winning entries and enhance understanding of visual humor and cultural nuances.

Contribution

It introduces new baseline models using vision transformer encoder-decoder architectures tailored for captioning humorous cartoons.

Findings

01. Proposed vision transformer models effectively generate captions that capture humor.

02. Models outperform previous baselines in humor and relevance.

03. Cultural context is shown to be important for caption quality.

Abstract

Image captioning using Vision Transformers (ViTs) represents a pivotal convergence of computer vision and natural language processing, offering the potential to enhance user experiences, improve accessibility, and provide textual representations of visual data. This paper explores the application of image captioning techniques to New Yorker cartoons, aiming to generate captions that emulate the wit and humor of winning entries in the New Yorker Cartoon Caption Contest. This task necessitates sophisticated visual and linguistic processing, along with an understanding of cultural nuances and humor. We propose several new baselines for using vision transformer encoder-decoder models to generate captions for the New Yorker Cartoon Caption Contest.
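
As background for the encoder side of the vision transformer encoder-decoder models mentioned above: a ViT reads an image as a sequence of fixed-size, non-overlapping patches rather than as a single grid. The sketch below illustrates only that first step in plain Python. The function name `patchify`, the patch size, and the toy 4x4 image are illustrative assumptions and are not taken from the paper, which does not state these details here.

```python
def patchify(image, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order, i.e. left to right, then top
    to bottom, which is the order they would form a token sequence in.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single list.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 single-channel "image" with pixel values 0..15 (hypothetical data).
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(toy_image, patch_size=2)
```

In a full ViT, each flattened patch would then be linearly projected and combined with a positional embedding before passing through the transformer encoder; a text decoder attends to the resulting sequence to produce caption tokens one at a time.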

Taxonomy

Topics: Comics and Graphic Narratives · Translation Studies and Practices