Predicting Winning Captions for Weekly New Yorker Comics
Stanley Cao, Sonny Young

TL;DR
This paper develops vision transformer-based models to generate humorous captions for New Yorker cartoons, aiming to emulate winning entries and enhance understanding of visual humor and cultural nuances.
Contribution
It introduces new baseline models using vision transformer encoder-decoder architectures tailored for captioning humorous cartoons.
Findings
Proposed vision transformer models effectively generate captions that capture humor.
Models outperform previous baselines in humor and relevance.
The results demonstrate the importance of cultural context in caption quality.
Abstract
Image captioning using Vision Transformers (ViTs) represents a pivotal convergence of computer vision and natural language processing, offering the potential to enhance user experiences, improve accessibility, and provide textual representations of visual data. This paper explores the application of image captioning techniques to New Yorker cartoons, aiming to generate captions that emulate the wit and humor of winning entries in the New Yorker Cartoon Caption Contest. This task necessitates sophisticated visual and linguistic processing, along with an understanding of cultural nuances and humor. We propose several new baselines that use vision transformer encoder-decoder models to generate captions for the New Yorker Cartoon Caption Contest.
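The abstract does not describe the models in detail, so the following is only an illustrative sketch of the first step a ViT-style encoder generally performs: cutting the image into fixed-size, non-overlapping patches that become the encoder's input sequence. It assumes a grayscale image stored as nested Python lists; the names `image_to_patches` and `toy_image` are hypothetical and do not come from the paper.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values (assumed layout)


def image_to_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W image into flattened, non-overlapping patch vectors.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    height = len(image)
    width = len(image[0]) if height else 0
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size window into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 "cartoon" with pixel values 0..15, split into 2x2 patches.
toy_image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
toy_patches = image_to_patches(toy_image, patch_size=2)
print(len(toy_patches), "patches of length", len(toy_patches[0]))
```

In a full encoder-decoder captioner, each patch vector would typically be linearly projected, combined with a position embedding, and passed through the transformer encoder, with a text decoder attending to the encoder outputs to produce the caption token by token. Those later stages are omitted here.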
Taxonomy
Topics: Comics and Graphic Narratives · Translation Studies and Practices
