Towards Multimodal Vision-Language Models Generating Non-Generic Text
Wes Robbins, Zanyar Zohourianshahzadi, and Jugal Kalita

TL;DR
This paper extends vision-language models to use auxiliary information such as person names, enabling more specific and contextually relevant image captions. It demonstrates the approach with a new dataset and fine-tuning.
Contribution
It introduces a novel framework that incorporates auxiliary classifiers, such as facial recognition, into multimodal models for more detailed image captioning.
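As a rough illustration of what feeding auxiliary classifier output into a captioner could look like, the sketch below gathers tokens from several classifiers and appends them to a fixed vocabulary as per-image candidates. Everything here is an assumption made for illustration: the names `AuxToken`, `collect_aux_tokens`, `build_output_space`, and the stub classifiers are hypothetical and are not taken from the paper, and the step where a decoder selects among these candidates is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class AuxToken:
    text: str    # token string, e.g. a person name or an OCR word
    source: str  # name of the auxiliary classifier that produced it


# Hypothetical classifier interface: image identifier in, strings out.
Classifier = Callable[[str], List[str]]


def collect_aux_tokens(image_id: str,
                       classifiers: Dict[str, Classifier]) -> List[AuxToken]:
    """Run every auxiliary classifier and merge results, dropping repeats."""
    tokens: List[AuxToken] = []
    seen = set()
    for source, classify in classifiers.items():
        for text in classify(image_id):
            key = (text.lower(), source)
            if key in seen:
                continue  # skip a duplicate from the same classifier
            seen.add(key)
            tokens.append(AuxToken(text=text, source=source))
    return tokens


def build_output_space(fixed_vocab: Sequence[str],
                       aux_tokens: Sequence[AuxToken]) -> List[str]:
    """Fixed vocabulary followed by the per-image auxiliary candidates."""
    return list(fixed_vocab) + [t.text for t in aux_tokens]


# Stub classifiers standing in for real OCR and face recognition (assumed).
def fake_ocr(image_id: str) -> List[str]:
    return ["FINAL", "2019"]


def fake_face_recognizer(image_id: str) -> List[str]:
    return ["Jane Doe"]


tokens = collect_aux_tokens(
    "img_001", {"ocr": fake_ocr, "face": fake_face_recognizer}
)
space = build_output_space(["a", "person", "at", "the"], tokens)
```

Keeping the `source` tag on each token would let a model treat names and OCR words differently, and adding another classifier only requires a new entry in the dictionary.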
Findings
Models can incorporate person names into captions.
Fine-tuning on the PAC dataset improves caption specificity.
Baseline scores establish benchmarks for future work.
Abstract
Vision-language models can assess visual context in an image and generate descriptive text. While the generated text may be accurate and syntactically correct, it is often overly general. To address this, recent work has used optical character recognition to supplement visual information with text extracted from an image. In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models. We modify previous multimodal frameworks to accept relevant information from any number of auxiliary classifiers. In particular, we focus on person names as an additional set of tokens and create a novel image-caption dataset to facilitate captioning with person names. The dataset, Politicians and Athletes in Captions (PAC), consists of captioned images of well-known people in context. By fine-tuning…
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Handwritten Text Recognition Techniques
