Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
Michele Cafagna, Kees van Deemter, Albert Gatt

TL;DR
This paper investigates whether a Vision and Language model, VinVL, can generate scene-level descriptions, and finds that fine-tuning on a small amount of curated data lets the model produce holistic scene summaries without sacrificing object recognition.
Contribution
It introduces a new dataset that pairs images with both object-centric and scene descriptions, and demonstrates that a small amount of curated data is enough for a model to generate scene descriptions effectively.
Findings
Small curated datasets suffice for scene description generation.
Models can produce holistic scene descriptions without losing object recognition.
Results parallel insights from computational and cognitive science research on scene perception.
Abstract
Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
