Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions

Michele Cafagna; Kees van Deemter; Albert Gatt

arXiv:2211.04971 · cs.CL · November 11, 2022

TL;DR

This paper investigates whether a Vision and Language model, VinVL, can caption images at the scene level, and shows that fine-tuning on a small amount of curated data lets the model produce holistic scene descriptions without losing its ability to identify objects.

Contribution

It introduces a new dataset that pairs images with both object-centric and scene descriptions, and demonstrates that a small amount of curated data is enough for the model to generate scene descriptions.

Findings

01. A small amount of curated data suffices for generating scene descriptions.

02. The fine-tuned model produces holistic scene descriptions without losing the ability to identify object-level concepts.

03. The results parallel insights from computational and cognitive science research on scene perception.

Abstract

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization