Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation

Yu Zhao; Jianguo Wei; Zhichao Lin; Yueheng Sun; Meishan Zhang; Min Zhang

arXiv:2210.11109 · cs.CV · October 27, 2022 · 1 citation

TL;DR

This paper introduces Visual Spatial Description (VSD), a new image-to-text task focusing on spatial relationships between objects, supported by a dataset and models that generate accurate, human-like spatial descriptions.

Contribution

The paper presents the VSD task, a new dataset, and benchmark models, advancing image-to-text generation with a focus on spatial semantics and relationship classification.

Findings

01. Models produce accurate, human-like spatial descriptions.

02. Joint end-to-end architecture outperforms pipeline approaches.

03. VSRC enhances the quality of spatial descriptions.

Abstract

Image-to-text tasks, such as open-ended image captioning and controllable image description, have received extensive attention for decades. Here, we further advance this line of work by presenting Visual Spatial Description (VSD), a new perspective for image-to-text toward spatial semantics. Given an image and two objects inside it, VSD aims to produce one description focusing on the spatial perspective between the two objects. Accordingly, we manually annotate a dataset to facilitate the investigation of the newly-introduced task and build several benchmark encoder-decoder models by using VL-BART and VL-T5 as backbones. In addition, we investigate pipeline and joint end-to-end architectures for incorporating visual spatial relationship classification (VSRC) information into our model. Finally, we conduct experiments on our benchmark dataset to evaluate all our models. Results show that…
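The task setup and the pipeline idea described in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration: the `VSDExample` fields, the bounding-box convention, the geometric rule in `toy_relation` (a stand-in for a learned VSRC classifier), and the prompt format in `build_pipeline_prompt` are hypothetical and are not taken from the paper or its repository. The sketch only shows the shape of a pipeline approach, where a relation is predicted first and then passed as extra text to a generator such as VL-T5 or VL-BART.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max) in pixels,
# with y growing downward as in most image coordinate systems.
BBox = Tuple[float, float, float, float]


@dataclass
class VSDExample:
    """One hypothetical VSD input: an image plus two objects inside it."""
    image_id: str
    subject: str
    subject_box: BBox
    obj: str
    object_box: BBox


def _center(box: BBox) -> Tuple[float, float]:
    """Return the (x, y) center of a bounding box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def toy_relation(example: VSDExample) -> str:
    """Toy geometric stand-in for a VSRC classifier (NOT the paper's model).

    Compares box centers and returns the position of the subject
    relative to the object.
    """
    sx, sy = _center(example.subject_box)
    ox, oy = _center(example.object_box)
    if abs(ox - sx) >= abs(oy - sy):
        return "left of" if sx < ox else "right of"
    return "above" if sy < oy else "below"


def build_pipeline_prompt(example: VSDExample) -> str:
    """Pipeline step: predict the relation, then expose it as text.

    The prompt format is an assumption; a real system would pair this
    text with visual features before decoding a description.
    """
    relation = toy_relation(example)
    return (
        f"describe the spatial relation: {example.subject} ; "
        f"{example.obj} ; relation: {relation}"
    )


example = VSDExample(
    image_id="img_0001",          # hypothetical identifier
    subject="cat",
    subject_box=(0.0, 0.0, 10.0, 10.0),
    obj="sofa",
    object_box=(20.0, 0.0, 30.0, 10.0),
)
prompt = build_pipeline_prompt(example)
```

A joint end-to-end model, by contrast, would learn relation classification and description generation together rather than passing a predicted label as text between two separate stages.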

Code & Models

Repositories

zhaoyucs/vsd
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: VL-T5