Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation
Yu Zhao, Jianguo Wei, Zhichao Lin, Yueheng Sun, Meishan Zhang, Min Zhang

TL;DR
This paper introduces Visual Spatial Description (VSD), a new image-to-text task focusing on spatial relationships between objects, supported by a dataset and models that generate accurate, human-like spatial descriptions.
Contribution
The paper presents the VSD task, a new dataset, and benchmark models, advancing image-to-text generation with a focus on spatial semantics and relationship classification.
Findings
Models produce accurate, human-like spatial descriptions.
Joint end-to-end architecture outperforms pipeline approaches.
VSRC enhances the quality of spatial descriptions.
Abstract
Image-to-text tasks, such as open-ended image captioning and controllable image description, have received extensive attention for decades. Here, we further advance this line of work by presenting Visual Spatial Description (VSD), a new perspective for image-to-text toward spatial semantics. Given an image and two objects inside it, VSD aims to produce one description focusing on the spatial perspective between the two objects. Accordingly, we manually annotate a dataset to facilitate the investigation of the newly-introduced task and build several benchmark encoder-decoder models by using VL-BART and VL-T5 as backbones. In addition, we investigate pipeline and joint end-to-end architectures for incorporating visual spatial relationship classification (VSRC) information into our model. Finally, we conduct experiments on our benchmark dataset to evaluate all our models. Results show that…
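The task setup in the abstract (an image, two objects inside it, and one spatial description as output) can be sketched as a small data structure. This is only an illustrative sketch: the names `ObjectRegion`, `VSDExample`, and `build_text_input`, the bounding-box layout, and the prompt wording are all assumptions made here, not the paper's actual data format or model input. The optional `predicted_relation` argument illustrates one plausible reading of the pipeline variant, where a relation label from a separate VSRC classifier is passed to the generator.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ObjectRegion:
    """One of the two query objects (hypothetical layout)."""
    label: str
    box: Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


@dataclass
class VSDExample:
    """A single VSD instance: an image, two objects, and a target description."""
    image_id: str
    subject: ObjectRegion
    obj: ObjectRegion
    relation: Optional[str] = None      # VSRC label, if annotated
    description: Optional[str] = None   # reference spatial description


def build_text_input(example: VSDExample,
                     predicted_relation: Optional[str] = None) -> str:
    """Build a text prompt for an encoder-decoder model (hypothetical format).

    If predicted_relation is given, it is appended to the prompt, mimicking a
    pipeline setup in which a classifier's output conditions the generator.
    """
    parts = ["describe spatial relation:",
             example.subject.label, "and", example.obj.label]
    if predicted_relation is not None:
        parts += ["relation:", predicted_relation]
    return " ".join(parts)


example = VSDExample(
    image_id="img_0001",  # made-up identifier for illustration
    subject=ObjectRegion("man", (10.0, 20.0, 110.0, 220.0)),
    obj=ObjectRegion("bench", (0.0, 150.0, 300.0, 260.0)),
)
print(build_text_input(example))
print(build_text_input(example, predicted_relation="on"))
```

In a joint end-to-end setup, the relation would presumably not be supplied as an external string but learned together with generation; the sketch above does not model that case.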
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: VL-T5
