Seq-SG2SL: Inferring Semantic Layout from Scene Graph Through Sequence to Sequence Learning
Boren Li, Boyu Zhuang, Mingyang Li, Jian Gu

TL;DR
This paper introduces Seq-SG2SL, a Transformer-based sequence-to-sequence (seq-to-seq) framework that generates semantic layouts from scene graphs, together with a new evaluation metric, SLEU. It reports improved results on Visual Genome.
Contribution
The paper proposes a novel seq-to-seq approach for semantic layout prediction from scene graphs, introducing a new evaluation metric SLEU tailored for spatial accuracy.
Findings
Seq-SG2SL outperforms non-sequential graph convolution methods.
The new SLEU metric effectively evaluates spatial relationships.
The Transformer-based model improves the accuracy of semantic layout generation.
Abstract
Generating a semantic layout from a scene graph is a crucial intermediate task connecting text to image. We present a conceptually simple, flexible and general framework using sequence to sequence (seq-to-seq) learning for this task. The framework, called Seq-SG2SL, derives sequence proxies for the two modalities, and a Transformer-based seq-to-seq model learns to transduce one into the other. A scene graph is decomposed into a sequence of semantic fragments (SF), one for each relationship. A semantic layout is represented as the consequence of a series of brick-action code segments (BACS), dictating the position and scale of each object bounding box in the layout. Viewing the two building blocks, SF and BACS, as corresponding terms in two different vocabularies, a seq-to-seq model is fittingly used to translate. A new metric, semantic layout evaluation understudy (SLEU), is devised to…
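The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes a scene graph given as (subject, predicate, object) triples and produces one fragment per relationship, flattened into a source token sequence. The `<sep>` marker, the function names, and the grid-based box tokens are hypothetical choices made for illustration; the abstract does not specify the actual SF or BACS token formats, so this is not the authors' encoding.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]              # (subject, predicate, object)
Box = Tuple[float, float, float, float]    # (x, y, w, h), normalized to [0, 1]


def to_semantic_fragments(relationships: List[Triple]) -> List[List[str]]:
    """Build one fragment per relationship, preserving the input order."""
    return [[subj, pred, obj] for subj, pred, obj in relationships]


def to_source_sequence(fragments: List[List[str]], sep: str = "<sep>") -> List[str]:
    """Flatten fragments into one token sequence.

    The separator token is a hypothetical marker, not taken from the paper.
    """
    tokens: List[str] = []
    for i, fragment in enumerate(fragments):
        if i > 0:
            tokens.append(sep)
        tokens.extend(fragment)
    return tokens


def box_to_tokens(box: Box, grid: int = 8) -> List[str]:
    """Quantize a normalized box into discrete position and scale tokens.

    This stands in for a target-side vocabulary; the real BACS encoding
    is assumed to differ.
    """
    def quantize(value: float) -> int:
        # Clamp so that a value of exactly 1.0 falls in the last cell.
        return min(int(value * grid), grid - 1)

    x, y, w, h = box
    return [f"x{quantize(x)}", f"y{quantize(y)}", f"w{quantize(w)}", f"h{quantize(h)}"]


scene_graph = [("man", "riding", "horse"), ("horse", "on", "grass")]
fragments = to_semantic_fragments(scene_graph)
source_tokens = to_source_sequence(fragments)
target_tokens = box_to_tokens((0.5, 0.25, 1.0, 0.1))
```

Under this framing, both sides of the problem become token sequences over separate vocabularies, which is what allows a standard translation-style model to map one to the other.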
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
