TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image

Ziqian Wang; Yonghao He; Licheng Yang; Wei Zou; Hongxuan Ma; Liu Liu; Wei Sui; Yuxin Guo; Hu Su

arXiv:2512.01204 · cs.CV · December 8, 2025

TL;DR

TabletopGen is a training-free framework that generates diverse, high-fidelity, physically interactive 3D tabletop scenes from a reference image (which can itself be synthesized from a text prompt), improving scene realism and diversity for embodied AI applications.

Contribution

It introduces a novel pose and scale alignment method and a fully automatic pipeline for instance-level 3D scene generation from images or text.
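This page does not describe how TabletopGen's pose and scale alignment actually works. Purely as a generic point of reference, one standard way to recover a rotation, a uniform scale, and a translation between two sets of corresponding 3D points is the Umeyama least-squares similarity fit. The sketch below assumes point correspondences are already known (for example, between a reconstructed instance and hypothetical target points); the function name `estimate_similarity` is hypothetical, and this is not the paper's method.

```python
import numpy as np


def estimate_similarity(src: np.ndarray, dst: np.ndarray):
    """Least-squares similarity transform (Umeyama, 1991) mapping src onto dst.

    src, dst: (N, 3) arrays of corresponding points (correspondences assumed known).
    Returns (s, R, t) such that dst[i] ~= s * R @ src[i] + t.
    """
    assert src.shape == dst.shape and src.shape[1] == 3
    n = src.shape[0]
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between centered target and centered source points.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard: flip the last axis if needed so that det(R) = +1.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt

    # Optimal uniform scale and translation given R.
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

With exact, noise-free correspondences the fit recovers the generating transform up to floating-point error; with noisy points it returns the least-squares optimum. Note that it solves only for a single global scale, so anisotropic size mismatches would need a different model.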

Findings

01. Achieves state-of-the-art visual fidelity and layout accuracy.
02. Surpasses existing methods in physical plausibility and diversity.
03. Enables realistic and diverse tabletop scene generation.

Abstract

Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI, especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods focus mainly on large-scale scenes and struggle to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference to obtain per-instance images. Each instance is reconstructed into a 3D model, followed by canonical coordinate alignment. The aligned 3D…
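The abstract says each reconstructed instance undergoes canonical coordinate alignment but, in the visible excerpt, does not say how. One common heuristic, shown here only as an assumption and not as TabletopGen's procedure, is PCA: center the object's points and rotate its principal axes onto x, y and z (longest axis first). The function name `canonicalize_pca` is hypothetical. PCA fixes handedness here but leaves each axis's sign ambiguous, which a real pipeline would still need to resolve.

```python
import numpy as np


def canonicalize_pca(points: np.ndarray):
    """Center a 3D point cloud and rotate its principal axes onto x, y, z.

    points: (N, 3) array, e.g. samples from a reconstructed mesh surface.
    Returns (canonical_points, R, centroid) where
    canonical_points = (points - centroid) @ R and R is a proper rotation.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = centered.T @ centered / len(points)

    # eigh returns eigenvalues in ascending order; reverse the columns so the
    # largest-variance direction maps to x and the smallest to z.
    _, eigvecs = np.linalg.eigh(cov)
    R = eigvecs[:, ::-1].copy()

    # Keep a right-handed frame (det = +1) so the result is a rotation, not a reflection.
    if np.linalg.det(R) < 0:
        R[:, 2] *= -1.0
    return centered @ R, R, centroid
```

Because R is orthonormal, the original points can be recovered as `canonical_points @ R.T + centroid`, which makes it easy to undo the canonicalization after any later placement step.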

Taxonomy

Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation