SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

Anindita Ghosh; Vladislav Golyanik; Taku Komura; Philipp Slusallek; Christian Theobalt; Rishabh Dabral

arXiv:2602.20476 · cs.CV · February 25, 2026

TL;DR

SceMoS introduces a scene-aware 3D human motion synthesis method that leverages 2D scene representations for efficient and realistic motion planning and execution, reducing reliance on expensive 3D scene data.

Contribution

The paper presents SceMoS, a novel framework that uses 2D scene cues for physically grounded motion synthesis, outperforming methods relying on full 3D supervision.

Findings

01. Achieves state-of-the-art realism and contact accuracy on the TRUMANS benchmark.

02. Reduces scene encoding parameters by over 50%.

03. Effectively grounds 3D human-scene interaction using 2D cues.

Abstract

Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent ("walk to the couch") and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a bird's-eye-view (BEV) image rendered from an elevated corner of the scene, encoded with DINOv2 features, as…
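
The abstract describes a text-conditioned autoregressive planner that consumes features of a BEV image. As a rough illustration of that control flow only, the sketch below samples a sequence of discrete planning tokens one at a time, conditioning each step on a text vector, a scene vector, and the tokens produced so far. Everything in it is an assumption made for illustration: the function names, the vocabulary size, the concatenation of features, and the `toy_logits` scoring function, which stands in for a learned network. It is not the authors' architecture, and the feature vectors are hand-written placeholders rather than real DINOv2 outputs.

```python
import math
import random


def toy_logits(context, prev_tokens, vocab_size):
    """Hypothetical stand-in for a learned planner network (not from the paper)."""
    # Score each candidate token from the conditioning vector and the last token.
    last = prev_tokens[-1] if prev_tokens else 0
    base = sum(context) + last + 1
    return [math.sin((k + 1) * base) for k in range(vocab_size)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def plan_tokens(text_feat, bev_feat, num_steps, vocab_size=8, seed=0):
    """Autoregressively sample discrete tokens given text and scene features."""
    rng = random.Random(seed)
    # Assumption: conditioning is a plain concatenation of the two feature vectors.
    context = list(text_feat) + list(bev_feat)
    tokens = []
    for _ in range(num_steps):
        probs = softmax(toy_logits(context, tokens, vocab_size))
        # Sample the next token from the predicted distribution.
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


text_feat = [0.2, -0.1, 0.5]       # placeholder for a text embedding
bev_feat = [0.7, 0.3, -0.4, 0.1]   # placeholder for BEV image features
plan = plan_tokens(text_feat, bev_feat, num_steps=6)
print(plan)
```

In a real system the token sequence would be decoded into a trajectory or pose sequence by a separate module; that step is omitted here because the truncated abstract does not specify it.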

Taxonomy

Topics: Human Motion and Animation · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis