AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting
Aymen Mir, Jian Wang, Riza Alp Guler, Chuan Guo, Gerard Pons-Moll, Bing Zhou

TL;DR
This paper introduces a new method for animating humans in 3D scenes using Gaussian Splatting, enabling realistic, geometry-consistent free-viewpoint rendering and interaction without relying on explicit scene geometry.
Contribution
The paper pioneers the use of 3D Gaussian Splatting for human-scene animation, decoupling rendering from motion synthesis and allowing animation from monocular videos.
Findings
Achieves photorealistic free-viewpoint rendering of animated humans.
Enables natural human-scene interactions without explicit scene geometry.
Supports animation from monocular RGB videos.
Abstract
We present a novel framework for animating humans in 3D scenes using 3D Gaussian Splatting (3DGS), a neural scene representation that has recently achieved state-of-the-art photorealistic results for novel-view synthesis but remains under-explored for human-scene animation and interaction. Unlike existing animation pipelines that use meshes or point clouds as the underlying 3D representation, our approach introduces the use of 3DGS as the 3D representation for animating humans in scenes. By representing humans and scenes as Gaussians, our approach allows geometry-consistent free-viewpoint rendering of humans interacting with 3D scenes. Our key insight is that rendering can be decoupled from motion synthesis, and each sub-problem can be addressed independently without the need for paired human-scene data. Central to our method is a Gaussian-aligned motion module that synthesizes motion…
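The claim that representing both humans and scenes as Gaussians gives geometry-consistent rendering can be illustrated with the standard 3DGS blending rule, in which splats along a camera ray are sorted by depth and alpha-composited front to back. The following is a minimal sketch under stated assumptions, not the authors' implementation: `Splat` and `composite` are hypothetical names, and each splat is reduced to a single depth, opacity, and color along one ray. It shows that once the human and scene sets are merged, occlusion falls out of the depth ordering without any explicit scene mesh.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Splat:
    """Hypothetical per-ray stand-in for one 3D Gaussian (illustration only)."""
    depth: float                       # distance from the camera along the ray
    alpha: float                       # opacity of the splat at this pixel
    color: Tuple[float, float, float]  # RGB in [0, 1]
    source: str                        # "scene" or "human", for readability


def composite(splats: List[Splat]) -> Tuple[float, float, float]:
    """Front-to-back blending: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for s in sorted(splats, key=lambda s: s.depth):
        weight = s.alpha * transmittance
        for k in range(3):
            pixel[k] += weight * s.color[k]
        transmittance *= 1.0 - s.alpha
    return (pixel[0], pixel[1], pixel[2])


# Assumed toy setup: a blue chair in front of a red human, grey wall behind.
scene = [
    Splat(depth=2.0, alpha=0.9, color=(0.0, 0.0, 1.0), source="scene"),
    Splat(depth=5.0, alpha=1.0, color=(0.5, 0.5, 0.5), source="scene"),
]
human = [Splat(depth=3.0, alpha=0.8, color=(1.0, 0.0, 0.0), source="human")]

# Merging is plain concatenation; depth sorting resolves who occludes whom.
pixel = composite(scene + human)
```

In this toy setup the chair contributes most of the pixel color because it is closest to the camera, and the human is mostly hidden behind it. A full renderer would additionally project each 3D Gaussian to a 2D footprint per view, which this sketch omits.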
Peer Reviews
Decision·Submitted to ICLR 2026
As far as I know, this is the first work that models human-environment interactions in 3D Gaussian space. Although the interaction is quite basic (it cannot model intricate object manipulation), it is still impressive that the model can handle occlusion and collision. The video result quality is good and can inspire future works.
The critical weakness, in my opinion, is that there is no visual comparison (especially video comparison) with baselines. The authors attempted to design naive baselines (which I appreciate, since there were no previous 3D Gaussian works that enabled human-environment interaction), and the design is reasonable. However, they did not show any visual comparisons with them (i.e., baseline A and baseline B presented in the experiment section), and I think this is a significant weakness. Although rea
* The task of animating digital avatars in 3D scenes is important in various AR/VR applications, e.g., game engines.
* Modeling 3D scenes and digital avatars separately and then combining them at test time is a reasonable choice; it matches use cases such as games, where the digital avatars and 3D scenes are replaceable.
* The writing and the entire pipeline are easy to follow.
* The interactions demonstrated in the paper are limited to two simple motions: walking in open space and sitting on a chair. Other common daily activities, such as running or walking at different speeds, reaching for objects, or lying on a bed, are not explored.
* The motions shown in the supplementary videos, while physically plausible, still appear somewhat unnatural. For instance, the arms do not swing naturally or fluidly during walking sequences.
* Since the motion synthesis module is optim
* To the best of my knowledge, this is the first work to jointly address motion generation in an environment and photorealistic rendering.
* The work raises an interesting question: is it possible to generate human animation in a scene when both are represented as 3D Gaussians rather than as meshes, as in previous work?
* The approach demonstrates convincing qualitative results with sufficient visual realism.
* Limited novelty. While the results are good, the presented method is mostly an engineering pipeline built on existing methods, with the "refinement stage" as the only novel part. The refinement stage is presented as a set of heuristics to polish the Gaussians' positions: the frames and Gaussians are selected based on predefined rules and thresholds, as are the Gaussians' offset directions.
* Continued reliance on meshes. While the work investigates the applicability of 3D Gaussians for moti
Taxonomy
Topics: Human Motion and Animation · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis
