Interaction-Aware 4D Gaussian Splatting for Dynamic Hand-Object Interaction Reconstruction
Hao Tian, Chenyangguang Zhang, Rui Liu, Wen Shen, Xiaolin Qin

TL;DR
This paper introduces an interaction-aware 4D Gaussian Splatting method for reconstructing complex dynamic hand-object interactions without prior object models, improving structural clarity and motion modeling.
Contribution
It proposes novel interaction-aware Gaussian representations and a progressive optimization strategy to enhance dynamic scene reconstruction accuracy.
Findings
Outperforms existing 3D Gaussian Splatting methods.
Achieves state-of-the-art reconstruction quality.
Effectively models complex hand-object interactions.
Abstract
This paper focuses on the challenging setting of simultaneously modeling the geometry and appearance of hand-object interaction scenes without any object priors. We follow the trend of dynamic 3D Gaussian Splatting-based methods and address several significant challenges. To model complex hand-object interaction with mutual occlusion and edge blur, we present interaction-aware hand-object Gaussians with newly introduced optimizable parameters that adopt a piecewise linear hypothesis for clearer structural representation. Moreover, considering the complementarity and tightness of hand shape and object shape during interaction dynamics, we incorporate hand information into the object deformation field, constructing interaction-aware dynamic fields to model flexible motions. To further address difficulties in the optimization process, we propose a progressive strategy that handles dynamic…
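The abstract describes feeding hand information into the object deformation field but gives no formula. The following is only a toy sketch of one way such conditioning could look, not the authors' implementation: each object Gaussian center is paired with the offset to its nearest hand vertex, and the combined feature vector is passed to a stand-in linear map. The function names, the feature layout, and the linear map are all assumptions made for illustration.

```python
import math

def nearest_hand_offset(center, hand_vertices):
    """Offset from an object Gaussian center to its nearest hand vertex, plus the distance."""
    nearest = min(hand_vertices, key=lambda v: math.dist(center, v))
    offset = tuple(h - c for h, c in zip(nearest, center))
    return offset, math.dist(center, nearest)

def deformation_input(center, t, hand_vertices):
    """Hypothetical feature layout: [x, y, z, t, dx, dy, dz, dist]."""
    offset, dist = nearest_hand_offset(center, hand_vertices)
    return list(center) + [t] + list(offset) + [dist]

def toy_deformation(features, weights):
    """Stand-in for a learned network: a linear map from features to a 3D displacement."""
    return tuple(sum(w * f for w, f in zip(row, features)) for row in weights)

# Example: one object Gaussian at the origin and two hand vertices (made-up values).
hand_vertices = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
features = deformation_input((0.0, 0.0, 0.0), 0.5, hand_vertices)

# Assumed weights that move the Gaussian 10% of the way toward the nearest hand vertex.
weights = [
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0],
]
displacement = toy_deformation(features, weights)
```

In a real system the linear map would be replaced by a trained network and the hand vertices would come from a hand model such as MANO, which the reviews mention the method uses.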
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Three separate GS models are used for the hand, the object, and the background, which handles HOI better. It is also reasonable that the background GS model is updated less frequently in the test scenes of this paper.
2. Two learnable Gaussian parameters, a refinement weight and a radius, are introduced to model interactions.
3. The proposed method outperforms a few baselines on several sequences selected from HO3D and HOI4D.
1. The comparisons between the baselines and the proposed method are not convincing. Among the four selected baselines, 4DGS, Deform3DGS, and SC-GS are designed for general dynamic scenes, while HOLD adopts an implicit representation and is designed for 3D reconstruction. To demonstrate the superiority of the proposed method, a GS-based method tailored for HOI should be included.
2. Although better than the selected baselines, the qualitative results of the proposed method still exhibit a
1. The proposed method achieves superior performance in quantitative evaluations compared to existing approaches.
2. A diverse set of prior methods is evaluated against the proposed model.
3. New loss functions are introduced, and comprehensive ablation studies, especially on the Interaction Aware loss, validate their effectiveness.
4. The paper proposes a hand-conditioned 3DGS formulation for object modeling.
1. There is no comparison with BIGS, which also reconstructs meshes from input video.
2. Although the authors claim not to rely on object priors (first contribution: without any object priors), they still use the 3D bounding box of the object: “In optimization, we utilize explicit 3D information provided by MANO parameters Romero et al. (2022) and the 3D object bounding boxes.”
3. In the novel view synthesis results, the second and third samples appear flipped.
4. The novel view synthesis res
1. The paper is well written and easy to understand.
2. The method is technically sound. It also achieves SOTA rendering performance on two common benchmarks compared to existing 4D-GS-based methods.
1. The difference from previous hand-object reconstruction methods that use independent implicit fields to represent hand/object/background [1, 2] is not clear. Besides replacing SDF/NeRF with Gaussian Splatting, the other losses and the optimization strategy are not new.
2. Metrics on geometry reconstruction accuracy, such as Chamfer Distance, MPJPE/MPVPE, and F-score, are not reported, unlike in previous work [1].
3. For reconstruction, comparisons with many SOTA works are missing, including G-HOP[2
* Practical setup (no category-specific object priors) with a clear hand/object/background decomposition.
* Progressive optimization and interaction-aware constraints appear to stabilize training and improve novel-view quality.
* Problem statement and method exposition are unclear; the motivation for the new parameters is thin and mostly intuitive.
* Reliance on MANO vertices and an object 3D box undercuts the “no-prior” messaging; availability at test time is unclear.
* Ablations don’t isolate the key interaction design (e.g., hand-conditioned object field) or the specific benefit of the new parameters.
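One review asks for geometry metrics such as Chamfer Distance, MPJPE, and F-score. For reference, a minimal sketch of how these are commonly computed on small point sets is given below. Conventions vary between papers (squared versus unsquared distances, summing versus averaging the two directions, the choice of threshold), so the specific choices here are assumptions and may differ from the protocol used in the cited work. The brute-force nearest-neighbour search is only practical for small inputs.

```python
import math

def nearest_distances(src, dst):
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: sum of the two mean nearest-neighbour distances."""
    d_pred_to_gt = nearest_distances(pred, gt)
    d_gt_to_pred = nearest_distances(gt, pred)
    return sum(d_pred_to_gt) / len(pred) + sum(d_gt_to_pred) / len(gt)

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(d <= tau for d in nearest_distances(pred, gt)) / len(pred)
    recall = sum(d <= tau for d in nearest_distances(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error over joints given in the same order."""
    errors = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(errors) / len(errors)

# Made-up point sets: two points coincide, one differs by a unit offset.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]

cd = chamfer_distance(pred, gt)
fs = f_score(pred, gt, tau=0.5)
joint_error = mpjpe([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                    [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)])
```

MPVPE follows the same formula as `mpjpe`, applied to mesh vertices instead of joints.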
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Robot Manipulation and Learning
