Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
Hao-Yu Hsu, Tianhang Cheng, Jing Wen, Alexander G. Schwing, Shenlong Wang

TL;DR
This paper presents IMU-to-4D, a framework that uses wearable inertial sensors and large language models to reconstruct human motion and scene layouts without visual data, avoiding the privacy and energy costs of cameras.
Contribution
It introduces a new method that repurposes language models for non-visual 4D human-scene understanding using only wearable inertial sensors.
Findings
IMU-to-4D outperforms state-of-the-art cascaded pipelines in coherence and temporal stability.
The framework successfully predicts detailed 4D human motion and coarse scene structure.
Wearable sensors alone can support rich 4D understanding of human-scene interactions.
Abstract
Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision, where the goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. To this end, we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D takes data from a few inertial sensors in earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than state-of-the-art (SoTA) cascaded pipelines, suggesting that wearable motion sensors alone can support rich 4D understanding.
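The abstract does not say how inertial readings are fed to a language model. As a purely illustrative sketch, and not the authors' method, one common way to make continuous sensor streams consumable by a token-based model is uniform quantization into a fixed vocabulary. The function names, the ±20 range, and the 256-bin vocabulary below are all assumptions chosen for the example.

```python
# Hypothetical sketch: uniform quantization of IMU readings into token ids.
# Nothing here is taken from the paper; range and bin count are assumed.

def quantize(value, low=-20.0, high=20.0, n_bins=256):
    """Map one float reading to an integer token id in [0, n_bins)."""
    clipped = min(max(value, low), high)         # clamp out-of-range readings
    scaled = (clipped - low) / (high - low)      # normalize to [0, 1]
    return min(int(scaled * n_bins), n_bins - 1)  # top edge falls in last bin


def dequantize(token, low=-20.0, high=20.0, n_bins=256):
    """Map a token id back to the center of its bin."""
    return low + (token + 0.5) * (high - low) / n_bins


def tokenize_window(window, **kwargs):
    """Flatten a window (list of frames, each a list of channel readings)
    into one token sequence, frame by frame."""
    return [quantize(v, **kwargs) for frame in window for v in frame]


if __name__ == "__main__":
    # Two frames with two channels each (values are made up).
    window = [[0.0, 9.81], [-3.2, 0.5]]
    print(tokenize_window(window))
```

The round trip through `quantize` and `dequantize` loses at most half a bin width per reading, which is the usual trade-off between vocabulary size and sensor resolution in a scheme like this.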
