EgoLM: Multi-Modal Language Model of Egocentric Motions
Fangzhou Hong, Vladimir Guzov, Hyo Jin Kim, Yuting Ye, Richard Newcombe, Ziwei Liu, Lingni Ma

TL;DR
EgoLM is a multi-modal language model that leverages egocentric videos and sensor data to improve motion tracking and understanding by integrating natural language processing with motion analysis.
Contribution
It introduces a novel framework that models the joint distribution of egocentric motions and language using large language models, enabling versatile multi-modal egocentric learning.
Findings
EgoLM outperforms existing methods in egocentric motion understanding.
The framework effectively integrates multi-modal inputs for improved disambiguation.
Experiments demonstrate EgoLM's versatility across large-scale datasets.
Abstract
With the growing prevalence of wearable devices, learning egocentric motions becomes essential for developing contextual AI. In this work, we present EgoLM, a versatile framework that tracks and understands egocentric motions from multi-modal inputs, e.g., egocentric videos and motion sensors. EgoLM exploits rich contexts to disambiguate egomotion tracking and understanding, which are ill-posed under single-modality conditions. To facilitate this versatile, multi-modal framework, our key insight is to model the joint distribution of egocentric motions and natural language using large language models (LLMs). Multi-modal sensor inputs are encoded and projected into the joint latent space of language models, and used to prompt motion generation or text generation for egomotion tracking or understanding, respectively. Extensive experiments on large-scale multi-modal human motion dataset…
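The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`project`, `quantize`, `build_prompt`), the toy dimensions, and the random projection weights are all assumptions made for illustration. It shows two ideas only: mapping a continuous pose to a discrete motion token by nearest-neighbour lookup in a codebook (as a VQ-VAE quantizer does), and projecting sensor features into the language model's embedding space so they can be prepended to text-token embeddings as a prompt.

```python
import random

def project(features, weight):
    """Linear map: each feature vector (length d_in) -> embedding (length d_model).
    `weight` is a list of d_model rows, each of length d_in (hypothetical layout)."""
    return [[sum(f * w for f, w in zip(vec, row)) for row in weight]
            for vec in features]

def quantize(pose, codebook):
    """Return the index of the codebook entry nearest to `pose` (squared L2).
    This index plays the role of a discrete motion token."""
    dists = [sum((p - c) ** 2 for p, c in zip(pose, code)) for code in codebook]
    return dists.index(min(dists))

def build_prompt(sensor_feats, weight, text_embeds):
    """Prepend projected sensor embeddings to the text-token embeddings."""
    return project(sensor_feats, weight) + text_embeds

# Toy setup: all sizes and values below are illustrative assumptions.
rng = random.Random(0)
d_in, d_model = 3, 4
weight = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_model)]
sensor_feats = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # two sensor frames
text_embeds = [[0.0] * d_model for _ in range(5)]    # five text tokens
prompt = build_prompt(sensor_feats, weight, text_embeds)

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
token = quantize([0.9, 1.2], codebook)
```

In this sketch, `prompt` is the sequence a language model would condition on, and `token` is the kind of discrete symbol it would be trained to predict for motion generation.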
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- This paper presents a new framework named EgoLM for understanding egocentric motion.
- The figures are high-quality and aid in understanding the proposed approach.
- Limited technical novelty: Sections 3.2 and 3.3 are almost identical to T2M-GPT, as they discuss training a motion VQ-VAE and training an autoregressive LM on motion tokens. This makes it challenging to consider these sections as novel contributions. Section 3.4, which introduces multi-modal multi-task training, is the only unique technical contribution; however, it merely combines different modalities to instruction-tune the LM without introducing new techniques.
- The proposed framework does no
1. The paper shows exhaustive evaluation against the unimodal baselines and compares both the multi-modal and uni-modal versions of the method.
2. The paper leverages multiple modalities of sensor data and visual data for better motion tracking.
3. The qualitative results are exhaustive.
1. How does the method deal with scenarios in which both the video and sensor data become ambiguous? For example, if the person is using a treadmill and facing a white wall, can the sensor data confuse walking with running?
2. How is the correspondence/association established between the two modalities, the CLIP embedding and the pose data?
EgoLM stands out for its versatility, unifying diverse egocentric motion tasks such as narration, generation, and text synthesis in a single framework. The model effectively combines video and sensor inputs, capturing rich contextual information that enhances motion understanding. Extensive experiments demonstrate EgoLM’s adaptability for wearable technology applications.
While I believe that using an egocentric body model with a language model is a promising research direction, I have concerns about its practicality in real-world applications due to potentially slower inference speeds. If inference time is slow, it may hinder real-time motion tracking capabilities, even with an online setup. Regarding motion tracking, state-of-the-art methods such as [1], [2], and [3] were available at the time of submission to ICLR. Comparing EgoLM's results to these more rece
+ The data this paper focuses on will matter greatly in the future, as VR equipment will become an important entry point for human life data.
+ Reconstructing human motion from sparse 6D pose data is non-trivial.
- The method contribution: most of the method designs in this work are common practice in other VLM or MLLM works, such as instruction-tuning tricks, VQ-VAE encoding and training, joint-space training, multi-modal tokenization, etc. Please add a comparison with previous works, or state which part is your unique contribution.
- Though this work focuses on sparse headset pose data, many existing 3D pose datasets on videos (like BABEL) can also be formalized as similar benchmarks. At least, t
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Geographic Information Systems Studies
