SimpliHuMoN: Simplifying Human Motion Prediction
Aadya Agrawal, Alexander Schwing

TL;DR
SimpliHuMoN introduces a transformer-based model that simplifies human motion prediction by effectively capturing spatial and temporal dependencies, achieving state-of-the-art results across multiple benchmarks for pose, trajectory, and combined tasks.
Contribution
The paper presents a versatile, end-to-end transformer model that unifies different human motion prediction tasks without task-specific modifications.
Findings
Achieves state-of-the-art results on Human3.6M, AMASS, ETH-UCY, and 3DPW datasets.
Effectively captures spatial and temporal dependencies within motion sequences.
Handles pose-only, trajectory-only, and combined prediction tasks seamlessly.
Abstract
Human motion prediction combines the tasks of trajectory forecasting and human pose prediction. For each of the two tasks, specialized models have been developed. Combining these models for holistic human motion prediction is non-trivial, and recent methods have struggled to compete on established benchmarks for individual tasks. To address this, we propose a simple yet effective transformer-based model for human motion prediction. The model employs a stack of self-attention modules to effectively capture both spatial dependencies within a pose and temporal relationships across a motion sequence. This simple, streamlined, end-to-end model is sufficiently versatile to handle pose-only, trajectory-only, and combined prediction tasks without task-specific modifications. We demonstrate that this approach achieves state-of-the-art results across all tasks through extensive experiments on a…
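The abstract's central idea, self-attention applied jointly across the joints of a pose and across time, can be illustrated with a minimal sketch. This is not the authors' implementation: the token layout (entry 0 of each frame as the trajectory point, the rest as pose joints), the additive type embedding, the single attention head, and all names (`build_tokens`, `self_attention`, `type_embedding`) are assumptions made for illustration. Positional information and the rest of the transformer block are omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def build_tokens(frames, type_embedding):
    """Flatten T frames of J entries into T*J tokens.

    Assumption (hypothetical layout): entry 0 of each frame is the global
    trajectory point and the remaining entries are pose joints. A type
    embedding is added so the two modalities can be told apart.
    """
    tokens = []
    for frame in frames:
        for j, coords in enumerate(frame):
            emb = type_embedding["trajectory" if j == 0 else "pose"]
            tokens.append([c + e for c, e in zip(coords, emb)])
    return tokens

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens.

    Because every (frame, joint) token attends to every other token, one
    layer mixes information across joints and across time.
    """
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights

rng = random.Random(0)
T, J, D = 4, 3, 3  # frames, entries per frame (1 trajectory + 2 joints), feature size
frames = [[[rng.uniform(-1, 1) for _ in range(D)] for _ in range(J)] for _ in range(T)]
type_embedding = {"trajectory": [0.1] * D, "pose": [-0.1] * D}
w_q, w_k, w_v = ([[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(D)] for _ in range(3))

tokens = build_tokens(frames, type_embedding)
attended, weights = self_attention(tokens, w_q, w_k, w_v)
print(len(attended), len(attended[0]))  # 12 tokens, each with 3 features
```

Under this layout, a pose-only or trajectory-only input would simply contain fewer tokens per frame, which is one plausible way a single model could serve all three tasks.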
Peer Reviews
Decision: Submitted to ICLR 2026
The experimental evaluation is thorough. The method is validated across multiple tasks and datasets, compared against numerous state-of-the-art (SOTA) approaches, and achieves competitive results. This work successfully demonstrates that a simple network architecture can effectively tackle this complex problem, offering a fresh and inspiring perspective for future research.
Multimodal modeling mechanism: The current approach uses only a simple type embedding to distinguish between trajectory and pose modalities, without explicitly modeling their underlying physical coupling (e.g., how gait influences arm swing).
Prediction horizon: How much past observation is required, and how far into the future can the model reliably predict? How does performance degrade as the prediction horizon increases?
Temporal jitter: Does the model suffer from jitter or unnatural motion
- **(S1) Unified and Versatile Architecture:** The model's key strength is its generality. A single, unified transformer architecture successfully handles pose, trajectory, and combined prediction without any task-specific modifications. This directly addresses the prevalent issue of fragmentation, where competing models are often hyper-specialized.
- **(S2) Rigorous State-of-the-Art Evaluation:** The authors conduct a comprehensive and robust evaluation across a wide range of standard benchmark

### Major
- **(W1) Ambiguous Multi-Modal Prediction Mechanism:** The method for generating K distinct future hypotheses is unclear. Section 2.3 mentions a linear projection creates K parallel branches, but the exact mechanism is not detailed. If this is a single, large linear layer, it is not obvious how this architecture efficiently scales to the K=20 proposals required for trajectory forecasting benchmarks. The paper needs to clarify if the output head's size is fixed or dynamic, and how it h
1. Breadth of evaluation across three settings (traj-only, pose-only, traj+pose) with consistent K-mode reporting and per-task K values.
2. Ablations exploring depth/width trade-offs and effect of multimodality (K>1 vs K=1).
3. The text is clearly written and easy to follow.
1. The biggest concern with this work is its novelty. The motivation of predicting global trajectory and full-body pose jointly (or conditioning one on the other) is not new [1],[2],[3],[4].
2. The proposed model architecture is a standard decoder with learnable queries, which limits its ability to model complex social interactions among pedestrians. Following this limitation, the dataset used for pose prediction contains at most three pedestrians, which cannot reflect complex social interactions in real life. I w
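The reviews above refer to K-mode reporting and to K=20 proposals for trajectory benchmarks. A common way to score K hypotheses in trajectory forecasting is to keep the best one, giving the minimum average displacement error (ADE) and minimum final displacement error (FDE). The sketch below assumes that convention; whether the paper computes its numbers exactly this way is not stated in this summary, and the function names and toy data are illustrative only.

```python
import math

def ade(pred, truth):
    """Average Euclidean distance between matching time steps."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)

def fde(pred, truth):
    """Euclidean distance at the final time step."""
    return math.dist(pred[-1], truth[-1])

def best_of_k(hypotheses, truth):
    """Return (min ADE, min FDE) over the K candidate futures.

    The two minima are taken independently, so they may come from
    different hypotheses.
    """
    return (
        min(ade(h, truth) for h in hypotheses),
        min(fde(h, truth) for h in hypotheses),
    )

# Toy example with K = 3 hypotheses over a 3-step 2D future.
truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
hypotheses = [
    [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],  # offset by 1 at every step
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.5)],  # exact until a small final drift
    [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],  # stays at the origin
]
min_ade, min_fde = best_of_k(hypotheses, truth)
print(round(min_ade, 4), min_fde)  # 0.1667 0.5
```

With K=1 the metric reduces to plain ADE and FDE, which is why comparisons across methods are only meaningful at matching K.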
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Balance, Gait, and Falls Prevention
