Social-Transmotion: Promptable Human Trajectory Prediction
Saeed Saadatnejad, Yang Gao, Kaouther Messaoud, Alexandre Alahi

TL;DR
Social-Transmotion is a Transformer-based model that leverages visual cues and prompts like coordinates and poses to improve human trajectory prediction, demonstrating flexibility and effectiveness across multiple datasets.
Contribution
We introduce a novel prompt-based Transformer model for human trajectory prediction that exploits diverse visual cues and masking techniques for improved accuracy.
Findings
Enhanced prediction accuracy on multiple datasets
Flexibility in using 2D and 3D pose prompts
Identification of key spatial and temporal cues
Abstract
Accurate human trajectory prediction is crucial for applications such as autonomous vehicles, robotics, and surveillance systems. Yet existing models often fail to fully leverage the non-verbal social cues that humans subconsciously communicate when navigating a space. To address this, we introduce Social-Transmotion, a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior. We translate the idea of a prompt from Natural Language Processing (NLP) to the task of human trajectory prediction, where a prompt can be a sequence of x-y coordinates on the ground, bounding boxes in the image plane, or body pose keypoints in either 2D or 3D. This, in turn, augments trajectory data, leading to enhanced human trajectory prediction. Using a masking technique, our model exhibits flexibility and adaptability by capturing spatiotemporal interactions between…
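The prompt idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the modality names, the 17-keypoint skeleton, the `drop_prob` parameter, and the choice to always keep the ground trajectory are assumptions made here for illustration. It shows only how a per-person prompt made of several visual cues might have whole modalities masked out, so that a model sees varying subsets of cues.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical modality names and per-frame feature sizes (assumptions, not from the paper).
MODALITY_DIMS: Dict[str, int] = {
    "trajectory_xy": 2,   # x-y coordinates on the ground
    "bbox_2d": 4,         # bounding box in the image plane
    "pose_2d": 2 * 17,    # 2D keypoints; 17 joints is an assumed skeleton size
    "pose_3d": 3 * 17,    # 3D keypoints
}

Frame = List[float]
Prompt = Dict[str, Optional[List[Frame]]]
Token = Tuple[str, int, Frame]


def make_dummy_prompt(num_frames: int) -> Prompt:
    """Build a placeholder prompt with zero-valued features for every modality."""
    return {
        name: [[0.0] * dim for _ in range(num_frames)]
        for name, dim in MODALITY_DIMS.items()
    }


def mask_modalities(
    prompt: Prompt,
    drop_prob: float,
    rng: random.Random,
    always_keep: str = "trajectory_xy",  # assumption: the trajectory is never dropped
) -> Prompt:
    """Return a copy of the prompt with whole modalities randomly set to None."""
    masked: Prompt = {}
    for name, frames in prompt.items():
        drop = name != always_keep and frames is not None and rng.random() < drop_prob
        masked[name] = None if drop else frames
    return masked


def to_tokens(prompt: Prompt) -> List[Token]:
    """Flatten the available modalities into (modality, time step, features) tokens."""
    tokens: List[Token] = []
    for name, frames in prompt.items():
        if frames is None:  # masked or unavailable cue: contributes no tokens
            continue
        for t, features in enumerate(frames):
            if len(features) != MODALITY_DIMS[name]:
                raise ValueError(f"unexpected feature size for {name}")
            tokens.append((name, t, features))
    return tokens


if __name__ == "__main__":
    prompt = make_dummy_prompt(num_frames=3)
    masked = mask_modalities(prompt, drop_prob=0.5, rng=random.Random(0))
    print([name for name, frames in masked.items() if frames is not None])
    print(len(to_tokens(masked)))
```

In this sketch a masked modality simply produces no tokens, so the same downstream model could be fed a trajectory-only prompt or a prompt enriched with boxes and poses.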
Peer Reviews
Decision · ICLR 2024 poster
1) The paper is well written and easy to understand. 2) The paper presents a good ablation analysis based on the input modalities used for the task of trajectory prediction. 3) The paper evaluates results on various publicly available datasets.
1) The paper lacks novelty, as the use of Transformers with multi-modal inputs for the task of trajectory prediction has already been studied extensively, including in works like Wayformer. 2) The paper does not use larger industrial-academic datasets like Argoverse or the Waymo Open Motion Dataset to compare results against other Transformer-based benchmarks popular in the field today. Given the above two major weaknesses, even with the nice experimental section and exhaustive results it is
- The model can handle different "types" of motion predictions [pose, position, bounding box] - The selective masking technique, which can be seen as a form of data balancing, is employed for better results - The latent input is a nice approach for these problems; a similar encoding can be found in [1], where an encoder was used to generate a codebook. - The discussion section is rich. For example, the analysis of imperfect data with degradation percentage is valuable for the domain. The mas
- In the introduction, it was mentioned that "traditional predictors have limited performance, as they typically rely on a single data point per person (i.e., their x-y coordinates on the ground) as input." This cannot be a general statement, as works such as [1] and those following it do tackle the point using such cues. Also, it seems this work is not mentioned in the related work section. - In the related work section there needs to be a balance between the 3 modes supported in the work, where t
- Incorporation of Visual Cues: The model effectively utilizes additional visual cues like 3D poses and bounding boxes, which is a significant advancement over traditional trajectory-only models. - Robustness: The model demonstrates robust performance even when faced with incomplete or noisy input data, showcasing its reliability for real-world applications. - Performance: Social-Transmotion outperforms various state-of-the-art models and its own ablated versions, indicating its effectiveness
- Lack of Commonly Used Benchmarks: Some commonly used datasets, such as nuScenes, Argoverse 1/2, and the Waymo Open Motion Dataset, are not used. These datasets are often used to evaluate the performance of trajectory prediction methods. - Complexity: The inclusion of various visual cues and a dual-transformer architecture might make the model computationally intensive, potentially limiting its applicability in resource-constrained environments. - Dependence on Accurate Pose Estimation: The model'
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Autonomous Vehicle Technology and Safety
Methods: Sparse Evolutionary Training
