Learning Priors of Human Motion With Vision Transformers
Placido Falqueto, Alberto Sanfeliu, Luigi Palopoli, Daniele Fontanelli

TL;DR
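This paper introduces a neural architecture based on Vision Transformers (ViTs) for learning where and how people move in a scene, and reports improved metrics over a CNN-based method on a standard dataset. Target applications include urban mobility studies and robot navigation in human-populated environments.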
Contribution
The paper presents a ViT-based architecture for modeling human motion priors (usual paths, speeds, and stopping locations) and reports improved metrics over a CNN-based method.
Findings
The proposed ViT architecture improves the evaluation metrics compared to a CNN-based method.
The authors argue that ViTs capture spatial correlations more effectively than CNNs.
Experimental results are reported on a standard dataset.
Abstract
A clear understanding of where humans move in a scenario, their usual paths and speeds, and where they stop is very important for applications such as mobility studies in urban areas or robot navigation tasks within human-populated environments. In this article, we propose a neural architecture based on Vision Transformers (ViTs) to provide this information. This solution can arguably capture spatial correlations more effectively than Convolutional Neural Networks (CNNs). We describe the methodology and the proposed neural architecture, and we present experimental results on a standard dataset. We show that the proposed ViT architecture improves the metrics compared to a method based on a CNN.
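To illustrate why a ViT can relate distant regions of a scene in a single layer, the following is a minimal sketch of patch tokenization followed by one self-attention head, written in plain Python. It is not the authors' architecture: the abstract does not specify the input representation, so the 8x8 occupancy grid, the patch size, the embedding width, the random weights, and all function names here are assumptions made for illustration. Position embeddings, multiple heads, and training are omitted.

```python
import math
import random

def split_into_patches(grid, patch):
    """Cut a 2D grid (list of rows) into flattened, non-overlapping patch vectors."""
    h, w = len(grid), len(grid[0])
    assert h % patch == 0 and w % patch == 0, "grid must tile evenly"
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            patches.append([grid[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Each row holds one token's attention distribution over every token.
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters (illustrative only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)

# Hypothetical 8x8 occupancy grid: 1.0 = obstacle (border wall), 0.0 = free space.
grid = [[1.0 if (r in (0, 7) or c in (0, 7)) else 0.0 for c in range(8)]
        for r in range(8)]

patches = split_into_patches(grid, patch=4)   # 4 tokens, 16 values each
dim = 8                                       # assumed embedding width
tokens = matmul(patches, random_matrix(16, dim, rng))   # linear patch embedding
wq, wk, wv = (random_matrix(dim, dim, rng) for _ in range(3))
out, weights = self_attention(tokens, wq, wk, wv)
```

Every row of `weights` assigns a nonzero weight to every patch, including those on the opposite side of the grid, so information from the whole scene can reach each token in one layer. A convolution with a small kernel, by contrast, only mixes neighboring cells per layer and needs depth to cover the same distance.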
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Robot Manipulation and Learning
