HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning

Xiaodong Mei; Sheng Wang; Jie Cheng; Yingbing Chen; Dan Xu

arXiv:2505.15703 · cs.CV · May 22, 2025

TL;DR

HAMF is a hybrid attention and Mamba framework that jointly models scene context and future motion for trajectory prediction in autonomous driving, reporting state-of-the-art results on the Argoverse 2 benchmark.

Contribution

The paper presents a framework that learns scene understanding and motion prediction jointly, using attention mechanisms and a Mamba module to improve the accuracy and diversity of its forecasts.

Findings

01 Achieves state-of-the-art performance on the Argoverse 2 benchmark.

02 Effectively combines self-attention and cross-attention for scene and motion modeling.

03 Demonstrates a lightweight architecture with high accuracy.

Abstract

Motion forecasting is a critical challenge in autonomous driving systems, requiring accurate prediction of surrounding agents' future trajectories. Existing approaches predict future motion states from scene context features extracted from historical agent trajectories and road layouts, but they suffer from information degradation during scene feature encoding. To address this limitation, we propose HAMF, a novel motion forecasting framework that learns future motion representations jointly with the scene context encoding, coherently combining scene understanding and future motion state prediction. We first embed the observed agent states and map information into 1D token sequences, together with the target multi-modal future motion features as a set of learnable tokens. Then we design a unified attention-based encoder, which synergistically combines…
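The token layout described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it specifies the encoder, so the sketch only shows agent tokens, map tokens, and future-motion tokens being concatenated into one 1D sequence and passed through a single plain self-attention step. The token counts, the dimension `DIM`, the random vectors standing in for learned embeddings, and the identity query/key/value projections are all assumptions made for illustration, and the Mamba module is not modeled.

```python
import math
import random


def make_tokens(count, dim, rng):
    """Return `count` random vectors of length `dim` standing in for embedded tokens."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value projections, to keep the sketch short.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in tokens:
        # Similarity of this token to every token in the joint sequence.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Weighted average of all tokens, one output coordinate at a time.
        mixed = [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        output.append(mixed)
    return output


rng = random.Random(0)
DIM = 8  # hypothetical embedding size

agent_tokens = make_tokens(4, DIM, rng)  # hypothetical: 4 observed agents
map_tokens = make_tokens(6, DIM, rng)    # hypothetical: 6 map elements
mode_tokens = make_tokens(3, DIM, rng)   # hypothetical: 3 future-motion tokens (learnable in a real model)

# One joint 1D sequence: scene tokens and future-motion tokens are encoded together.
sequence = agent_tokens + map_tokens + mode_tokens
encoded = self_attention(sequence)

# The last entries correspond to the future-motion tokens after mixing with scene context.
mode_features = encoded[-len(mode_tokens):]
```

In this layout the future-motion tokens attend to the scene tokens inside the same encoder pass, which is one simple way to read "learning future motion representations jointly with the scene context encoding"; a real model would use learned projections, multiple heads, and stacked layers.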

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods

Methods: Sparse Evolutionary Training · Mamba: Linear-Time Sequence Modeling with Selective State Spaces