PatchTraj: Unified Time-Frequency Representation Learning via Dynamic Patches for Trajectory Prediction

Yanghong Liu; Xingping Dong; Ming Li; Weixing Zhang; Yidong Lou

arXiv:2507.19119·cs.CV·August 1, 2025

Open Access · 4 Reviews

TL;DR

PatchTraj introduces a novel dynamic patch-based framework that jointly models time and frequency components of trajectories, capturing hierarchical motion patterns for improved prediction accuracy in autonomous driving and robotics.

Contribution

The paper presents a unified time-frequency representation learning method using dynamic patches, enhancing trajectory prediction by capturing multi-scale motion dynamics and integrating spectral information.

Findings

01. Achieves state-of-the-art performance on multiple datasets.

02. Significant improvements in ADE and FDE on the JRDB dataset.

03. Effective fusion of temporal and spectral cues via cross-modal attention.
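ADE (average displacement error) and FDE (final displacement error), the two metrics named in the findings, can be sketched as follows. This is a minimal illustration using the standard definitions, not the paper's evaluation code; the function names `displacement_errors` and `best_of_k` are hypothetical, and the best-of-K variant is a common convention for stochastic predictors that this page does not confirm PatchTraj uses.

```python
import math


def displacement_errors(pred, truth):
    """Return (ADE, FDE) for one predicted trajectory.

    pred, truth: equal-length, non-empty lists of (x, y) positions.
    ADE is the mean Euclidean error over all time steps;
    FDE is the Euclidean error at the last time step.
    """
    if not pred or len(pred) != len(truth):
        raise ValueError("pred and truth must be non-empty and equally long")
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


def best_of_k(samples, truth):
    """Minimum ADE and minimum FDE over K sampled predictions.

    Assumption: the best-of-K protocol is illustrative only; the two
    minima are taken independently, as is commonly done.
    """
    errors = [displacement_errors(s, truth) for s in samples]
    return min(e[0] for e in errors), min(e[1] for e in errors)


truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, truth)  # per-step errors are 0, 1, 2
```

Because FDE looks only at the endpoint, it is the more sensitive of the two to long-horizon drift.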

Abstract

Pedestrian trajectory prediction is crucial for autonomous driving and robotics. However, existing point-based and grid-based methods have two main limitations: they insufficiently model human motion dynamics, failing to balance local motion details with long-range spatiotemporal dependencies, and their time-domain representations lack interaction with the corresponding frequency components when jointly modeling trajectory sequences. To address these challenges, we propose PatchTraj, a dynamic patch-based framework that integrates time-frequency joint modeling for trajectory prediction. Specifically, we decompose the trajectory into raw time sequences and frequency components, and employ dynamic patch partitioning to perform multi-scale segmentation, capturing hierarchical motion patterns. Each patch undergoes adaptive embedding with scale-aware feature extraction, followed by hierarchical feature…
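The decomposition and multi-scale segmentation described in the abstract can be illustrated with a small sketch. It rests on several assumptions: the frequency components are taken as an orthonormal DCT-II (a reviewer notes the paper employs the Discrete Cosine Transform, but the exact variant is not stated here), the patch lengths `(2, 4, 8)` and half-overlap stride are arbitrary fixed choices rather than the paper's dynamic partitioning, and the names `dct2`, `make_patches`, and `multi_scale_patches` are hypothetical.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence (assumed transform variant)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def make_patches(seq, patch_len, stride):
    """Slice seq into fixed-length, possibly overlapping patches."""
    return [
        seq[i:i + patch_len]
        for i in range(0, len(seq) - patch_len + 1, stride)
    ]


def multi_scale_patches(seq, patch_lens=(2, 4, 8)):
    """Patch the same sequence at several scales.

    Assumption: fixed patch lengths with 50% overlap stand in for the
    paper's dynamic partitioning, which is not specified on this page.
    """
    return {
        p: make_patches(seq, p, stride=max(1, p // 2))
        for p in patch_lens
        if p <= len(seq)
    }


# One coordinate of an 8-step trajectory (constant velocity, for illustration).
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
freq = dct2(xs)                           # frequency-domain view of xs
time_patches = multi_scale_patches(xs)    # time branch input
freq_patches = multi_scale_patches(freq)  # frequency branch input
```

Short patches preserve local motion detail while long patches summarize the overall trend; in the framework described above, each patch would then be embedded and the two branches fused.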

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The idea of fusing time and frequency is well motivated and intuitive.
2. The architecture is clearly presented and well motivated.
3. The results across multiple datasets are strong and state of the art.

Weaknesses

1. The method puts together several known ideas into one large system, laying out a multi-module pipeline. It is more like a comprehensive engineering package than an innovative idea.
2. The ablation study shows additive gains as each module is turned on. It does not show which subset has the better accuracy-per-cost tradeoff. If we add more modules, would the performance be even better? In this sense, we'd like to understand how much additional cost, e.g. inference latency, the pipeline incurs rela

Reviewer 02 · Rating 2 · Confidence 5

Strengths

- The paper is clearly written. The problem statement, motivation and solution are explained well.
- The core idea is novel. The experimental setup is described clearly and organized coherently.
- Hyperparameter choices and implementation details are provided in the appendix, which, together with the released code, supports reproducibility.

Weaknesses

- The authors claim that incorporating frequency data improves long-term dependency modeling, yet the FDE gains from adding the F-branch (Table 2) are limited compared to this claim.
- The paper emphasizes noise robustness, but this is not evaluated.
- More ablations would be beneficial: variants that (i) assign patch sizes randomly and (ii) restrict to a small set of reasonable patch sizes.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The model achieves state-of-the-art performance on four diverse datasets, outperforming recent baselines like NMRF (2025). It's interesting to see that both ADE/FDE and JADE/JFDE are reported for some datasets.
2. A detailed ablation study systematically evaluates the contribution of each major component of the proposed architecture, and the paper discusses the effect of K and joint loss variants.
3. The motivation to better unify local and global context and to explore the underutil

Weaknesses

1. The presentation needs a lot of improvement. E.g., in Figure 1, it's really difficult to understand the proposed Dynamic patching. What are s1, s2, and sM? What's the meaning of different rectangles? In row 083, what is DPAttn? The author should at least show the full name of the module when using it for the first time. Figure 2 is inconsistent; it is very difficult to understand how the two proposed branches are used in the right high-level architecture. What is the dynamic patch in t

Reviewer 04 · Rating 4 · Confidence 4

Strengths

- This study faithfully followed the experimental protocols of existing human trajectory prediction models. It utilized major benchmark datasets (JRDB, NBA, SDD, ETH-UCY) and adopted standard evaluation metrics (e.g., ADE, FDE), with results effectively visualized.
- Compared to prior methods, it achieved state-of-the-art performance and enhanced credibility by providing the implementation code in the supplementary materials. Additionally, extensive ablation studies on model architecture, loss

Weaknesses

- While the paper compares PatchTraj with general human trajectory prediction methods, it lacks comparative experiments with conceptually or methodologically similar approaches. The authors mention TimesNet in line 145, yet no comparison is made with other models that also learn in the frequency domain (e.g., those based on FFT). Including such experiments would help validate the logical rationale for employing the Discrete Cosine Transform (DCT).
- Furthermore, the study provides insufficient

Taxonomy

Topics: Traffic Prediction and Management Techniques · Autonomous Vehicle Technology and Safety · Speech Recognition and Synthesis