HiT-JEPA: A Hierarchical Self-supervised Trajectory Embedding Framework for Similarity Computation
Lihuan Li, Hao Xue, Shuang Ao, Yang Song, Flora Salim

TL;DR
HiT-JEPA introduces a hierarchical self-supervised framework that learns multi-scale urban trajectory representations, effectively capturing both local details and global semantics for improved similarity computation.
Contribution
It presents a novel hierarchical architecture that unifies fine-grained and high-level trajectory features in a single model, advancing trajectory representation learning.
Findings
Outperforms existing methods on real-world datasets
Provides richer, multi-scale trajectory embeddings
Demonstrates effectiveness in similarity computation
Abstract
The representation of urban trajectory data plays a critical role in effectively analyzing spatial movement patterns. Despite considerable progress, designing trajectory representations that capture diverse and complementary information remains an open research problem. Existing methods struggle to incorporate fine-grained trajectory details and high-level summaries in a single model, limiting their ability to capture long-term dependencies while preserving local nuances. To address this, we propose HiT-JEPA (Hierarchical Interactions of Trajectory Semantics via a Joint Embedding Predictive Architecture), a unified framework for learning multi-scale urban trajectory representations across semantic abstraction levels. HiT-JEPA adopts a three-layer hierarchy that progressively captures point-level fine-grained details, intermediate patterns, and high-level…
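To make the three-layer hierarchy concrete, here is a minimal sketch of how a point sequence could be abstracted into progressively coarser levels. It is not the authors' implementation: the excerpt above does not specify the abstraction operator, so non-overlapping mean pooling with a window of 2 is an assumption, as are the hypothetical names `mean_pool` and `build_hierarchy`. The actual model operates on learned embeddings rather than raw coordinates.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mean_pool(points: List[Point], window: int = 2) -> List[Point]:
    """Average consecutive, non-overlapping windows of points.

    Assumption: mean pooling stands in for whatever abstraction
    operator the paper uses between levels.
    """
    pooled = []
    for start in range(0, len(points), window):
        chunk = points[start:start + window]
        x = sum(p[0] for p in chunk) / len(chunk)
        y = sum(p[1] for p in chunk) / len(chunk)
        pooled.append((x, y))
    return pooled


def build_hierarchy(points: List[Point]) -> Tuple[List[Point], List[Point], List[Point]]:
    """Return (point level, intermediate level, high level)."""
    level1 = list(points)       # fine-grained, point-level detail
    level2 = mean_pool(level1)  # intermediate patterns, half the length
    level3 = mean_pool(level2)  # high-level summary, a quarter of the length
    return level1, level2, level3


# Toy trajectory of eight (x, y) points.
trajectory = [
    (0.0, 0.0), (2.0, 0.0), (4.0, 2.0), (6.0, 2.0),
    (8.0, 4.0), (10.0, 4.0), (12.0, 6.0), (14.0, 6.0),
]
l1, l2, l3 = build_hierarchy(trajectory)
print(len(l1), len(l2), len(l3))
```

Each level keeps the overall shape of the path while discarding local detail, which is the trade-off a multi-scale representation is meant to expose.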
Peer Reviews
Decision: Submitted to ICLR 2026
S1. The paper is easy to read, and the argument that existing methods lack hierarchical representations is both interesting and important. S2. The methodology, although based on techniques used in previous studies, is sound and well-justified.
W1. Although it is interesting to consider different levels of trajectory semantics, the authors do not provide sufficient experimental evidence to show the effectiveness of their approach. For example, it is unclear which representation layers (S1, S2, S3, or their combination) are used for the trajectory similarity search experiments. These details appear to be missing. If the proposed framework indeed works as claimed, the different representations should capture distinct semantic meanings of
S1. HiT-JEPA proposes an explicit three-layer architecture that learns and aligns representations at point, segment, and trajectory levels. S2. The method implements a novel attention propagation strategy between abstraction levels: it upsamples higher-level attention maps to guide feature extraction at lower levels. S3. HiT-JEPA demonstrates exceptional zero-shot performance across heterogeneous datasets. The experimental design is rigorous and covers critical scenarios like self-similarity,
W1. This manuscript presents limited novelty compared with prior work. The JEPA and multi-scale structure build heavily on established elements in self-supervised learning, attention propagation, and urban trajectory modeling. The direct design leap over T-JEPA is incremental rather than a decisive paradigm shift, especially given already cited work such as HIBERT and HiCLRE in NLP and CV. So, this work simply applies these models to the trajectory dataset without making more targeted desi
- Addresses single-scale bias by introducing an explicit multi-scale hierarchy with information flow across levels. - Proposes JEPA-style training that reduces reliance on heavy data augmentation; ablations indicate the benefit of hierarchical interactions. - Extensive experiments across datasets and zero-shot settings support the generalization claims.
- Only two contrastive learning baselines are compared in the experiments. Many other trajectory representation learning methods should also be compared. - The interpretations of the results of HiT-JEPA are quite confusing. It would be better to explain why the two showcases illustrate the interpretation of the proposed method. - The results would change with a different trajectory abstraction method. From Equations (3)-(5), the sampling rates are 50% and 25% of the original trajectories, which may not
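The attention propagation that the reviews describe (upsampling higher-level attention maps to guide lower levels) can be sketched as follows. This is an illustrative sketch only: the excerpts do not say how the upsampled map is combined with lower-level attention, so nearest-neighbour upsampling followed by element-wise reweighting and row renormalisation is an assumption, and `upsample_nearest` and `guide_attention` are hypothetical names.

```python
from typing import List

Matrix = List[List[float]]


def upsample_nearest(attn: Matrix, factor: int = 2) -> Matrix:
    """Nearest-neighbour upsampling: repeat each entry `factor` times per axis."""
    rows, cols = len(attn), len(attn[0])
    return [
        [attn[i // factor][j // factor] for j in range(cols * factor)]
        for i in range(rows * factor)
    ]


def guide_attention(low: Matrix, high: Matrix, factor: int = 2) -> Matrix:
    """Reweight lower-level attention rows by the upsampled higher-level map.

    Assumption: element-wise product followed by row renormalisation;
    the paper's actual combination rule may differ.
    """
    up = upsample_nearest(high, factor)
    guided = []
    for low_row, up_row in zip(low, up):
        weighted = [a * b for a, b in zip(low_row, up_row)]
        total = sum(weighted)
        # Fall back to the unguided row if the guidance zeroes everything out.
        guided.append([w / total for w in weighted] if total > 0 else list(low_row))
    return guided


# A 2x2 higher-level attention map guiding a uniform 4x4 lower-level map.
high = [[0.75, 0.25], [0.5, 0.5]]
low = [[0.25] * 4 for _ in range(4)]
guided = guide_attention(low, high)
print(guided[0])
```

With a factor of 2, the higher level has half as many tokens as the lower one, matching the 50% rate the review mentions between adjacent levels.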
Taxonomy
Topics: Data Management and Algorithms · Human Mobility and Location-Based Analysis · Automated Road and Building Extraction
