A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

Fawad Javed Fateh; Ali Shah Ali; Murad Popattia; Usman Nizamani; Andrey Konin; M. Zeeshan Zia; Quoc-Huy Tran

arXiv:2604.15215 · cs.RO · May 18, 2026

TL;DR

This paper introduces HiST-AT, a hierarchical spatiotemporal action tokenizer for in-context imitation learning in robotics. It performs multi-level clustering of actions while simultaneously recovering the input actions and their timestamps, and it reports state-of-the-art results on multiple benchmarks.

Contribution

The paper proposes a hierarchical spatiotemporal action tokenizer built on two successive levels of vector quantization: the lower level assigns input actions to fine-grained subclusters, and the higher level maps those subclusters to clusters.

Findings

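01 The hierarchical tokenizer outperforms its non-hierarchical counterpart while relying mainly on spatial information, obtained by reconstructing input actions.

02 HiST-AT uses both spatial and temporal cues, recovering input actions together with their associated timestamps.

03 Achieves new state-of-the-art results on multiple benchmarks.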

Abstract

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real…
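
The two-level quantization described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `nearest_code` and `hierarchical_quantize`, the 7-dimensional action vectors, the codebook sizes (16 subclusters, 4 clusters), and the random codebooks are all assumptions chosen for illustration. The sketch also assumes that each level assigns by nearest neighbor in Euclidean distance, and it omits training and the timestamp reconstruction entirely.

```python
import numpy as np


def nearest_code(x, codebook):
    """Return, for each row of x, the index of the closest codebook row (L2)."""
    # Broadcast (N, 1, D) against (1, K, D) to get (N, K) squared distances.
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def hierarchical_quantize(actions, fine_codebook, coarse_codebook):
    """Two successive levels of vector quantization (illustrative only)."""
    # Lower level: assign each input action to a fine-grained subcluster.
    fine_idx = nearest_code(actions, fine_codebook)
    fine_vec = fine_codebook[fine_idx]
    # Higher level: map each subcluster embedding to a cluster.
    coarse_idx = nearest_code(fine_vec, coarse_codebook)
    coarse_vec = coarse_codebook[coarse_idx]
    return fine_idx, coarse_idx, fine_vec, coarse_vec


# Hypothetical data: 5 actions with 7 dimensions each, random codebooks.
rng = np.random.default_rng(0)
actions = rng.normal(size=(5, 7))
fine_codebook = rng.normal(size=(16, 7))    # assumed number of subclusters
coarse_codebook = rng.normal(size=(4, 7))   # assumed number of clusters

fine_idx, coarse_idx, fine_vec, coarse_vec = hierarchical_quantize(
    actions, fine_codebook, coarse_codebook
)
print("subcluster ids:", fine_idx)
print("cluster ids:   ", coarse_idx)
```

Because the higher level in this sketch operates on the subcluster embedding rather than on the raw action, any two actions that share a subcluster are guaranteed to share a cluster, which is what makes the assignment hierarchical.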
