Rethinking Point Clouds as Sequences: A Causal Next-Token Predictive Learning Framework

Yumeng Yao; Jingzhi Dong; Haowen Gu; Tao Chen; Zonghan Wu; Xiaoshui Huang; Yazhou Yao

arXiv:2605.17566·cs.CV·May 19, 2026

TL;DR

This paper introduces PointNTP, a novel causal, decoder-free pre-training framework for 3D point clouds that models structural dependencies directly in latent space using a Transformer-based sequence prediction approach.

Contribution

The paper reformulates point cloud pre-training as a causal next-token prediction task. It replaces reconstruction-based objectives with prediction in latent space, which the authors present as a scalable, modality-agnostic way to learn.

Findings

01. Achieves state-of-the-art results on multiple 3D classification and segmentation benchmarks.

02. Demonstrates the effectiveness of causal latent prediction over traditional reconstruction methods.

03. Provides a simple, scalable paradigm for self-supervised learning in 3D point clouds.

Abstract

With the rapid progress of multimodal foundation models and predictive pre-training, an important open question is how to equip 3D point clouds with a pre-training paradigm that is better aligned with next-token and next-embedding learning. Existing point-cloud self-supervised methods are largely built on masked reconstruction or explicit geometric generation, and thus remain tied to input recovery rather than predictive dependency modeling. In this paper, we introduce PointNTP, which reformulates point cloud pre-training as a fully causal, decoder-free latent Next-Token Prediction problem. Specifically, each point cloud is first partitioned into local patches and serialized into a structured 3D token sequence according to patch-center geometry. The resulting sequence is then modeled by a causal Transformer under prefix-only conditioning, and trained with a shift-based prediction…
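The pipeline the abstract outlines (patch partitioning, serialization by patch-center geometry, prefix-only conditioning, shifted targets) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption for illustration, not the authors' implementation: farthest point sampling and k-nearest-neighbor grouping for the patches, a Z-order (Morton) key as the serialization rule, and the hypothetical helper names `serialize`, `causal_mask`, and `shift_pairs`. The patch encoder, the Transformer, and the loss are omitted, so raw patches stand in for latent tokens.

```python
import math
import random

def farthest_point_sample(points, k):
    """Greedily pick k well-spread center indices (assumed patch-center rule)."""
    chosen = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        # Track each point's distance to its nearest chosen center.
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen

def knn_patch(points, center, size):
    """Return the `size` points nearest to `center` as one local patch."""
    return sorted(points, key=lambda p: math.dist(p, center))[:size]

def morton_key(center, lo, hi, bits=10):
    """Interleave quantized x, y, z bits into a Z-order key (assumed ordering)."""
    top = (1 << bits) - 1
    q = [min(int((c - l) / (h - l + 1e-9) * (1 << bits)), top)
         for c, l, h in zip(center, lo, hi)]
    key = 0
    for b in range(bits):
        for axis in range(3):
            key |= ((q[axis] >> b) & 1) << (3 * b + axis)
    return key

def serialize(points, num_patches=8, patch_size=16):
    """Partition a cloud into patches and order them by their center's key."""
    centers = [points[i] for i in farthest_point_sample(points, num_patches)]
    lo = [min(p[a] for p in points) for a in range(3)]
    hi = [max(p[a] for p in points) for a in range(3)]
    centers.sort(key=lambda c: morton_key(c, lo, hi))
    patches = [knn_patch(points, c, patch_size) for c in centers]
    return patches, centers

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def shift_pairs(tokens):
    """Shifted training pairs: the token at position t is paired with t + 1."""
    return tokens[:-1], tokens[1:]

random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(200)]
patches, centers = serialize(cloud)
inputs, targets = shift_pairs(patches)   # patches stand in for latent tokens
mask = causal_mask(len(inputs))          # prefix-only conditioning
```

In a real model the inputs would be patch embeddings, and the causal Transformer's output at position t would be compared against the embedding of token t + 1. Any locality-preserving ordering of patch centers could replace the Z-order key used here.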
