Fast 4D Mesh Generation by Spatio-Temporal Attention Chains
Dvir Samuel, Yuval Atzmon, Gal Chechik, Yoni Kasten

TL;DR
This paper introduces a fast, training-free 4D mesh generation method using Spatio-Temporal Attention Chains, significantly improving speed and scalability while enhancing temporal correspondence quality.
Contribution
The authors propose a novel framework that accelerates 4D mesh generation and scales to longer videos by leveraging latent space correspondences without explicit matching.
Findings
Generates 4D meshes in 9 seconds, 13x faster than previous methods.
Scales to videos 16 times longer without quality loss.
Enables competitive zero-shot 2D and 4D tracking and reliable camera estimation.
Abstract
4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain, which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor…
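The abstract describes the chain only at a high level. The sketch below is a minimal illustration under our own assumptions, not the authors' implementation. Each hop (anchor vertex to latent token, anchor token to frame-t token, frame-t token to frame-t vertex) is modeled as a row-stochastic softmax attention matrix over hypothetical feature arrays. Composing the hops by matrix multiplication gives each anchor vertex a distribution over frame-t vertices, with no explicit vertex-to-vertex matching step. The function and argument names (`attention_chain`, `anchor_tokens`, `temperature`, and so on) are hypothetical, and the way features would be extracted from a real 4D backbone is left unspecified.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, temperature=1.0):
    """Scaled dot-product attention weights; each query row sums to 1 over keys."""
    d = queries.shape[-1]
    scores = queries @ keys.T / (np.sqrt(d) * temperature)
    return softmax(scores, axis=-1)


def attention_chain(anchor_vert_feats, anchor_tokens, frame_tokens,
                    frame_vert_feats, frame_vert_pos, temperature=1.0):
    """Hypothetical chain: anchor vertices -> tokens -> frame-t tokens -> frame-t vertices.

    Shapes (all assumed): anchor_vert_feats (V_a, d), anchor_tokens (N_a, d),
    frame_tokens (N_t, d), frame_vert_feats (V_t, d), frame_vert_pos (V_t, 3).
    """
    # Hop 1: vertex-to-latent attention on the anchor frame, (V_a, N_a).
    a_vl = attention(anchor_vert_feats, anchor_tokens, temperature)
    # Hop 2: temporal correspondence between latent tokens, (N_a, N_t).
    a_tt = attention(anchor_tokens, frame_tokens, temperature)
    # Hop 3: latent-to-vertex attention on frame t, (N_t, V_t).
    a_lv = attention(frame_tokens, frame_vert_feats, temperature)
    # A product of row-stochastic matrices is row-stochastic, so each
    # anchor vertex gets a probability distribution over frame-t vertices.
    chain = a_vl @ a_tt @ a_lv                # (V_a, V_t)
    matches = chain.argmax(axis=1)            # hard correspondence per anchor vertex
    soft_pos = chain @ frame_vert_pos         # expected 3D position on frame t
    return chain, matches, soft_pos
```

As a toy check, passing the same sharply peaked features (for example `10 * np.eye(4)`) for every argument makes each hop close to the identity, so every anchor vertex maps to the frame-t vertex with the same index. Keeping the chain soft (`soft_pos`) rather than taking only the `argmax` is one way such a design could yield sub-vertex positions for tracking; whether the paper does this is not stated in the text above.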
