EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation
Yinrui Ren, Jinjing Zhu, Kanghao Chen, Zhuoxiao Li, Jing Ou, Zidong Cao, Tongyan Hua, Peilun Shi, Yingchun Fu, Wufan Zhao, Hui Xiong

TL;DR
EventVGGT introduces a spatio-temporal distillation framework for event-based depth estimation that leverages temporal priors from Vision Foundation Models (VFMs) to improve accuracy and temporal consistency in challenging conditions.
Contribution
To the authors' knowledge, EventVGGT is the first to distill spatio-temporal and multi-view geometric priors from VGGT into event-based depth estimation, explicitly modeling event streams as coherent video sequences.
Findings
Reduces depth error by over 53% at 30 m on EventScape.
Outperforms existing methods in accuracy and temporal consistency.
Demonstrates robust zero-shot generalization on unseen datasets.
Abstract
Event cameras offer superior sensitivity to high-speed motion and extreme lighting, making event-based monocular depth estimation a promising approach for robust 3D perception in challenging conditions. However, progress is severely hindered by the scarcity of dense depth annotations. While recent annotation-free approaches mitigate this by distilling knowledge from Vision Foundation Models (VFMs), a critical limitation persists: they process event streams as independent frames. By neglecting the inherent temporal continuity of event data, these methods fail to leverage the rich temporal priors encoded in VFMs, ultimately yielding temporally inconsistent and less accurate depth predictions. To address this, we introduce EventVGGT, a novel framework that explicitly models the event stream as a coherent video sequence. To the best of our knowledge, we are the first to distill…
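As a toy illustration of the limitation described above, the sketch below contrasts a frame-independent distillation loss with one that also compares frame-to-frame depth changes. It is not the EventVGGT objective, which the abstract does not specify: the function names, the L1 form of both terms, the `temporal_weight` parameter, and the tiny hand-written depth sequences are all assumptions made for illustration.

```python
from typing import List

# Hypothetical toy representation: one depth map flattened to a list of floats.
Frame = List[float]


def per_frame_l1(student: List[Frame], teacher: List[Frame]) -> float:
    """Mean absolute error, treating every frame independently."""
    total, count = 0.0, 0
    for s_frame, t_frame in zip(student, teacher):
        for s, t in zip(s_frame, t_frame):
            total += abs(s - t)
            count += 1
    return total / count


def temporal_l1(student: List[Frame], teacher: List[Frame]) -> float:
    """Mean absolute error between frame-to-frame depth changes."""
    total, count = 0.0, 0
    for k in range(1, len(student)):
        for i in range(len(student[k])):
            ds = student[k][i] - student[k - 1][i]  # student change over time
            dt = teacher[k][i] - teacher[k - 1][i]  # teacher change over time
            total += abs(ds - dt)
            count += 1
    return total / count


def sequence_distillation_loss(
    student: List[Frame], teacher: List[Frame], temporal_weight: float = 1.0
) -> float:
    """Per-frame term plus a weighted temporal term (assumed form)."""
    return per_frame_l1(student, teacher) + temporal_weight * temporal_l1(student, teacher)


# Teacher (e.g. a VFM run on video) predicts a static scene at 10 m.
teacher = [[10.0, 10.0], [10.0, 10.0], [10.0, 10.0], [10.0, 10.0]]
# Student A is biased by 1 m but stable over time.
steady = [[11.0, 11.0], [11.0, 11.0], [11.0, 11.0], [11.0, 11.0]]
# Student B has the same per-frame error but flickers between frames.
flicker = [[11.0, 11.0], [9.0, 9.0], [11.0, 11.0], [9.0, 9.0]]

print(per_frame_l1(steady, teacher), per_frame_l1(flicker, teacher))  # 1.0 1.0
print(sequence_distillation_loss(steady, teacher))   # 1.0
print(sequence_distillation_loss(flicker, teacher))  # 3.0
```

Both students look identical to the frame-independent loss, while the sequence-level loss separates them. That is the intuition behind treating the event stream as a video rather than as independent frames.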
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Memory and Neural Computing · Human Pose and Action Recognition
