$x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space
Ruishan Guo, Ciyu Ruan, Haoyang Wang, Zihang Gong, Jingao Xu, Xinlei Chen

TL;DR
This paper introduces $x^2$-Fusion, a unified framework that aligns multiple sensor modalities in an intrinsic edge space derived from event camera data, enabling improved dense 2D and 3D flow estimation.
Contribution
It proposes a novel Event Edge Space for cross-modality fusion, integrating image, LiDAR, and event data into a shared latent space for better scene flow estimation.
Findings
Achieves state-of-the-art accuracy on benchmarks.
Improves robustness in challenging scenarios.
Effectively couples 2D and 3D flow estimation.
Abstract
Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex. Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce $x^2$-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared…
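To make the idea of a "spatiotemporal edge signal" concrete, the following is a minimal sketch of one common way to turn a raw event stream into a time-binned count volume, where cells that receive many events tend to lie on moving edges. It is an illustration only and not the paper's Event Edge Space construction: the function name `events_to_edge_volume`, the `(t, x, y, polarity)` tuple layout, and the binning scheme are all assumptions introduced here.

```python
def events_to_edge_volume(events, height, width, num_bins):
    """Accumulate events into a [num_bins][height][width] count volume.

    Hypothetical helper: each event is assumed to be a tuple
    (t, x, y, polarity) with integer pixel coordinates.
    """
    volume = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return volume

    # Normalise timestamps to [0, 1] so they can be mapped to temporal bins.
    t_min = min(e[0] for e in events)
    t_max = max(e[0] for e in events)
    span = (t_max - t_min) or 1.0  # avoid division by zero for a single timestamp

    for t, x, y, _polarity in events:
        # Skip events that fall outside the sensor grid.
        if not (0 <= x < width and 0 <= y < height):
            continue
        # The latest timestamp maps to the last bin rather than overflowing.
        b = min(int((t - t_min) / span * num_bins), num_bins - 1)
        # Polarity is ignored: only the event count (edge activity) is kept.
        volume[b][y][x] += 1.0
    return volume


# Toy usage: four events on a 3x4 grid, two temporal bins.
events = [(0.0, 1, 1, 1), (0.5, 1, 1, -1), (1.0, 2, 1, 1), (1.0, 5, 5, 1)]
volume = events_to_edge_volume(events, height=3, width=4, num_bins=2)
```

In a learned pipeline, a volume like this would typically be fed to an encoder rather than used directly; how $x^2$-Fusion actually derives and uses its edge field is not specified in the truncated abstract above.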
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Memory and Neural Computing · Advanced Neural Network Applications
