HyperPointFormer: Multimodal Fusion in 3D Space with Dual-Branch Cross-Attention Transformers

Aldino Rizaldy; Richard Gloaguen; Fabian Ewald Fassnacht; Pedram Ghamisi

arXiv:2505.23206 · cs.CV · November 25, 2025

TL;DR

This paper introduces HyperPointFormer, a novel 3D multimodal fusion method that uses dual-branch cross-attention transformers to learn directly from raw point clouds. This enables flexible 3D predictions and improved land-cover classification.

Contribution

It presents a fully 3D multimodal fusion approach built on a dual-branch transformer with a cross-attention mechanism. This moves beyond traditional methods that rasterize 3D data into 2D.
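To make "dual-branch cross-attention" concrete, here is a minimal sketch in plain Python. It assumes one branch holds per-point geometric features and the other holds per-point spectral features, both already embedded to the same width. Each branch then queries the other with scaled dot-product attention, and a residual connection keeps each branch's own features. The function names are hypothetical. Learned Q/K/V projections, multiple heads, and normalization are deliberately omitted, so this is not the HyperPointFormer architecture, only an illustration of the cross-attention pattern.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention: queries from one branch, keys/values from the other.

    Simplification (assumption): Q, K and V projections are the identity, so
    keys and values are the raw context rows. Both inputs must share width d.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    out: Matrix = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Output row = attention-weighted sum of the context rows.
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def dual_branch_cross_attention(geo: Matrix, spec: Matrix) -> Tuple[Matrix, Matrix]:
    """Hypothetical dual-branch block: each branch attends to the other, plus a residual."""
    geo_upd = cross_attention(geo, spec)    # geometry queries spectral features
    spec_upd = cross_attention(spec, geo)   # spectra query geometric features
    geo_out = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(geo, geo_upd)]
    spec_out = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(spec, spec_upd)]
    return geo_out, spec_out
```

For example, `dual_branch_cross_attention(geo, spec)` with 3 geometric rows and 4 spectral rows of width 2 returns updated matrices of shape 3×2 and 4×2. Each branch keeps its own point count while mixing in information from the other modality.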

Findings

01

3D fusion achieves competitive land-cover classification results.

02

The method provides flexible 3D predictions that can be projected onto 2D maps.
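One simple way to turn per-point 3D predictions into a 2D land-cover map is to bin points into a regular grid and assign each cell a single label. The sketch below assumes a majority vote per cell, with ties broken toward the smaller class id. The paper's actual projection rule is not stated here, and alternatives such as taking the label of the highest point in each cell are equally plausible. `project_to_2d` is a hypothetical helper.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Cell = Tuple[int, int]


def project_to_2d(points: List[Point], labels: List[int], cell_size: float) -> Dict[Cell, int]:
    """Rasterize per-point class predictions onto a 2D grid (hypothetical helper).

    Each point (x, y, z) falls into cell (floor(x / cell_size), floor(y / cell_size)).
    Assumption: a cell takes the most frequent label among its points; ties
    resolve to the smallest label. Height z is ignored by this rule.
    """
    votes: Dict[Cell, Counter] = defaultdict(Counter)
    for (x, y, _z), label in zip(points, labels):
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        votes[cell][label] += 1
    grid: Dict[Cell, int] = {}
    for cell, counter in votes.items():
        best = max(counter.values())
        grid[cell] = min(lbl for lbl, c in counter.items() if c == best)
    return grid
```

With points `(0.2, 0.3, 5.0)`, `(0.7, 0.1, 1.0)`, `(0.9, 0.9, 2.0)` and `(1.5, 0.2, 0.0)` labeled `1, 2, 2, 3` and a 1 m cell, the first cell receives label 2 (two votes against one) and the second cell receives label 3. Only the 2D map is derived this way. The 3D per-point labels remain available for 3D analysis.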

03

The approach outperforms some existing 2D methods on benchmark datasets.

Abstract

Multimodal remote sensing data, including spectral data and lidar or photogrammetric data, is crucial for achieving satisfactory land-use/land-cover classification results in urban scenes. So far, most studies have been conducted in a 2D context. When 3D information is available in the dataset, it is typically integrated with the 2D data by rasterizing the 3D data into 2D formats. Although this approach yields satisfactory classification results, it falls short of fully exploiting the potential of 3D data, because it restricts the model's ability to learn 3D spatial features directly from raw point clouds. It also limits the generation of 3D predictions, since the dimensionality of the input data has been reduced. In this study, we propose a fully 3D-based method that fuses all modalities within the 3D point cloud and employs a dedicated dual-branch Transformer model to simultaneously learn…
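Fusing "all modalities within the 3D point cloud" means each point must carry its spectral values alongside its coordinates. The sketch below shows one common way to do this: look up the raster pixel under each point's (x, y) and append that pixel's band values to the point. It assumes a north-up raster with no rotation, an upper-left origin `(x0, y0)`, square pixels, and points in the same coordinate system. The helper name `attach_spectra` is hypothetical, and the excerpt does not say how the authors sampled the spectral data.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Raster = Sequence[Sequence[Sequence[float]]]  # raster[row][col] -> list of band values


def attach_spectra(points: List[Point], raster: Raster, x0: float, y0: float,
                   pixel_size: float) -> List[Optional[List[float]]]:
    """Append the band values of the pixel under each point (hypothetical helper).

    Assumptions: north-up raster, no rotation, (x0, y0) is the upper-left
    corner, and points share the raster's coordinate system.
    Returns [x, y, z, band_1, ..., band_k] per point, or None if the point
    falls outside the raster footprint.
    """
    n_rows = len(raster)
    n_cols = len(raster[0])
    fused: List[Optional[List[float]]] = []
    for x, y, z in points:
        col = math.floor((x - x0) / pixel_size)
        row = math.floor((y0 - y) / pixel_size)  # rows increase southward
        if 0 <= row < n_rows and 0 <= col < n_cols:
            fused.append([x, y, z] + list(raster[row][col]))
        else:
            fused.append(None)  # no spectral coverage for this point
    return fused
```

Nearest-pixel lookup keeps the sketch short. Bilinear interpolation, or handling several points that share one pixel (for example, roof and ground points below an overhang), are refinements a real pipeline might need.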

Code & Models

Repositories

aldinorizaldy/hyperpointformer (official)

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Layer Normalization · Byte Pair Encoding