FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder

Wei Li; Yufan Ren; Hanqing Jiang; Jianhui Ding; Zhen Peng; Leman Feng; Yichun Shentu; Guoqiang Xu; Baigui Sun

arXiv:2604.02583·cs.CV·April 6, 2026

TL;DR

FusionBERT introduces a multi-view visual fusion framework with cross-attention and normal-aware 3D encoding, significantly improving image-3D retrieval accuracy in realistic multi-view scenarios.

Contribution

The paper presents a multi-view visual aggregator and a normal-aware 3D encoder, which improve feature fusion and geometric representation for image-3D retrieval.
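The listing does not say how surface normals enter the 3D encoder. As a minimal sketch of what "normal-aware" input could look like, the example below samples one point per triangle of a mesh and attaches the unit face normal to its position, giving a 6-D feature per point. The function name `normal_aware_points`, the helper names, and the choice of one centroid sample per face are all illustrative assumptions, not the paper's method.

```python
import math

def sub(a, b):
    """Component-wise difference of two 3-D vectors."""
    return [a[0] - b[0], a[1] - b[1], a[2] - b[2]]

def cross(a, b):
    """Cross product of two 3-D vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]

def unit(v):
    """Scale v to unit length; a zero vector is returned unchanged as zeros."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else [0.0, 0.0, 0.0]

def normal_aware_points(vertices, faces):
    """Return one [x, y, z, nx, ny, nz] feature per triangle (hypothetical layout).

    Assumption: each face contributes its centroid as the sampled point and
    its unit face normal as the geometric cue. No color or texture is used.
    """
    features = []
    for i, j, k in faces:
        a, b, c = vertices[i], vertices[j], vertices[k]
        normal = unit(cross(sub(b, a), sub(c, a)))
        centroid = [(a[t] + b[t] + c[t]) / 3.0 for t in range(3)]
        features.append(centroid + normal)
    return features

# A single triangle lying in the z = 0 plane, wound counter-clockwise.
vertices = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
faces = [(0, 1, 2)]
features = normal_aware_points(vertices, faces)
```

Because the normal depends only on geometry, a feature of this kind stays available when a model has no texture, which is the situation the third finding refers to.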

Findings

01. FusionBERT outperforms state-of-the-art models in retrieval accuracy.

02. Multi-view fusion improves robustness over single-view methods.

03. Normal-aware 3D encoding benefits textureless or degraded models.

Abstract

We propose FusionBERT, a novel multi-view visual fusion framework for image-3D multimodal retrieval. Existing image-3D representation learning methods predominantly focus on feature alignment of a single object image and its 3D model, limiting their applicability in realistic scenarios where an object is typically observed and captured from multiple viewpoints. Although multi-view observations naturally provide complementary geometric and appearance cues, existing multimodal large models rarely explore how to effectively fuse such multi-view visual information for better cross-modal retrieval. To address this limitation, we introduce a multi-view image-3D retrieval framework named FusionBERT, which innovatively utilizes a cross-attention-based multi-view visual aggregator to adaptively integrate features from multi-view images of an object. The proposed multi-view visual encoder fuses…
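The abstract is truncated before it details the aggregator, so the following is only a rough sketch of the general idea of cross-attention fusion over views, not FusionBERT's actual architecture. It assumes a single attention head, one fusion query vector, and no learned projections; in a real model the query and the key/value projections would be trained. The names `fuse_views`, `query`, and `views` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def fuse_views(query, view_features):
    """Fuse per-view feature vectors into one vector via cross-attention.

    query: hypothetical fusion query of length d (learned in a real model).
    view_features: list of per-view feature vectors of length d, used here
        as both keys and values (assumption: no learned projections).
    Returns the fused vector and the attention weight given to each view.
    """
    d = len(query)
    # Scaled dot-product score between the query and every view.
    scores = [dot(query, v) / math.sqrt(d) for v in view_features]
    weights = softmax(scores)
    # Weighted sum of the view features, one output dimension at a time.
    fused = [
        sum(w * v[i] for w, v in zip(weights, view_features))
        for i in range(d)
    ]
    return fused, weights

# Three toy view features; the third agrees best with the query.
views = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
query = [2.0, 2.0, 0.0, 0.0]
fused, weights = fuse_views(query, views)
```

The weights are computed per input, so views that match the query contribute more to the fused vector. That is the sense in which attention-based fusion is adaptive, in contrast to plain averaging, which gives every view the same weight.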
