FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder
Wei Li, Yufan Ren, Hanqing Jiang, Jianhui Ding, Zhen Peng, Leman Feng, Yichun Shentu, Guoqiang Xu, Baigui Sun

TL;DR
FusionBERT introduces a multi-view visual fusion framework with cross-attention and normal-aware 3D encoding, significantly improving image-3D retrieval accuracy in realistic multi-view scenarios.
Contribution
The paper presents a multi-view visual aggregator and a normal-aware 3D encoder, which improve feature fusion and geometric representation for image-3D retrieval.
Findings
FusionBERT outperforms state-of-the-art models in retrieval accuracy.
Multi-view fusion improves robustness over single-view methods.
Normal-aware 3D encoding benefits textureless or degraded models.
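To make "normal-aware" concrete, the sketch below shows one common way to attach surface normals to 3D points before encoding: each triangle contributes its centroid plus its unit face normal, giving a 6-D feature that still carries shape information when texture is missing. This is an illustrative assumption, not the paper's encoder; the summary does not describe how FusionBERT computes or consumes normals, and the function names `face_normal` and `normal_aware_point` are hypothetical.

```python
import math


def face_normal(a, b, c):
    """Unit normal of triangle (a, b, c) via the cross product of two edges."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    n = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    length = math.sqrt(sum(x * x for x in n))
    if length == 0.0:
        raise ValueError("degenerate triangle has no normal")
    return [x / length for x in n]


def normal_aware_point(a, b, c):
    """Hypothetical 6-D input feature: triangle centroid (xyz) + unit normal."""
    centroid = [(a[i] + b[i] + c[i]) / 3.0 for i in range(3)]
    return centroid + face_normal(a, b, c)


# A triangle lying in the z = 0 plane; its normal points along +z.
triangle = ([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
feature = normal_aware_point(*triangle)
```

The normal depends only on geometry, so the feature is unchanged if the model's texture is removed or degraded, which is the intuition behind the finding above.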
Abstract
We propose FusionBERT, a novel multi-view visual fusion framework for image-3D multimodal retrieval. Existing image-3D representation learning methods predominantly focus on feature alignment of a single object image and its 3D model, limiting their applicability in realistic scenarios where an object is typically observed and captured from multiple viewpoints. Although multi-view observations naturally provide complementary geometric and appearance cues, existing multimodal large models rarely explore how to effectively fuse such multi-view visual information for better cross-modal retrieval. To address this limitation, we introduce a multi-view image-3D retrieval framework named FusionBERT, which innovatively utilizes a cross-attention-based multi-view visual aggregator to adaptively integrate features from multi-view images of an object. The proposed multi-view visual encoder fuses…
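The idea of adaptively integrating features from several views with cross-attention can be sketched as follows. This is a minimal single-head example under stated assumptions: the query vector stands in for a learned fusion token, the learned query/key/value projections are omitted (treated as identity), and the function name `fuse_views` is hypothetical. It is not the FusionBERT architecture, whose details are not given in this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(query, view_features):
    """Single-head cross-attention: one query attends over per-view features.

    Assumption: identity projections, so keys and values are the view
    features themselves. Returns the fused vector and the attention weights.
    """
    dim = len(query)
    # Scaled dot-product score between the query and each view feature.
    scores = [
        sum(q * k for q, k in zip(query, feat)) / math.sqrt(dim)
        for feat in view_features
    ]
    weights = softmax(scores)
    # Weighted sum of the view features, one output dimension at a time.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, view_features))
        for i in range(dim)
    ]
    return fused, weights


# Three toy 4-D view features; the third agrees most with the query.
views = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
query = [1.0, 1.0, 0.0, 0.0]
fused, weights = fuse_views(query, views)
```

Because the weights come from a softmax over the views, the fused vector is a convex combination of the per-view features, and views that match the query more closely contribute more. The same function works for any number of views.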
