SIM-Net: A Multimodal Fusion Network Using Inferred 3D Object Shape Point Clouds from RGB Images for 2D Classification

Youcef Sklab; Hanane Ariouat; Eric Chenin; Edi Prifti; Jean-Daniel Zucker

arXiv:2506.18683·cs.CV·June 24, 2025

TL;DR

SIM-Net introduces a novel multimodal architecture that infers 3D point clouds from RGB images to improve 2D classification, especially for challenging herbarium specimen images with occlusions and heterogeneous backgrounds.

Contribution

The paper presents a new pixel-to-point transformation method and a fusion architecture combining 2D and 3D features for enhanced image classification performance.
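As a rough illustration of what a pixel-to-point transformation could look like, the sketch below samples foreground pixels from a binary object mask and turns them into a normalized point cloud. This is not the authors' method: the function name `mask_to_point_cloud`, the fixed sample size, and above all the choice to set every z coordinate to 0 are assumptions made only for illustration, since this page does not describe how the paper assigns the third coordinate.

```python
import numpy as np


def mask_to_point_cloud(mask, num_points=1024, seed=0):
    """Turn a binary mask of shape (H, W) into a (num_points, 3) point cloud.

    Assumption for illustration only: every point gets z = 0, so the cloud
    is planar. The actual rule used in the paper is not given on this page.
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)  # row and column indices of foreground pixels
    if ys.size == 0:
        raise ValueError("mask contains no foreground pixels")

    # Sample a fixed number of pixels; reuse pixels only if the mask is small.
    idx = rng.choice(ys.size, size=num_points, replace=ys.size < num_points)
    pts = np.stack([xs[idx], ys[idx], np.zeros(num_points)], axis=1)
    pts = pts.astype(np.float32)

    # Center the cloud and scale it into the unit sphere, a common
    # normalization for point cloud encoders.
    pts -= pts.mean(axis=0)
    scale = np.linalg.norm(pts, axis=1).max()
    if scale > 0:
        pts /= scale
    return pts


# Toy mask: a filled rectangle standing in for a segmented specimen.
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 20:40] = True
cloud = mask_to_point_cloud(mask)
print(cloud.shape)  # (1024, 3)
```

Sampling a fixed number of points keeps the input size constant for a downstream point cloud encoder, whatever the area of the mask.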

Findings

01 Outperforms ResNet101 with an accuracy gain of up to 9.9%

02 Achieves a 12.3% higher F-score than baseline models

03 Surpasses transformer-based architectures in herbarium specimen classification

Abstract

We introduce the Shape-Image Multimodal Network (SIM-Net), a novel 2D image classification architecture that integrates 3D point cloud representations inferred directly from RGB images. Our key contribution lies in a pixel-to-point transformation that converts 2D object masks into 3D point clouds, enabling the fusion of texture-based and geometric features for enhanced classification performance. SIM-Net is particularly well-suited for the classification of digitized herbarium specimens, a task made challenging by heterogeneous backgrounds, non-plant elements, and occlusions that compromise conventional image-based models. To address these issues, SIM-Net employs a segmentation-based preprocessing step to extract object masks prior to 3D point cloud generation. The architecture comprises a CNN encoder for 2D image features and a PointNet-based encoder for geometric features, which are…
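The two-encoder design described in the abstract can be sketched in a few lines of NumPy. The abstract is cut off before it says how the two feature sets are combined, so the concatenation used here is an assumption, as are the feature sizes, the random weights, and the helper names `pointnet_style_encode` and `fuse_and_classify`. The image feature vector is a random stand-in for the output of a CNN encoder.

```python
import numpy as np


def pointnet_style_encode(points, w1, w2):
    """PointNet-style encoding: a shared per-point MLP, then a max pool.

    The max over points is symmetric, so the result does not depend on
    the order in which the points are listed.
    """
    h = np.maximum(points @ w1, 0.0)  # shared layer 1 + ReLU, shape (N, 64)
    h = np.maximum(h @ w2, 0.0)       # shared layer 2 + ReLU, shape (N, 128)
    return h.max(axis=0)              # global shape feature, shape (128,)


def fuse_and_classify(image_feat, shape_feat, w_cls):
    """Assumed fusion: concatenate both features, then a linear softmax head."""
    fused = np.concatenate([image_feat, shape_feat])
    logits = fused @ w_cls
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()


rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 3))   # stand-in for an inferred point cloud
image_feat = rng.normal(size=128)     # stand-in for a CNN encoder output

# Hypothetical, untrained weights; a real model would learn these.
w1 = rng.normal(size=(3, 64))
w2 = rng.normal(size=(64, 128))
w_cls = rng.normal(size=(256, 5)) * 0.01  # 5 classes, chosen arbitrarily

shape_feat = pointnet_style_encode(points, w1, w2)
probs = fuse_and_classify(image_feat, shape_feat, w_cls)
print(probs.shape)
```

Concatenation is only the simplest way to combine a texture feature with a geometric one; the paper may use a different fusion scheme.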
