MolVision: Molecular Property Prediction with Vision Language Models

Deepan Adak; Yogesh Singh Rawat; Shruti Vyas

arXiv:2507.03283 · cs.CV · July 8, 2025

TL;DR

MolVision introduces a multimodal approach combining molecular images and text using vision-language models to improve molecular property prediction across diverse datasets.

Contribution

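This work integrates visual molecular representations (images of molecular structure) with textual descriptions in VLMs for property prediction, and reports that the added visual information improves prediction performance, particularly when combined with efficient fine-tuning.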

Findings

01. Visual information enhances prediction accuracy.

02. Multimodal fusion improves generalization.

03. Fine-tuning with LoRA boosts performance.
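
For readers unfamiliar with LoRA, the sketch below illustrates the general low-rank update it relies on: a frozen weight matrix W is adjusted by a trainable product B·A scaled by alpha / r. This is a generic illustration of the technique, not the paper's training code; the matrix sizes, rank, and `alpha` value are assumptions chosen only to keep the example small.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the rank (rows of A)."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # shape (d_out, d_in), same as W
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def count_params(m):
    """Number of entries in a matrix."""
    return len(m) * len(m[0])


# Illustrative sizes only (assumed, not taken from the paper).
d_out, d_in, r = 6, 8, 2
rng = random.Random(0)

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
b = [[0.0] * r for _ in range(d_out)]                               # trainable, zero-initialised

w_eff = lora_effective_weight(w, a, b, alpha=4.0)

trainable = count_params(a) + count_params(b)
full = count_params(w)
print(f"trainable LoRA parameters: {trainable} of {full}")
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen one, and only the two small matrices A and B are updated during fine-tuning.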

Abstract

Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally less informative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure as images and textual descriptions to enhance property prediction. We construct a benchmark spanning ten diverse datasets, covering classification, regression and description tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such…
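
As a rough illustration of the few-shot setting described in the abstract, the sketch below assembles an interleaved prompt that pairs each molecule's image with its SMILES string. Everything here is hypothetical: the `MoleculeExample` class, the `build_few_shot_prompt` helper, the message schema, the file paths, and the placeholder labels are assumptions for illustration, not the paper's actual prompt format or data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MoleculeExample:
    """One molecule: its SMILES string, a rendered image, and an optional label."""
    smiles: str
    image_path: str               # hypothetical path to a 2D structure image
    label: Optional[str] = None   # None for the query molecule


def build_few_shot_prompt(
    task: str, shots: List[MoleculeExample], query: MoleculeExample
) -> List[Dict[str, str]]:
    """Interleave image and text parts: instruction, labelled shots, then the query."""
    parts: List[Dict[str, str]] = [{"type": "text", "text": task}]
    for ex in shots:
        parts.append({"type": "image", "path": ex.image_path})
        parts.append({"type": "text", "text": f"SMILES: {ex.smiles}\nAnswer: {ex.label}"})
    # The query has the same layout but leaves the answer open for the model.
    parts.append({"type": "image", "path": query.image_path})
    parts.append({"type": "text", "text": f"SMILES: {query.smiles}\nAnswer:"})
    return parts


# Placeholder examples; the labels are made up and not dataset annotations.
shots = [
    MoleculeExample("CCO", "images/mol_0.png", "<label A>"),
    MoleculeExample("c1ccccc1", "images/mol_1.png", "<label B>"),
]
query = MoleculeExample("CC(=O)O", "images/mol_2.png")

prompt = build_few_shot_prompt("Predict the property of the molecule.", shots, query)
for part in prompt:
    print(part)
```

Setting `shots` to an empty list gives the corresponding zero-shot prompt under the same assumed layout.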
