TL;DR
MolVision introduces a multimodal approach combining molecular images and text using vision-language models to improve molecular property prediction across diverse datasets.
Contribution
This work integrates visual molecular representations (structure images) with textual descriptions in VLMs for property prediction, and reports that the added visual input improves prediction performance on a benchmark of ten datasets evaluated with nine VLMs.
Findings
Adding molecular structure images as input improves prediction performance.
Multimodal fusion improves generalization.
Fine-tuning with LoRA boosts performance.
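For readers unfamiliar with LoRA, here is a minimal sketch of the general low-rank update it relies on, not the paper's training code. The frozen weight W is augmented as W + (alpha / r) * B A, and only the small matrices A and B are trained. The dimensions, the scaling value, and the helper names (`matmul`, `lora_effective_weight`) are illustrative assumptions, written in plain Python so the arithmetic is easy to follow.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the weight a LoRA-adapted layer applies."""
    scale = alpha / rank
    delta = matmul(b, a)  # shape (d_out, d_in), but its rank is at most `rank`
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


random.seed(0)
d_out, d_in, rank, alpha = 8, 8, 2, 4.0  # illustrative sizes, not taken from the paper

# Frozen pretrained weight (random stand-in values).
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable factors: A gets a small random init, B starts at zero,
# so the adapted layer initially matches the pretrained one.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

w_eff = lora_effective_weight(w, a, b, alpha, rank)

full_params = d_out * d_in           # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)  # parameters updated by LoRA
```

Because B starts at zero, the adapted weight equals the pretrained weight before any training step. The trainable parameter count grows with r * (d_in + d_out) rather than d_in * d_out, which is why this style of fine-tuning is considered efficient for large models.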
Abstract
Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally less informative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure as images and textual descriptions to enhance property prediction. We construct a benchmark spanning ten diverse datasets, covering classification, regression and description tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such…
