TL;DR
ChemVLR is a chemical vision-language model that reasons explicitly: it identifies fine-grained chemical descriptors, such as functional groups, in visual inputs before answering, which yields interpretable solutions and state-of-the-art performance.
Contribution
Introduces ChemVLR, a chemical VLM that prioritizes reasoning through fine-grained visual analysis and a new training framework, and that outperforms existing models.
Findings
ChemVLR surpasses leading proprietary and open-source models in chemical visual reasoning tasks.
The model produces explicit, interpretable reasoning paths for complex chemical problems.
A large-scale dataset of 760k samples was curated for training and evaluation.
Abstract
While Vision-Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question-answering tasks. This paradigm often results in "black-box" systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross-modality reverse-engineering strategy,…
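The descriptor-first idea in the abstract (identify granular chemical descriptors, then reason, then answer) can be pictured with a small sketch. This is only an illustration of the output ordering, not the authors' implementation or data format: the names `ReasoningTrace` and `render_trace`, the field layout, and the esterification example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReasoningTrace:
    """Hypothetical container: descriptors are recorded before the answer."""
    question: str
    descriptors: List[str] = field(default_factory=list)  # e.g. functional groups
    steps: List[str] = field(default_factory=list)        # explicit reasoning path
    answer: str = ""


def render_trace(trace: ReasoningTrace) -> str:
    """Render the trace so perception output precedes reasoning and answer."""
    lines = [f"Question: {trace.question}", "Identified descriptors:"]
    lines += [f"  - {d}" for d in trace.descriptors]
    lines.append("Reasoning:")
    lines += [f"  {i}. {s}" for i, s in enumerate(trace.steps, start=1)]
    lines.append(f"Answer: {trace.answer}")
    return "\n".join(lines)


# Illustrative example only; not taken from the paper's dataset.
example = ReasoningTrace(
    question="Which product forms when the pictured acid reacts with ethanol "
             "under acid catalysis?",
    descriptors=[
        "carboxylic acid (-COOH) in the pictured molecule",
        "primary alcohol (-OH) in the reagent",
    ],
    steps=[
        "A carboxylic acid and an alcohol under acid catalysis undergo "
        "Fischer esterification.",
        "The acid's OH is replaced by the alcohol's alkoxy group, releasing water.",
    ],
    answer="The ethyl ester of the pictured acid, plus water.",
)
rendered = render_trace(example)
```

Calling `print(rendered)` shows the descriptors listed ahead of the reasoning steps and the final answer, which is the ordering that makes the reasoning path inspectable.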
