TL;DR
L2M3OF is a multimodal large language model designed for understanding and discovering metal-organic frameworks (MOFs) by integrating structural and textual data, outperforming existing language models in property prediction and knowledge generation.
Contribution
This paper introduces L2M3OF, the first multimodal LLM for MOFs, combining crystal structure learning with language understanding to enhance materials discovery.
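One reviewer below describes the design as a "frozen encoder + lightweight bridge". The following is a minimal sketch of that general pattern, not the authors' implementation: it assumes the bridge is a single linear projection and that the projected structure vector is prepended to the text token embeddings. The function names, dimensions, and values are all hypothetical.

```python
import random


def linear_bridge(structure_embedding, weight, bias):
    """Project a structure embedding (length d_struct) to length d_llm."""
    return [
        sum(w * x for w, x in zip(row, structure_embedding)) + b
        for row, b in zip(weight, bias)
    ]


def build_multimodal_input(structure_embedding, text_token_embeddings, weight, bias):
    """Prepend the projected structure vector to the text token embeddings."""
    structure_token = linear_bridge(structure_embedding, weight, bias)
    return [structure_token] + text_token_embeddings


# Hypothetical sizes and values, for illustration only.
D_STRUCT, D_LLM = 4, 3
rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(D_STRUCT)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

# Stand-in for the output of a frozen crystal-structure encoder.
structure_embedding = [0.5, -1.0, 0.25, 2.0]
# Stand-in for the LLM's embeddings of two text tokens.
text_token_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

sequence = build_multimodal_input(
    structure_embedding, text_token_embeddings, weight, bias
)
print(len(sequence), "embeddings of width", len(sequence[0]))
```

In a setup like this only the bridge weights would be trained while the encoder stays frozen, which keeps the number of trainable parameters small.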
Findings
L2M3OF outperforms state-of-the-art closed-source LLMs in property prediction.
The model effectively integrates structural and textual information.
L2M3OF requires fewer parameters than comparable models.
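Property-prediction results like the one in the findings above are commonly scored with mean absolute error (MAE), the metric that one of the reviews further down quotes for accessible surface area. Below is a small sketch of the computation. The numbers are made up for illustration and are not results from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up surface areas in m^2/g (not data from the paper).
reference = [1200.0, 850.0, 2300.0]
predicted = [1000.0, 900.0, 2600.0]

mae = mean_absolute_error(reference, predicted)
print(f"MAE = {mae:.1f} m^2/g")
```

Because MAE is reported in the units of the property itself, its practical meaning depends on the typical magnitude of that property across the dataset.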
Abstract
Large language models have demonstrated remarkable reasoning capabilities across diverse natural language tasks. However, comparable breakthroughs in scientific discovery are more limited, because understanding complex physical phenomena demands multifaceted representations far beyond language alone. A compelling example is the design of functional materials such as MOFs, which are critical for a range of impactful applications like carbon capture and hydrogen storage. Navigating their vast and intricate design space in language-based representations interpretable by LLMs is challenging due to the numerous possible three-dimensional atomic arrangements and strict reticular rules of coordination geometry and topology. Despite promising early results in LLM-assisted discovery for simpler materials systems, MOF design remains heavily reliant on tacit human expertise rarely codified in textual…
Peer Reviews
Decision·Submitted to ICLR 2026
The strengths of this work are as follows:
- Addresses a neglected modality gap by integrating structure and language for reticular materials.
- Contributes a substantial new dataset (MOF-SPK) of >133k entries with curated properties and literature links.
- Evaluates systematically across four tasks with both open and closed LLMs.
- Uses a methodologically clean hybrid design (frozen encoder + lightweight bridge).
- Includes a joint-training ablation (Table 3) that convincingly shows cross-task synergy.
The weaknesses of this work are as follows:
- Reproducibility is a concern because the dataset and code are not released, and the MOF-SPK curation process is only briefly described.
- The comparison to closed models (GPT-5, Gemini) is biased: the lack of uniform prompt design and temperature settings limits its validity.
- Because the projection-bridge multimodal alignment is standard, there is little architectural innovation beyond dataset scale.
- The writing of the work can also be improved to
**Goals and novelty**
- The paper tackles an interesting and potentially important task of serving as an AI assistant for researchers investigating MOFs
- The approach of using a structure encoder to extract material representations combined with a natural language LLM for parsing questions is interesting and, to my knowledge, novel.

**Dataset**
- The dataset curated by the authors is potentially useful for other researchers in this area

**Writing**
- The writing of the paper is generally clear
**Novelty**
- The L2M2OF model is just a fine-tuned LLM, which is not particularly novel from a machine learning perspective
- The L2M3OF model underperforms L2M2OF across many tasks, and it is unclear whether it is actually better on many of the higher-level tasks.

**Evaluation**
- The only baselines are non-fine-tuned LLMs
- The description and Q&A tasks are evaluated using another non-fine-tuned LLM as the judge, which is potentially biased and inaccurate as a metric.
- The ground truth labels for the
- The approach of applying language models to MOF analysis is reasonable and timely (but has been done before)
- The paper develops both multimodal and text-only variants for comparison
**1. Poor absolute performance on basic properties:** The model's performance on fundamental geometric properties is concerning. For accessible surface area, the MAE exceeds 250 m²/g (L²M³OF) and approaches 500 m²/g (L²M²OF). These errors are substantial and raise serious questions about practical utility. The model appears good only because of the baselines chosen.

**2. Inappropriate baseline selection:** The critical flaw is comparing exclusively against general-purpose LLMs (GPT-5, Gemini, D
