TL;DR
MicroWorld improves how multimodal large language models reason about microscopy by augmenting them with a large, structured knowledge graph at inference time, with no domain-specific fine-tuning.
Contribution
MicroWorld introduces a novel framework that constructs a large-scale biomedical knowledge graph and uses it to augment reasoning in MLLMs without fine-tuning.
Findings
37.5% improvement on MicroVQA benchmark
6.0% performance gain on MicroBench
State-of-the-art results achieved
Abstract
Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge the gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image-caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a…
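To make the graph structure in the abstract concrete, here is a minimal sketch of an attributed property graph with typed edges and embedding-based lookup. Everything in it is an assumption for illustration: the class and function names (`Node`, `PropertyGraph`, `retrieve_context`), the 2-D toy embeddings, and the `located_in` relation are hypothetical, and the retrieval step is a generic cosine nearest-neighbour lookup rather than the paper's procedure, which the truncated abstract does not describe.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    node_id: str
    kind: str                  # illustrative label, e.g. "entity" or "image"
    embedding: List[float]     # vector in a shared embedding space
    attrs: Dict[str, str] = field(default_factory=dict)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class PropertyGraph:
    """Nodes carry attributes and embeddings; edges carry a relation type."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: Dict[str, List[Tuple[str, str]]] = {}  # src -> [(relation, dst)]

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self.edges.setdefault(src, []).append((relation, dst))

    def nearest(self, query: List[float], k: int = 1) -> List[Node]:
        """Return the k nodes whose embeddings are most similar to the query."""
        ranked = sorted(
            self.nodes.values(),
            key=lambda n: cosine(query, n.embedding),
            reverse=True,
        )
        return ranked[:k]

    def facts(self, node_id: str) -> List[str]:
        """Render the outgoing typed edges of a node as text lines."""
        return [f"{node_id} --{rel}--> {dst}" for rel, dst in self.edges.get(node_id, [])]


def retrieve_context(graph: PropertyGraph, query: List[float], k: int = 1) -> List[str]:
    """Collect edge facts around the k nearest nodes, e.g. to prepend to a prompt."""
    lines: List[str] = []
    for node in graph.nearest(query, k):
        lines.extend(graph.facts(node.node_id))
    return lines


# Toy graph with made-up 2-D embeddings (real embeddings would come from a model).
graph = PropertyGraph()
graph.add_node(Node("mitochondrion", "entity", [1.0, 0.0]))
graph.add_node(Node("nucleus", "entity", [0.0, 1.0]))
graph.add_node(Node("cytoplasm", "entity", [0.6, 0.6]))
graph.add_edge("mitochondrion", "located_in", "cytoplasm")  # hypothetical relation type

context = retrieve_context(graph, [0.9, 0.1], k=1)
print(context)
```

In this sketch the retrieved facts are plain strings that could be added to a model's prompt; at the scale the abstract reports (about 111K nodes), a linear scan like `nearest` would typically be replaced by an approximate nearest-neighbour index.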
