AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory
Zhiqian Lan, Yuxuan Jiang, Ruiqi Wang, Xuanbing Xie, Rongkui Zhang, Yicheng Zhu, Peihang Li, Tianshuo Yang, Tianxing Chen, Haoyu Gao, Xiaokang Yang, Xuelong Li, Hongyuan Zhang, Yao Mu, Ping Luo

TL;DR
AutoBio introduces a comprehensive simulation framework and benchmark for evaluating robotic automation in biology laboratories, addressing the need for high-precision, multimodal robotic manipulation in scientific workflows.
Contribution
It provides a novel simulation environment with specialized physics and rendering for biology labs, enabling standardized evaluation of language-guided robotic tasks in scientific settings.
Findings
Baseline models show significant gaps in precision and visual reasoning.
The benchmark covers tasks of varying difficulty levels.
AutoBio facilitates reproducible research in robotic laboratory automation.
Abstract
Vision-language-action (VLA) models have shown promise as generalist robotic policies by jointly leveraging visual, linguistic, and proprioceptive modalities to generate action trajectories. While recent benchmarks have advanced VLA research in domestic tasks, professional science-oriented domains remain underexplored. We introduce AutoBio, a simulation framework and benchmark designed to evaluate robotic automation in biology laboratory environments, an application domain that combines structured protocols with demanding precision and multimodal interaction. AutoBio extends existing simulation capabilities through a pipeline for digitizing real-world laboratory instruments, specialized physics plugins for mechanisms ubiquitous in laboratory workflows, and a rendering stack that supports dynamic instrument interfaces and transparent materials through physically based rendering. Our…
Peer Reviews
Decision: ICLR 2026 Poster
- Useful task suite in an area of robotic manipulation with few realistic benchmarks
- Careful consideration of physics, rendering, assets, etc. in the context of biology tasks
- VLA and IL baselines are relevant and highlight weaknesses in more complex tasks
- The presentation is clear and contributions well-explained
- Seeing as the realistic assets, physics, and rendering are a central focus, validation on a real robot setup (even on the simpler tasks) would support claims of realism
- The discussion notes that VLAs may perform well as multi-task agents, but this setting is not evaluated
Overall, the paper provides clear motivation for a focused effort on targeting simulation for the biology lab use-case, a significant area of both research and industrial importance, and therefore has the potential for high impact. The work is a high-quality, well-executed set of improvements targeting the precise difficulties of simulation in this domain. In particular, the improvements to base MuJoCo (both physics and rendering) are extremely relevant to the domain. In particular, interacting
Especially given that most of the work is aimed at achieving stronger realism, the paper would be strongly improved by any real-world experiments demonstrating that the methods result in transfer onto real robotic hardware. A significant aspect of the domain is executing longer-term procedures, and much of the work is motivated by the long task horizons. However, the main text has minimal emphasis on this aspect in the experiments / tasks, and the long-horizon task in the appendix is simply con
### Quality
- Design makes sense and is well-explained.
- Task difficulty clearly matters and is differentiating, which is key - I find this to be the hallmark of a good benchmark
- Failure analysis is really promising (see weaknesses for thoughts on how to improve, but great addition)
- Experiment suite is very good! I especially appreciate the holdout experiments, both because I feel it's important to fundamental ML science, and because I suspect this is a domain where there will be plenty o
### Quality
- Some of the simulation features presented already exist elsewhere, though I suspect not with the fidelity/design choices needed here, so they do not take away from the contributions of this paper. However, they would be worth adding to the related work: transparency and liquids are simulated in multiple simulators including OmniGibson (BEHAVIOR-1K, Li et al. 2022), and a threading mechanism is used in TRANSIC (Jiang et al., 2024).
- Failure analysis needs more detail. Would benefit from examples of fail
Taxonomy
Topics: Multimodal Machine Learning Applications · Machine Learning in Materials Science · Robot Manipulation and Learning
