Synergizing Chemical and AI Communities for Advancing Laboratories of the Future
Saejin Oh, Xinyi Fang, I-Hsin Lin, Paris Dee, Christopher S. Dunham, Stacy M. Copp, Abigail G. Doyle, Javier Read de Alaniz, Mengyang Gu

TL;DR
This paper explores how AI and machine learning can transform modern chemical labs by speeding up experiments and data analysis.
Contribution
The paper introduces AI agents and ML models to streamline chemical research through predictive modeling and automation.
Findings
ML models can accelerate the design-build-test-learn cycle in chemical experiments.
AI agents can assist researchers in acquiring background knowledge and accelerating discovery.
Case studies demonstrate ML and AI reducing manual experiments and data analysis.
Abstract
The development of automated experimental facilities and the digitization of experimental data have introduced numerous opportunities to radically advance chemical laboratories. As many laboratory tasks involve predicting and understanding previously unknown chemical relationships, machine learning (ML) approaches trained on experimental data can substantially accelerate the conventional design-build-test-learn process. This outlook article aims to help chemists understand and begin to adopt ML predictive models for a variety of laboratory tasks, including experimental design, synthesis optimization, and materials characterization. Furthermore, this article introduces how artificial intelligence (AI) agents based on large language models can help researchers acquire background knowledge in chemical or data science and accelerate various aspects of the discovery process. We present three…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
1
2
3
4
5
6
7
8
9| Cheminformatics package | Languages | Strength |
|---|---|---|
| OpenBabel | C++, Python, Java | Format conversion, Structure search |
| RDKit | C++, Python | Molecular analysis, ML |
| CDK | Java | Computational chemistry, Bioinformatics |
| Predictive models | Python packages | R packages |
|---|---|---|
| Linear regression | scikit-learn | stats, |
| Tree-based models | scikit-learn, XGBoost | randomForest, |
| Gaussian processes | scikit-learn, GPyTorch | RobustGaSP, |
| Neural networks | PyTorch, | torch, |
- —National Science Foundation10.13039/100000001
- —National Science Foundation10.13039/100000001
- —University of California10.13039/100005595
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsScientific Computing and Data Management · Machine Learning in Materials Science · Various Chemistry Research Topics
Introduction
Laboratory experiments are one of the most critical conduits to advance basic science and technology. In recent years, the field of chemistry has experienced numerous significant milestones in accelerating laboratory experiments with the introduction of critical techniques, including laboratory automation, high-performance computing, machine learning (ML) algorithms, and artificial intelligence (AI) agents based on large language models (LLMs). These advancements automate various laboratory processes, ranging from synthesis and purification to characterization and data analysis with minimal human intervention, stimulating the transition toward self-driving laboratories. ?−? ? ? ? ? ? ?
Figure shows a timeline of the introduction of selective high-throughput experimentation (HTE)/lab automation approaches, ML/AI algorithms, and LLMs over the past three decades. Although automated and self-driving laboratories are a relatively new concept, tools for tracking and cataloging data for experimentation, such as laboratory information management systems (LIMS)? and electronic laboratory notebooks (ELNs), ?,? were conceptualized 30–40 years ago. As data acquisition and processing became increasingly multistep and time-consuming, automated and parallel operations of HT experiments have evolved in different areas. ?,?,?
A brief timeline for the major developmental milestones of HTE/lab automation, ML/AI algorithms, and LLMs for the laboratories of the future.
HTE has been developed through the standardization and automation of laboratory workflows. In the 1980s, the integration of robotics and analytical tools enabled high-throughput screening in pharmaceutical research. ?,? The first commercial LIMS was released around the same time.? Also, the wide adoption of the 96-well plate in the early 1990s established a practical standard for parallel assays and screening, enabling reproducible experimentation in large sample volumes.? By the mid-1990s, the concept of ELN began to emerge? that enable researchers to capture their scientific observations and laboratory data electronically. In the early 2000s, automation expanded beyond screening, with reports of end-to-end systems integrating synthesis, purification, and characterization, marking a shift toward autonomous materials discovery workflows.? Additionally, the integration of active learning and Bayesian optimization into the process of Design of Experiments (DoE), ?,? and the emergence of lab automation hardware companies improved the efficiency in exploration of multidimensional parameter spaces. Finally, from the late 2010s, closed-loop, self-driving laboratories emerged, in which machine-learning models iteratively proposed and executed experiments with minimal human intervention. ?,?
The hardware of laboratory research has evolved along with the computational tools capable of powering the feedback loops that guide operations. Linear regression or linear models, for instance, have been widely used as a tool of chemometrics in 1980s.? Algorithms, such as backpropagation,? one of the most useful approaches to optimize artificial neural networks,? were formed in the 1970s and formally introduced in the early 1980s for building an autoencoder, a fundamental approach to represent latent information between inputs and outputs.? The 1990s saw the development of kernel methods, such as support vector machines (SVMs),? to encode similarities of multidimensional inputs by kernels for classification. The early 2000s saw the formalization of ensemble tree techniques, such as random forests, and probabilistic models, including Gaussian processes, for nonlinear regression and classification problems with small to moderate data sizes. ?−? ? ? ? With the arrival of massive data collections of text and images on the Internet, different architectures of neural networks, such as convolutional neural networks and recurrent neural networks, were developed and evolved to be more flexible and accurate for tasks such as image classification and segmentation. ?−? ? Furthermore, probabilistic data reduction tools and generative models, including variational autoencoder? and denoising diffusion probabilistic models ?,? have been applied for protein structure prediction and design. ?,? The development of neural network architectures ?,? and their profound impacts in predicting protein structures ?,?,? was awarded the 2024 Nobel Prizes in Physics and Chemistry, respectively. Trained by simulated or experimental data, ML methods can be routinely used as models for predicting untested inputs, ?,? which can facilitate operations in almost all areas of laboratory science, including experimental design, synthesis optimization, and materials characterization. ?,? However, significant challenges remains for applying ML approaches in accelerating laboratory research. For instance, the performance of the ML approach is highly dependent on the quality and quantity of the training data, and predictions may be unreliable when data is scarce or systematically biased. Furthermore, some complex ML models, such as neural networks and their variants, lack interpretability and require manual tuning of hyperparameters. Consequently, it is difficult for researchers to understand their underlying chemical mechanisms and reliably use them in different applications. Thus, high-quality data acquisition and understanding different ML model assumptions are key elements for successfully selecting and applying ML approaches to accelerate chemical laboratory research.
The development of high-throughput experimentation and data digitization has facilitated the deployment of AI/ML approaches to replace labor-intensive tasks in chemical laboratories by automated design and discovery processes.
Over the past decade, LLMs based on transformer architecture? have gained tremendous attention across the world for text generation, and have opened up a new era of scientific research. The transformer, a neural network architecture for training LLMs, for instance, inspired the development of the Generative Pretrained Transformer (GPT), ?,? and other LLM models, such as Claude, Gemini, Llama, Qwen, and DeepSeek. ?−? ? ? ? The versatility of LLMs for use in a variety of operations, ranging from literature summary to computer code generation, reduces barriers to learning new disciplines and facilitates interdisciplinary collaboration, which has started to transform the paradigm in chemical laboratory research. ?,? Despite the remarkable capabilities, LLMs also have notable limitations. These models can produce information that appears plausible yet is actually incorrect, and their performance remains inferior to domain experts when addressing complex chemical problems that require specialized knowledge. Additionally, LLM outputs exhibit randomness, which necessitates careful validation.
Today, we stand at a pivotal moment for radically transforming laboratory research and education. Traditional chemical laboratories require significant human labor for manual experimental designs, product screening, and data analysis, which can be substantially accelerated by robotic systems and AI agents, illustrated by the workflows in Figure. The LLMs and ML predictive models can encode multiscale, cross-disciplinary information, enabling scalable and accurate prediction for a large number of test samples, thereby substantially reducing the experimental cost and time. However, many researchers, particularly in experimental science, are unsure where to begin and what ML methods they should use to minimize deployment effort and cost. Although research tasks can be drastically different between chemical science communities, many involve forming, predicting, and understanding chemical relationships, i.e. f: x → f(x), where x can be descriptors of molecules, chemicals, experimental conditions or experimental outcomes, such as microscopy images and scattering curves, and f is a function that maps the input to system properties, such as conductivity, chemical reaction yields, structural and mechanical properties of the materials. Our modern world is built upon the discovery of maps that accurately predict previously unknown relationships. In the past, however, to discover the underlying principles of a new system, chemists often relied on time-consuming lab experiments and manual analysis of data in a traditional lab.
Laboratory workflows automated and accelerated by agentic AI.
Two critical advances have paved the way for data-driven discovery of unknown relationships in chemical science. First, experimental and simulation data have gradually become digitalized, enabling the use of fundamental statistical learning principles, such as Bayes’ theorem, to automatically update rules from the status quo, or prior distribution, to a new paradigm, or posterior distribution, by conditioning on new data. Second, ML models have advanced over the years to learn complex relationships from data, such as numerical values, texts, and sequences, which can substantially reduce time and computational cost for analyzing complex data. Through the lens of these changes, this outlook article will assess the current status of chemical laboratory research, highlight existing gaps, and suggest a path for uniting experimental and computational communities to accelerate progress.
Accelerating Data Collection and Processing
Data Acquisition
Materials synthesis, characterization, and simulation are three main sources of chemical data, shown in Figure(a), which produce, for instance, molecular sequences, curves, images, and videos (Figure(b)). The key goals are to accelerate and automate data collection, processing, and featurization (Figure(c)) for guiding the process of learning chemical relationships.
(a–c) Data collection, processing, and featurization in chemical research.
First, advances in automation are transforming the way materials are synthesized and fabricated for downstream analysis. ?,?,? Robotic platforms can be flexibly programmed to perform a range of chemical reactions and formulations with high precision and reproducibility, enabling parallel experimentation in multiwell plate formats. ?,?,? Flow chemistry further extends automation by providing continuous control over reaction conditions, incorporating in-line characterization tools for real-time monitoring, and improving safety when handling hazardous compounds. ?−? ? Once reactions are complete, automated flash purification systems and preparative high-performance liquid chromatography? streamline isolation of small molecules and can be adapted to generate well-defined polymer libraries with minimal human intervention. ?,? Beyond producing physical samples, these automated platforms generate distinct types of records, including molecular structures, reaction conditions, and experimental procedures, which can be digitized into machine-compatible formats. For instance, information on molecular structures can be converted into SMILES and SELFIES strings. ?−? ? Furthermore, efforts are being made to standardize experimental procedures, such as the Open Reaction Database? and Chemical Description Language,? for training ML models to optimize synthesis and reaction conditions. Commonly used methods to represent discrete inputs include one-hot encoding, which expresses discrete inputs by sequences of ‘0’ and ‘1’, and molecular fingerprints by numerical vectors. ?−? ? Encoding these methods helps bridge synthesis outputs with machine learning models that can analyze reaction trends and accelerate discovery.
Second, a wide range of materials characterization tools, including microscopy, rheology, spectrometry, scattering, and spectroscopy, have been developed. These tools generate images, time-series data, spectra, or other quantitative values in chemical laboratories. Data processing tools, such as image segmentation and particle tracking,? have been developed for extracting and linking data from microscopy images. These data processing tools have been implemented into software packages, such as ImageJ and Fiji, ?,? which contain easy-to-use graphical user interfaces (GUIs), empowering users to view and analyze large quantities of data, particularly useful for biochemical research.? The availability of a high volume of labeled data enables the development of more accurate supervised learning tools, such as Cellpose,? which utilizes a large database of labeled data to train U-Net,? a convolutional neural network for segmenting cells from microscopy images. For more challenging scenarios, such as capturing optically dense systems and fast dynamics, Fourier-based tools, e.g. differential dynamic microscopy (DDM), ?,? remove the need to segment particles to extract system information, e.g. mean squared displacement of the particles, that determine the mechanical properties (storage, loss modulus). ?,? Building upon existing tools, it is possible to construct probabilistic generative models and automated estimators for existing data processing methods, such as by removing manual selection of the Fourier range in DDM? which otherwise needs to be chosen on a case-by-case manner. ?−? ? ?
Third, computational simulations from distinct space-time length scales can provide scientific insights and a pathway to explore chemical systems before conducting chemical experiments. ?−? ? These simulations can reveal mechanistic insights prior to experimentation but are often limited by large computational and/or storage costs, and the need for accurate model calibration, such as determining the form of observed model parameters. ?−? ? To address this challenge, Meta FAIR has released Open Molecules 2025 (OMol25), a large-scale open-source data set comprising over 100 million density functional theory (DFT) calculations. It aims to provide high-accuracy quantum chemical data to support the development of machine learning models in molecular chemistry.? The past decade witnessed the success of ML surrogate models ?−? ? ? ? ? ? ? for predicting outcomes of expensive simulations, such as the potential energy, force field, and particle density at untested inputs from nanoscale to bulk environment. For example, neural network potentials and Gaussian process regression have been used to accelerate molecular dynamics and DFT calculations. ?,?,? Integrating ML-accelerated simulations into laboratory workflows can reduce the number of experiments in laboratories and guide synthesis toward the most promising targets. Realizing this vision requires closer collaboration between experimental and computational communities, ensuring that simulation-informed predictions are seamlessly incorporated into automated experimentation and data-driven discovery workflows.
As the tools used to inform laboratory operations have expanded and evolved, so has the need to record and manage data from these systems. Software, such as LIMS and ELNs, is capable of providing mechanisms for researchers to catalog and record key experimental data in ways that are searchable, labeled, uniquely identified, and accessible in machine-readable formats. Additionally, digital representations of laboratory protocols and associated data can simplify sharing and enable greater collaboration between researchers. The information in an ELN can be utilized to provide training data to update data-driven methods for prediction and optimization. Because of these advantages, physical notebooks of laboratories are gradually being replaced by ELNs. ?,? Furthermore, data from an ELN can be stored in or connected to a LIMS to enable comprehensive lab data management. ?−? ? Together, ELN and LIMS serve as tools that can foster open access data for researchers to retrieve, review, and analyze.
Input Featurization
and Visualization
As the input or descriptor x is not often available to learn chemical relationships f(x), domain knowledge, cheminformatics, and simulation are often used to generate feature sets that capture underlying chemical structures. Representative cheminformatics packages, including OpenBabel, RDKit, and CDK, have been integrated with popular programming languages (Table),? which enables processing scientific data to obtain meaningful input features for a wide range of problems.
1: Examples of Typical Cheminformatics Packages
Featurization of molecules and materials depends on available data, and they are often problem-specific. For applications involving categorical variables, such as catalyst type or solvent selection, one-hot encoding can represent each category as a binary vector. Yet this approach significantly increases the feature dimension when dealing with a large number of categories. To predict molecular properties, molecular fingerprints are widely used to encode molecules as fixed-length binary vectors, where each entry indicates the presence of a specific substructure,? This approach can rapidly screen similarity across large molecular libraries and it is widely used in pharmaceutical virtual screening and drug similarity searches. However, most molecular fingerprints provide 2-dimensional structural similarity and do not capture 3-dimensional geometry or electronic effects. To accurately capture 3-dimensional spatial configurations, electronic effects, and physicochemical interactions, computational tools, such as DFT calculations and molecular dynamics simulations, provide quantitative characterization of systems, such as partial charges and activation energies, which can be used as descriptors or features. ?,? A drawback of these computational models is their high computational cost, and the assumptions of the computational models may not be satisfied in some real-world systems. To reduce the high computational cost, ML surrogate models have been developed to predict system properties, including potential energy, electronic density, and atomic forces, from atomic features, such as inverse pairwise distances between atomic coordinates. ?,?,?,?−? ? Furthermore, the discrepancies between the computational models and reality can be modeled by Gaussian processes or their variants through model calibration approaches. ?−? ?
Accelerated data collection, processing, and featurization through computational and data-driven approaches enabled constructing advanced predictive models that encode multiscale, cross-disciplinary information to rapidly screen huge input spaces and substantially reduce experimental cost and time.
Furthermore, exploratory data analysis tools are commonly used for visualization and featurization.? A common challenging scenario for featurization involves high-dimensional data, including curves, images, or videos, and discrete inputs such as molecular sequences and graphs. Unsupervised dimension reduction tools, such as principal component analysis (PCA),? t-distributed stochastic neighbor embedding (t-SNE),? uniform manifold projection and reduction (UMAP),? dynamic mode decomposition,? autoencoders and decoders, ?,? are developed for extracting features of high-dimensional data. These methods can be used to visualize the high-dimensional data sets, and the vectors with a reduced dimension can be input as the features for ML models. As some information on the data sets will be inevitably lost in data reduction processes, ?−? ? it is crucial to understand the underlying assumptions of these data-driven approaches. For instance, the linear subspace of PCA is equivalent to the maximum marginal likelihood estimator of a probabilistic latent factor model,? and one can examine whether the probabilistic model is suitable to represent the data sets for dimension reduction. Domain knowledge, such as physical and chemical principles, can also be used to reduce the dimension of data and improve the accuracy of noisy experimental data. For instance, for classifying phases of block copolymers by small-angle X-ray scattering (SAXS) data, using several features relevant to the location, width, and curvature of the primary peaks of the X-ray curves substantially improves the predictive accuracy of ML models compared to using the entire curve as input in ML models.? Furthermore, scattering measurements were used to estimate the micelle structure of block copolymer solutions inversely,? and ML surrogate models can improve the inverse estimation by learning the map from reduced-dimensional features of micelle structural parameters to scattering patterns.?
The overarching goal of featurization is to inform the similarity of chemical candidates in terms of their system properties. A common challenge of featurization involves discrete or categorical inputs, such as different types of atoms, molecules, and chemical bonds. Compared with numerical inputs, discrete inputs are more challenging to model due to the lack of ordering between the inputs. ML models have achieved success for predicting discrete sequences in some applications, including transformers in LLMs that predict the next text token given the context,? and AlphaFold that maps amino acids to protein spatial structure.? These advanced ML approaches are capable of learning latent features embedded in a continuous space from a large number of samples. Large opportunities exist to emulate the success in these examples in other areas of chemical and materials sciences, by identifying objective functions, constructing standardized data sets, and developing novel ML architectures for discovering complex relationships from high-dimensional discrete inputs or mixed discrete and continuous inputs.
Learning Chemical Relationships
by Predictive Models
Predictive Models
A predictive model, sometimes referred to as statistical methods of chemometrics by chemists,? is an indispensable component for learning chemical relationships. A common goal is to predict system properties for a given input or condition and quantify the uncertainty of the prediction, a process typically involving training a data-driven predictive model. We will start from regression, in which a continuous-valued outcome is typically modeled by
where f(x) that maps the input vector to system properties, and ϵ is an independent noise, often assumed to follow a zero-mean Gaussian distribution. We introduce four classes of widely used predictive models listed in Figure. All these models can be generalized to predict categorical data and counts (often called classification), which utilizes a link function to convert continuous numerical predictions into probabilities. For example, a logistic function? can map any real number to a probability value between 0 and 1.
Data-driven predictive models for chemical research.
A linear model or linear regression is potentially the oldest and most widely used benchmark model. In linear regression, the relationship is assumed to be linear: , where β_0_ is the coefficient of an intercept, x _ j _ and β_ j _ are the jth input variable and coefficient, respectively. Statistical theory has been well established for estimating the coefficient of linear regression, along with their uncertainties, for noisy observations. Due to the assumption of linearity, linear regression typically does not require large amounts of data to estimate the parameters. However, when processing chemical measurements (such as spectral data), extremely high dimensionality and multicollinearity- of the data could cause traditional regression approaches to fail. Partial least-squares regression extends linear models to high-dimensional spaces,? which integrates the dimension reduction idea of PCA. In addition to prediction, linear methods offer a rigorous framework for statistical inference, hypothesis testing, and variable selection for automating model construction. ?,? Therefore, though the predictive power of a linear model is constrained by its restrictive assumption, the interpretability and the ease of fitting the linear model make it a suitable benchmark model to estimate unknown chemical relationships.
Tree-based methods? assume locally constant relationships through partitioning the input variable or feature space: , where is an input subspace and c _ m _ is a fitting value of the subspace.? Ensembles of trees, such as random forests ?,?,? and gradient-boosted trees,? construct multiple trees for prediction or classification, and they are widely used for their robustness and ability to model nonlinear relationships. Random forests, for instance, construct multiple decision trees in parallel, each trained on a bootstrap sample generated by randomly sampling the observed input-output pairs with replacement. Predictions are obtained by aggregating across all trees, via the majority vote for classification or averaging for regression, which reduces variance and mitigating overfitting. In contrast, gradient-boosted trees are built sequentially, with each new tree focusing on correcting the residuals or errors of the previous model. These methods naturally handle both numerical and categorical inputs, are insensitive to feature scaling, and are computationally efficient. In addition, tree-based methods and ensembles of trees provide feature importance metrics to quantify the contribution of each feature to prediction accuracy, which allows researchers to identify key features that dominate the properties of molecules or materials. Tree-based methods, however, may not be efficient to capture smooth functional relationships, as they are constrained by the assumptions that outputs can be approximated by the ensembles of local predictors of the input subspaces.
Gaussian process regression (GPR) utilizes a kernel function to characterize the distance between the input space.? Any latent functions evaluated at n input values, , are assumed to follow a multivariate normal distribution in GPR: , where μ is a n-vector of mean often assumed to be constant, and Σ is an n × n covariance matrix that often parametrized by a kernel function. When f(·) is a continuous function, the two outcome values will be similar if the associated two inputs are close to each other. The distance between inputs is characterized by kernel functions in GPR, to measure how close two sets of input conditions, such as experimental conditions and molecular descriptors, are to each other. Commonly used kernel functions include power exponential kernels and Matérn kernels,? and chemical information and physical symmetries can be integrated into kernel functions for prediction. ?,? Conditioning on a set of observations, the predictive mean of GPR provides point predictions, equivalent to solution of kernel ridge regression,? and the predictive intervals of GPR assess the confidence of a prediction, which identifies which experimental conditions are worth further exploration. Compared to linear models and tree-based models, Gaussian processes often require less training data to learn nonlinear relationships, when the underlying map is smooth. The high efficiency with respect to small samples and availability of uncertainty make the Gaussian process a suitable candidate for predictions and design optimization in chemical studies.? When the sample size is large, approximation methods ?,? are often required for GPR due to the computational expense that involves inversion of a large covariance matrix, yet the predictive accuracy can deteriorate.
Artificial neural networks (NNs) are capable of learning intricate patterns from large data sets by stacking multiple layers of nonlinear transformation. ?,? Each layer applies a nonlinear activation function σ(·) after a linear transformation of the input with weight matrix W ^(l)^ and bias vector b ^(l)^, allowing the network to approximate arbitrarily complex functions by combining nonlinear building blocks. Mathematically, a feedforward NN can be represented as f(x) = f ^(L)^(f ^(L–1)^(...f ^(1)^(x))), where each layer follows f ^(l)^(x ^(l–1)^) = σ(W ^(l)^ x ^(l–1)^ + b ^(l)^) with σ(·) acting on each element of the input. The large number of parameters enables neural networks to effectively learn nonlinear input-output mappings, including those relationships that are difficult to model using traditional methods. In recent years, many NN architectures,? such as convolutional neural networks? and recurrent neural networks,? have found great success, particularly for image analysis such as image classification,? segmentation,? generation and inpainting. ?,? As the neural network models often require a large amount of data to train, they are suitable for certain scenarios, such as learning potential energy and atomic forces from simulation, ?,? and segmenting cells from microscopy images.?
Examples of the Python and R packages for the four classes of predictive models are given in Table. These approaches have been widely used for predicting experimental outcomes ?,? or as a surrogate model for approximating computationally expensive simulations.? In practice, it is also critical to have reliable uncertainty quantification of the predictions, expressed as predictive intervals, for optimizing experimental designs? and controlling predictive error.? The predictive intervals of linear regression and Gaussian processes are often used to quantify uncertainty of the prediction, and they can be easily computed. The predictive uncertainty of tree-based methods or ensembles of trees can be computed through Bayesian inference, quantile regression, or asymptotic analysis. ?−? ? Assessing the uncertainty of neural network approaches is still an open area of research, with various methods been developed, including dropout, ensemble samples, and conformal inference. ?−? ? ? ?
2: Examples of Python and R Packages for Predictive Models
Experimental
Design Optimization
Leveraging the predictive power from simulation and ML methods enables the efficient design of experiments to understand an enormous space of molecules and materials. A primary goal of efficient materials design can be mathematically formulated as an optimization problem: x* = arg max_ x _ g(x), where g(x) is the objective function, representing the system property to be maximized or minimized (such as reaction yield) for given input x (such as materials and experimental conditions). DoE offers powerful tools for exploring multivariate variable spaces, particularly for factorized inputs, which can substantially accelerate the conventional process to tune one variable at a time (OVAT).? As experimental data contains noise, and design input space can be enormous, applying traditional optimization methods such as a quasi-Newton method? can be prohibitive, as they typically require gradient information, noise-free outcomes of the objective functions, and a relatively large number of evaluations. To overcome these challenges, a predictive model, such as a Gaussian process, can be used as a probabilistic proxy to sequentially design the next experiments that give the most valuable experimental outcome through an acquisition function (a strategy that balances exploration of new regions with exploitation of known good regions). This process is often referred to as Bayesian optimization or active learning.? The quantified uncertainty from the predictions is crucial to strike a balance between exploration and exploitation for making better predictions and improving the gain function, respectively.?
Filling
the Gaps by LLM Agents
Advancing laboratory research involves a large set of tools and techniques. The rise of LLMs, such as ChatGPT, offers a promising path forward in connecting distinct domains to accelerate learning and problem formulation processes, where LLMs act as the agent at the interface between chemists and data scientists, enabling researchers to quickly learn knowledge in other disciplines. Figure illustrates several potential applications of LLMs, including generating computer code to perform data analysis for chemists and helping computational experts better understand concepts in chemistry which leads to developing new computational tools for predicting polymer phases that will be introduced in the first case study.? By accelerating learning processes and reducing communication barriers, LLMs can serve as helpful mediators to facilitate collaborations between distinct communities.
LLM agents facilitate cross-disciplinary collaboration and skill development in chemical research. (a) Example dialogue of chemists acquiring Python programming skills for data analysis. (b) Skill progression framework from basic computational tools to advanced chemistry-AI applications. (c) Example dialogue of LLM agents helping explain chemical concepts.
Several recent studies have explored the use of LLMs in chemical research, including assisting with coding and framing scientific questions using chemical data. ?−? ? LLMs offer an accessible entry point for novices lacking computational skills, enabling efficient data processing, high-quality visualization,? and generating computer codes with little prior programming experience. ?,? In surveys conducted after introducing LLMs as learning tools, users reported notable improvements in their coding skills, demonstrating that LLMs can accelerate learning with minimal barriers.? Beyond basic use, LLMs can support general chemistry problem-solving,? and they can be fine-tuned for domain-specific tasks to further enhance output quality.? However, the responses from LLMs may not be accurate, and they sometimes may hallucinate. Consequently, LLMs cannot replace experts in many disciplines, and their answers should be verified by domain experts.
Figure provides examples of distinct expertise from chemists and data scientists for building a collaborative workflow with the aid of LLM agents. Conversely, the expertise can contribute to enhancing LLM agents, as LLMs are essentially trained on text sequences, including dialogues, publications, and computer code. To amplify the utilities of LLMs, it is imperative to educate students and researchers on framing the laboratory research tasks as properly defined mathematical problems. Furthermore, As the LLM agents largely remove the barriers of learning and programming, the existing curriculum of chemical science can include more components of statistical machine learning and data analysis with the assistance of LLM agents.
Collaborative workflows between chemists and data scientists facilitated by LLM agents.
LLM agents can serve as an interface between chemists and data scientists, to assist in a variety of tasks, such as lowering the barriers for accessing the domain knowledge and performing data analysis, but their answers should be validated carefully.
Case Studies
Physics-Informed
Machine Learning for Automated Block Copolymer Phase Identification
Nature has long mastered the synthesis and use of well-defined macromolecules in biology. While this level of structural specificity remains out of reach with most synthetic polymers, significant progress has been made in preparing precise polymers and developing new strategies to access well-defined materials in high-throughput. ?−? ? ? ? When these methods leverage common laboratory equipment that is simple to use and broadly available, it can facilitate widespread use in answering fundamental questions or carefully tailoring structure–property relationships for a specific application.? For example, Hawker and co-workers have recently demonstrated the use of automated chromatography to rapidly generate block copolymer libraries.? Block copolymers are an important class of materials that self-assemble into a rich array of nanoscale morphologies.? Key to applications, such as advanced separation membranes, thermoplastic elastomers, photonic crystals, microelectronics, and drug delivery, is the ability to tune self-assembly through synthetic handles, including block chemistry, block sequence, composition, molecular weight, and dispersity using controlled polymerization techniques. ?−? ? ? This long list of structural variables illustrates the difficulty in navigating and controlling a multidimensional design space. Traditional methods of constructing even an incomplete block copolymer phase diagram involve iterative synthesis followed by multiple purification and isolation steps, which are time-consuming and labor-intensive. The repetitive synthesis of multiple block copolymers is also complicated by slight variations in reaction conditions and/or purification that led to undesired differences among samples and the presence of variable amounts of homopolymer impurities.
This process can be substantially accelerated and automated by leveraging the advances of techniques and predictive models shown in Figure. For example, a library of 20 well-defined diblock copolymers, spanning a broad range of compositions, was readily prepared in 1 h from a single parent block copolymer and used to prepare an enhanced phase diagram. ?,?,? Because automated chromatography accelerates polymer library construction so significantly, it is essential to pair it with more efficient methods for mapping phase diagrams of diverse block copolymer chemistries. SAXS can determine the polymer phases of these samples, yet it requires an expert to manually identify the phase of the polymer by interpreting SAXS curves, which is time-consuming. This problem This problem was addressed by BioPACIFIC MIP team through the development of a physics-informed predictive model to automate polymer phase identification from SAXS.? Instead of inputting the entire SAXS data into ML models for classifying polymer phases, the authors extend the Kalman filter? for automated peak detection to extract physics-informed morphological features (PIMF), including the peak locations, width, and sharpness of the peaks. These features are used to construct a random forest model,? suitable for classification problems with a small to medium number of training samples. Identifying the phases of hundreds of samples using the random forest model takes less than a second on a desktop computer, and it can be executed without the help of a computational expert.
Accelerated workflow for block copolymer phase identification comparing traditional (red) and automated (green) approaches. At each decision point, automated approaches reduce time or improve accuracy compared to conventional methods. Adapted with permission from ref . Copyright 2025 Wiley.
The PIMF from SAXS curves substantially improved the predictive accuracy, achieving around 95% out-of-sample accuracy even for predicting new monomers with different volume fractions not in the database for training ML models.? The substantial improvement comes from the integration of polymer theory for featurization in machine learning algorithms for determining polymer phases, which dramatically reduces the dimension of the input space in predictions. Furthermore, the maximum prediction probability from a machine learning model, such as a random forest classifier, can be used for quantifying the uncertainty of the prediction. The assessed uncertainty enables reinspecting a small subset of the samples with maximum prediction probability lower than a prespecified threshold, to achieve near 100% accuracy for polymer phase identification. Furthermore, the authors found 3 samples that were mislabeled by the expert but predicted correctly by the ML model.
As polymer phase identification is a new problem for the data scientists, the LLM was used to efficiently acquire domain-specific knowledge about block copolymer behavior and SAXS curves, as illustrated in Figure(c). This LLM-assisted process accelerates the learning process required in interdisciplinary collaboration. This example illustrates how the integration of advanced experimental approaches, data-driven predictive models, and domain expertise expedites the characterization of structure–property relationships.
ML-Guided Experimental Screening for Discovery
of DNA-Stabilized Silver Nanocluster Fluorophores
DNA-stabilized silver nanoclusters (DNA-Ag_N_) are ultrasmall fluorescent nanoparticles with highly tunable properties. First reported in 2004, DNA-Ag_N_ contains only 10 to 30 silver atoms stabilized by one to three single-stranded DNA oligomers. ?−? ? DNA-Ag_N_ are attractive for their sequence-tuned excitation and emission wavelengths that can be tuned from blue to near-infrared (NIR) by the DNA template sequence. ?,? Together with high quantum yields and extinction coefficients, these properties make DNA-Ag_N_ promising emitters for biosensing, bioimaging, and nanophotonics. ?,? For example, emerging NIR-emitting DNA-Ag_N_ could enable deep tissue imaging within the NIR tissue transparency window, where biological tissues and fluids are highly transparent to electromagnetic radiation.?
The unique sequence-programmed nature of DNA-Ag_N_ presents opportunities to engineer these emitters precisely for specific applications, but DNA-Ag_N_ design is highly challenged by the large number of possible templating DNA sequences. Most sequences do not yield useful fluorescent DNA-Ag_N_, and the rules connecting DNA sequence to DNA-Ag_N_ properties are complex.? Moreover, very few X-ray crystal structures of DNA-Ag_N_ have been reported, and first-principles computational modeling is currently intractable for DNA-Ag_N_ design. ?,?−? ?
Copp, Bogdanov, and coauthors have developed approaches that combine high-throughput experimental synthesis and characterization with ML models ?−? ? ? ? to significantly increase DNA-Ag_N_ design efficacy, using the workflow in Figure. First, automated liquid handling is used to synthesize DNA-Ag_N_ on 10^3^ different DNA oligomers in well plates, with one oligomer sequence per well. The fluorescence spectrum of each sample is then collected using automated fluorimetry with a well plate reader; universal UV excitation via the nucleobases is employed to excite all DNA-Ag_N_ with a single wavelength for rapid fluorimetry. Finally, automated spectral fitting is used to determine the spectral peak parameters for each DNA sequence, thereby generating a large data library that connects DNA sequences to DNA-Ag_N_ fluorescence.
Workflow for ML-enabled DNA-AgN discovery. Experimental DNA-AgN synthesis is performed on 103 DNA oligomers with different sequences, and automated fluorimetry is used to generate training data for ML models. Chemical information guides the choice of the ML problem definition and feature engineering, enabling predictive ML with limited experimental training data and interpretation of sequence-to-property relationships learned by the model. Adapted with permission from ref . Copyright 2022 American Chemical Society.
This data set has been leveraged to train chemistry-informed classification models, due to the quantized “magic number” properties of nanoclusters, which naturally yield certain DNA-Ag_N_ sizes.? Chemically informed featurization has been essential for ML classifiers to learn sequence-to-color relationships, rather than using simple methods such as one-hot encoding. For example, by featurizing DNA sequence using nucleobase “staple” motifs inspired by DNA-Ag_N_ crystal structure,? support vector machines? were trained to predict the emission color class of a DNA-Ag_N_ given input DNA sequence.? To ensure that the ML models are effective for predicting nanocluster properties, it is important to mitigate overfitting and data imbalance issues commonly encountered in experimental nanocluster data sets, which are often limited in size. An imbalanced data set is one that contains relatively few examples of some types of materials while containing more examples of others, e.g., the histogram in Figure panel 2 with very few data points with wavelengths nm as compared to other wavelengths. ML models trained on imbalanced data sets can perform poorly in regions where little training data is available. Regularization techniques by adding a penalty term to the loss function in estimating parameters are effective in reducing overfitting in SVMs, ?,? and classifier ensembles can be used together with data balancing strategies such as subsampling to achieve predictive ML models with limited experimental data sets. More recently, deep learning models that perform automatic feature extraction and enable continuous property design were introduced and demonstrated for DNA-Ag_N_. ?,? Beyond prediction, ML models can provide valuable chemical insights into how DNA sequence influences DNA-Ag_N_ color through interpretability analysis using feature analysis tools such as BorutaSHAP.?
Experiments have verified the efficacy of ML-guided design approaches for DNA-Ag_N_. One of the most notable findings is the discovery of NIR-emitting DNA-Ag_N_, which are rare in training data libraries, yet can be designed at a 12.3 times enhanced success rate using ML-guided sequence selection.? This strategy illustrates the strength of integrating domain knowledge (DNA-Ag_N_ chemistry) and ML algorithms to facilitate the systematic discovery of materials and to enhance fundamental chemical understanding in ways that are not achievable using conventional methods.
Open-Source Bayesian Optimization Tool for Reaction Development
in Small-Molecule Organic Synthesis
Experimental optimization is ubiquitous in small-molecule organic synthesis. These optimization problems are usually high-dimensional, with reaction spaces defined by both categorical variables (e.g., reagent and solvent identities) and continuous variables (e.g., catalyst loading and temperature). A synthetic chemist selects the initial reaction space to explore based on successful conditions for similar reactions, mechanistic reasoning, and chemical intuition, then iteratively performs rounds of experiments with varied conditions to seek the optimum. The most common conventional strategy for exploration of this space, namely OVAT testing, has proven effective, but is inefficient for exploring a large number of variables and overlooks interactions between variables.
Bayesian optimization (BO) is well-suited to reaction optimization, as it can suggest multiple experiments by utilizing the quantified uncertainty of a predictive model in a search space defined by both categorical and continuous parameters, to ultimately identify the global optimum in a low-data regime.? In this setting, a successful BO algorithm could substantially reduce the number of experiments necessary to complete optimization.
In 2021, the Doyle group developed Experimental Design via Bayesian Optimization (EDBO), an open-source Python package for reaction development.? The algorithm was tested with featurization at three different levels of chemical information: one-hot encoding, wherein the algorithm is given only the name of each reaction component and its binary presence or absence in each set of conditions; Mordred cheminformatics descriptors,? which provide basic information about reaction components such as polarizability, molecular weight, and number of aromatic rings; and properties that require DFT calculation, such as the charges at the coordinating atoms of potential ligands in their minimum energy conformers, which are the most computationally expensive but also provide the richest information. When tuned for each level of featurization using experimental data from the chemical literature, the algorithm performed similarly on average during the optimization process. However, DFT-level encoding gave the most consistent results, with its worst optimization trials still converging on a yield within 5% of the optimum. DFT encoding was thus selected for use in the remainder of the study; these results suggest that users with similar search spaces may still find the algorithm effective even with only simple one-hot encoding. Additionally, optimizer tuning revealed the best performance using a Gaussian process surrogate model? and parallel expected improvement? as the acquisition function. This acquisition function suggests batches of experiments that maximize expected utility until the objective is optimized or the reaction space is explored sufficiently that the probability of finding an improved condition is low. This platform can be used in diverse settings for any parametrizable reaction, including everyday bench-scale experimentation and automated systems, making it widely applicable for modern chemical laboratories.
To benchmark the EDBO algorithm’s performance against the choices of human experts, Doyle and co-workers developed a computer game that asked the player to find the highest-yielding conditions for a Pd-catalyzed C–H arylation reaction within a search space of 1,728 possible reaction conditions, defined by three categorical variables (solvent, ligand, and base identity) and two continuous variables (temperature and concentration). To mimic a real laboratory, the resource budget was limited: players chose 5 experiments to run “per workday” and had 20 “workdays” to maximize the yield of the reaction. The experimental outcomes supplied to the players were real, with the yield data for every possible reaction being collected beforehand via HTE.
For performance comparison, 50 expert chemists were asked to play the benchmarking game and the EDBO algorithm was asked to play it a corresponding 50 times (Figurea). While human experts selected higher-yielding conditions on average for the first round of experiments, the optimizer’s average performance surpassed humans’ average performance in only three “workdays” and typically achieved quantitative yield within the first ten. In addition to EDBO’s greater efficiency, it displays improved consistency: the optimizer identified the optimal conditions every time it played the game, while many human participants concluded they had identified the best conditions before achieving quantitative yield and stopped optimization early.
Experimental design via Bayesian optimization. (a) Validation of Bayesian reaction optimization via direct comparison between human performance (left) and machine learning performance (right); optimization curves for individual players and optimizer runs (dashed) and average (solid) as a function of experiment batch (size: 5). (b) Optimization of a Mitsunobu reaction via EDBO: cumulative best observed yield (black) and individual experiment outcomes (gray) as a function of experiment batch (batch size: 10), yield for standard reaction conditions (red dashed). Adapted with permission from ref . Copyright 2021 Springer Nature.
To demonstrate the platform’s ability to optimize real-world reactions used in pharmaceutical development, Doyle and co-workers applied EDBO to a test case of the Mitsunobu reaction.? This reaction was selected because it is used frequently in synthesis, but tends to deliver moderate yields under standard conditions. Methyl 3-bromo-1H-indole-6-carboxylate and benzyl alcohol were chosen as substrates. These substrates afforded a moderate 60% yield of the desired product under the standard conditions used at Bristol Myers Squibb. Seven total categorical and continuous reaction parameters were selected to define the reaction space: the identity and equivalents of the azadicarboxylate reagent, the identity and equivalents of the phosphine reagent, the identity and concentration of the solvent, and the temperature. Chemical information about the reagents and solvent was encoded in the form of DFT-computed descriptors. With 6 azadicarboxylates, 12 phosphines, 5 equivalencies for each reagent, 5 solvents, 4 concentrations, and 5 temperatures, the full reaction space consists of 180,000 possible combinations.
With the search space in hand, EDBO was initialized with conditions chosen at random. Ten reactions were run in parallel per experiment batch. The optimizer identified three conditions that delivered the product in nearly quantitative yield (99%) in only four rounds, totaling 40 experiments (Figureb). EDBO’s ability to deliver a suite of distinct optimized conditions is advantageous, as it enables chemists to choose between several options based on additional factors such as cost and operational convenience.
In 2022, the Doyle group expanded the utility of EDBO with the release of EDBO+.? The upgraded platform accommodates multiobjective optimization and allows the user to modify the reaction space during the optimization campaign. These improvements adapt the system well to common use-cases in organic synthesis, where multiple objectives (e.g., yield, selectivity, cost) are often in play and condition space is routinely updated as the system is better understood. In addition to its availability as an open-source software package, EDBO+ can be used via a web-based application with a step-by-step graphical user interface designed for users who have little to no coding knowledge, which helps bridge the gap between data scientists and experimental chemists. Furthermore, the integration of EDBO+ as a decision-making tool with other data-driven technologies is already showing promise: the year after its release, EDBO+ proved effective for the optimization of a pyridinium salt synthesis via continuous flow with semiautomated low-resolution data processing,? which is gaining popularity for automated reaction development. ?,?
Summary and Outlook
Chemical lab research has been transformed by the availability of large volumes of digital data generated by high-throughput experimental facilities that are increasingly automated. These data offer unique opportunities to develop new approaches and algorithms to substantially accelerate the discovery process. A key step to advance lab research is to formulate lab tasks as mathematical questions, which is crucial to leveraging progress in machine learning algorithms and AI tools. As many chemical tasks involve identifying unknown relationships, a suitable predictive model can open doors for numerous applications, including accelerating experimental design, processing, and optimization of material properties. To bridge the knowledge gap between distinct areas, LLM agents can help chemical scientists select suitable predictive models, provide standard computer code, and assist computational experts in understanding domain knowledge for developing algorithms to facilitate the discovery process. Furthermore, the answers from LLM agents may inspire new ideas and facilitate the discovery process. Yet LLM agents may generate inaccurate responses and can fabricate or hallucinate information about nonexistent theorems or references, which may lead to unsafe experiments, such as providing access to synthesis information that poses security issues. Prompt engineering, including providing contexts and examples, breaking large research questions into smaller pieces, and integrating coscientists specializing in different domains, can guide LLMs to generate more accurate solutions.? Some of these strategies require not only domain knowledge, but also more understanding of data science. Thus, integration of more statistical thinking and machine learning concepts into the pedagogy of chemical science can assist chemists in better interacting with LLM agents and ensuring the correctness of LLM-derived solutions.
Overcoming several other common challenges can lead to fruitful outcomes in advancing lab research. First, many experimental characterization tools produce data that are, to varying degrees, closed-source, meaning that access to the data is restricted to an ecosystem supported only by the vendor. Recent efforts have been made to facilitate connections between closed-source vendor ecosystems and external software (e.g., LIMS, ELN, or analysis tools) by gaining access to application programming interfaces (APIs) directly from the vendors. For example, a software development kit in a common programming language (Python) was developed and released to consume the API for the HTE instrument, thereby providing greater access to system commands.? Efforts to convert proprietary data into standard formats and share them in an open-source repository can cultivate community efforts. The availability of a standard format of data has driven, for instance, the progress in LLMs and accurate protein structure prediction tools, such as Alphafold.? Furthermore, there is a vast need to develop standard software that can be easily plug-in into daily experimental tasks, including automating data processing, making reliable predictions of chemical relationships, generating interpretable analysis of experiments, and suggesting solutions for experimental challenges. These tools need to overcome several challenges, including the limited number of training samples in experiments, automating model training processes, enabling uncertainty assessment, and assimilation to integrate different types of data. On the other hand, a deeper understanding of the assumptions behind these tools enables chemists to better deploy them in suitable scenarios, identify the reasons when ML tools do not work well, and resolve problems more quickly when interacting with AI agents. Together, the joint efforts in experimental and computational fields can substantially accelerate the discovery process in chemical science.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Abolhasani M.Kumacheva E.The rise of self-driving labs in chemical and materials sciences Nat. Synth.2023248349210.1038/s 44160-022-00231-0 · doi ↗
- 2Tom G.Schmid S. P.Baird S. G.Cao Y.Darvish K.Hao H.Lo S.Pablo-García S.Rajaonson E. M.Skreta M.Self-driving laboratories for chemistry and materials science Chem. Rev.20241249633973210.1021/acs.chemrev.4c 0005539137296 PMC 11363023 · doi ↗ · pubmed ↗
- 3Vriza A.Chan H.Xu J.Self-driving laboratory for polymer electronics Chem. Mater.2023353046305610.1021/acs.chemmater.2c 03593 · doi ↗
- 4Wang C.Kim Y.-J.Vriza A.Batra R.Baskaran A.Shan N.Li N.Darancet P.Ward L.Liu Y.Chan M. K. Y.Sankaranarayanan S. K.R.S.Fry H. C.Miller C. S.Chan H.Xu J.Autonomous platform for solution processing of electronic polymers Nat. Commun.202516149810.1038/s 41467-024-55655-339962040 PMC 11833048 · doi ↗ · pubmed ↗
- 5Seifrid M.Pollice R.Aguilar-Granda A.Morgan Chan Z.Hotta K.Ser C. T.Vestfrid J.Wu T. C.Aspuru-Guzik A.Autonomous chemical experiments: Challenges and perspectives on establishing a self-driving lab Acc. Chem. Res.2022552454246610.1021/acs.accounts.2c 0022035948428 PMC 9454899 · doi ↗ · pubmed ↗
- 6Mac Leod B. P.Parlane F. G. L.Morrissey T. D.Hase F.Roch L. M.Dettelbach K. E.Moreira R.Yunker L. P. E.Rooney M. B.Deeth J. R.Lai V.Ng G. J.Situ H.Zhang R. H.Elliott M. S.Haley T. H.Dvorak D. J.Aspuru-Guzik A.Hein J. E.Berlinguette C. P.Self-driving laboratory for accelerated discovery of thin-film materials Sci. Adv.20206 eaaz 886710.1126/sciadv.aaz 886732426501 PMC 7220369 · doi ↗ · pubmed ↗
- 7Zhang W.Guy M. A.Yang J.Hao L.Liu J.Hawkins J. M.Mustakis J.Monfette S.Hein J. E.Leveraging GPT-4 to transform chemistry from paper to practice Digit. Discovery 202432367237610.1039/D 4DD 00248 B · doi ↗
- 8Wu T. C Aguilar-Granda A.Hotta K.Yazdani S. A.Pollice R.Vestfrid J.Hao H.Lavigne C.Seifrid M.Angello N.Bencheikh F.Hein J. E.Burke M.Adachi C.Aspuru-Guzik A.A materials acceleration platform for organic laser discovery Adv. Mater.202335220707010.1002/adma.20220707036373553 · doi ↗ · pubmed ↗
