# Machine Learning-Based Predictive Modeling of Infrared Spectroscopic Data from Thermal Conversion of Athabasca Bitumen

**Authors:** Noora Al Mansoori, Munawar Abdul Shaik, Kaushik Sivaramakrishnan

PMC · DOI: 10.1021/acsomega.5c04463 · ACS Omega · 2025-07-02

## TL;DR

This paper uses machine learning to predict the infrared (FTIR) spectra of products from the thermal cracking of Athabasca bitumen, aiming to replace slow physical measurements with fast predictions.

## Contribution

The paper's novel contribution is the use of gradient boosting regression with Bayesian optimization for accurate and efficient prediction of FTIR intensities in thermal bitumen conversion.
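To make the core idea concrete, here is a minimal sketch of gradient boosting for regression with squared-error loss: each round fits a small tree (here a depth-1 "stump") to the current residuals and adds a shrunken copy of it to the ensemble. This is a textbook illustration, not the authors' implementation. The function names, the single temperature feature, and the toy intensities are all assumptions made for the example, and the Bayesian hyperparameter search is not shown (`n_estimators` and `learning_rate` are simply fixed).

```python
def fit_stump(xs, residuals):
    """Find the single-threshold split on a 1-D feature that minimizes squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)


def fit_gbr(xs, ys, n_estimators=50, learning_rate=0.1):
    """Boost stumps on residuals, starting from the mean of the targets."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        preds = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, preds)
        ]
    return base, learning_rate, stumps


def predict_gbr(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


# Made-up intensities at the six temperatures (°C) named in the abstract.
temps_c = [25, 300, 350, 380, 400, 420]
intensities = [0.50, 0.48, 0.42, 0.35, 0.30, 0.22]
model = fit_gbr(temps_c, intensities)
print([round(predict_gbr(model, t), 3) for t in temps_c])
```

In practice a library implementation with deeper trees and several input features would be used; the tuned hyperparameters would typically include the number of trees, the learning rate, and the tree depth.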

## Key findings

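- Gradient boosting regression (GBR) achieved up to 99.66% prediction accuracy (reported as R²) in FTIR intensity modeling, in Scenario 1.
- GBR performed on par with random forest in Scenario 2 (both around 94%) and outperformed random forest and k-NN in Scenario 3 (92.15%); when tested on the highest temperatures in Scenario 4, it still reached 80.40%.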
- Bayesian optimization improved model performance by tuning hyperparameters effectively.
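
The two metrics behind these findings can be sketched in a few lines. This assumes the reported "prediction accuracy" percentages are R² × 100, which the abstract suggests by labelling predictive accuracy as R². The function names and the sample intensities below are illustrative and not taken from the paper.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the intensities."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


# Made-up intensities, for illustration only (not data from the paper).
measured = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.39]
print(f"R2 = {r2_score(measured, predicted):.4f}, "
      f"RMSE = {rmse(measured, predicted):.4f}")
```

An R² of 1 means the predictions reproduce the measurements exactly, while a model that always predicts the mean of the measurements scores 0.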

## Abstract

This study explores the use of machine learning (ML) techniques
to predict Fourier-transform infrared (FTIR) intensities of products
from the thermal cracking of Athabasca bitumen, aiming to develop
a reliable soft-sensor. The ultimate goal is to obtain the FTIR spectra
of the thermally cracked products online, avoiding the delay of
slow physical measurements. Various ML models, including linear regression
(LinR), partial least squares regression (PLSR), support vector regression
(SVR), k-nearest neighbors (k-NN), random forest
(RF), and gradient boosting regression (GBR), were implemented to
predict FTIR intensities accurately and efficiently, reducing the need
for slower physical measurements.
To assess the models’ generalization capabilities, the models were trained and tested across four
different scenarios with varying temperature data obtained from visbreaking
experiments performed on Athabasca bitumen at temperatures ranging
from 25 to 420 °C with reaction times ranging from 15 min to
27 h. Scenario 1 included all 61,740 data points utilizing an 80/20
train-test split with 10-fold cross-validation (CV). Scenario 2 involved
training on temperatures of 25, 350, and 400 °C and testing on
300, 380, and 420 °C. Scenario 3 involved training on temperatures
of 350, 380, and 400 °C and testing on 25, 300, and 420 °C.
Finally, Scenario 4 involved training on temperatures of 25, 300,
350, and 380 °C and testing on 400 and 420 °C. Bayesian
optimization was employed for hyperparameter tuning to identify the
optimal configurations for each model. The results indicate that ensemble
methods, particularly GBR, consistently achieved the highest predictive
accuracy (R²) and lowest root mean squared
error (RMSE) across all scenarios. In Scenario 1, GBR achieved a prediction
accuracy of 99.66%. Scenario 2 highlighted the models’ ability
to generalize across varying temperatures, with both RF and GBR achieving
similar performance with high prediction accuracies of around 94%.
Scenario 3, characterized by significant temperature variability,
demonstrated the robustness of GBR, which outperformed RF and k-NN with a predictive accuracy of 92.15%. Scenario 4, focusing
on high-temperature predictions from lower-temperature training data,
showed that GBR still performed robustly with a predictive accuracy
of 80.40%. The study concludes that GBR models, particularly those
with well-tuned hyperparameters, are highly effective in predicting
FTIR intensities, outperforming other techniques like RF, k-NN, LinR, and PLSR. The integration of advanced ML techniques
and Bayesian optimization significantly enhances the capability to
predict FTIR spectra, providing a reliable soft-sensor as an alternative
to traditional physical experimentation methods. This approach not
only saves time and resources but also ensures consistent and high-quality
predictive performance in chemical analysis and monitoring.
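
Scenarios 2 to 4 hold out whole temperatures rather than random rows, so the test set contains conditions the model never saw during training. A minimal sketch of that split is shown below. The temperature sets come from the abstract; the row schema (`temp_c`, `time_min`, `wavenumber`, `intensity`) and the function name are hypothetical, since this summary does not describe the dataset layout. Scenario 1, a random 80/20 split with 10-fold cross-validation, is not shown.

```python
# Train/test temperatures (°C) per scenario, as listed in the abstract.
SCENARIOS = {
    2: {"train": {25, 350, 400}, "test": {300, 380, 420}},
    3: {"train": {350, 380, 400}, "test": {25, 300, 420}},
    4: {"train": {25, 300, 350, 380}, "test": {400, 420}},
}


def split_by_temperature(rows, scenario):
    """Return (train_rows, test_rows) for a temperature-holdout scenario.

    Each row is a dict with a 'temp_c' key (hypothetical schema).
    """
    temps = SCENARIOS[scenario]
    train = [row for row in rows if row["temp_c"] in temps["train"]]
    test = [row for row in rows if row["temp_c"] in temps["test"]]
    return train, test


# Toy rows, one per temperature; all other values are placeholders.
rows = [
    {"temp_c": t, "time_min": 15, "wavenumber": 2920, "intensity": 0.0}
    for t in (25, 300, 350, 380, 400, 420)
]

for scenario in SCENARIOS:
    train, test = split_by_temperature(rows, scenario)
    print(
        f"Scenario {scenario}:",
        "train", sorted(row["temp_c"] for row in train),
        "test", sorted(row["temp_c"] for row in test),
    )
```

Splitting this way makes Scenario 4 an extrapolation test, because both test temperatures lie above every training temperature, which is consistent with it having the lowest reported accuracy.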

## Full-text entities

- **Genes:** CDK2 (cyclin dependent kinase 2) [NCBI Gene 1017] {aka CDKN2, p33(CDK2)}
- **Diseases:** ML (MESH:D007859), tumorigenesis (MESH:D063646), cancer (MESH:D009369)
- **Chemicals:** polymer (MESH:D011108), DAO (-), bitumen (MESH:C006647), hydrocarbon (MESH:D006838), olefin (MESH:D000475), N (MESH:D009584), glycerol (MESH:D005990), asphaltenes (MESH:C000592077), indene (MESH:C093581)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12268744/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12268744/full.md

## References

56 references — full list in the complete paper: https://tomesphere.com/paper/PMC12268744/full.md

---
Source: https://tomesphere.com/paper/PMC12268744