# Forecasting cashew production in India using a hybrid machine learning framework with STL decomposition, ensemble methods, and global trade network analysis

**Authors:** Shinyclimensa C, Parthiban A

PMC · DOI: 10.1038/s41598-025-29254-1 · 2025-12-17

## TL;DR

This paper develops a machine learning framework to forecast India's cashew production and analyze the global trade network for its cashew nut shell liquid (CNSL) exports, offering insights for policymakers and agri-business strategists.

## Contribution

The paper introduces a hybrid machine learning framework built on rolling STL decomposition, together with a new Source-Importer Ratio metric for trade network analysis.
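
The leakage-free idea behind the rolling decomposition can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: a trailing moving average stands in for STL, and the function names, window length, and toy series are all hypothetical. It shows only the property the abstract describes, namely that the features for each forecast origin are computed from observations before that origin.

```python
def trailing_trend(history, window=3):
    """Mean of the last `window` observations (a simple stand-in for an STL trend)."""
    tail = history[-window:]
    return sum(tail) / len(tail)


def rolling_decompose(series, min_history=3, window=3):
    """Re-run the decomposition at every forecast origin using past data only."""
    rows = []
    for origin in range(min_history, len(series)):
        history = series[:origin]  # strictly before the origin: no look-ahead
        trend = trailing_trend(history, window)
        rows.append({
            "origin": origin,
            "trend": trend,
            "remainder": history[-1] - trend,
            "lag1": history[-1],
            "target": series[origin],
        })
    return rows


# Made-up annual values, for illustration only (not data from the paper).
toy_series = [100.0, 104.0, 103.0, 110.0, 115.0, 113.0, 120.0, 126.0]
rows = rolling_decompose(toy_series)

# Leakage check: overwriting values from index 5 onward must not change
# the features of any forecast origin at or before index 5.
tampered = toy_series[:5] + [999.0, 999.0, 999.0]
tampered_rows = rolling_decompose(tampered)
FEATURES = ("trend", "remainder", "lag1")
leak_free = all(
    a[k] == b[k]
    for a, b in zip(rows, tampered_rows)
    if a["origin"] <= 5
    for k in FEATURES
)
print(leak_free)
```

If statsmodels is available, the moving average could plausibly be swapped for a fit of `statsmodels.tsa.seasonal.STL` on `history` inside the loop. It is the loop structure, rather than the choice of decomposition, that keeps future values out of the features.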

## Key findings

- Gradient Boosting outperformed other models in forecasting cashew production with an R² of 0.988 and MAPE of 3.6%.
- India's CNSL trade network exhibits a star-like topology with India as the dominant hub and significant disparities in trade influence.
- The framework provides actionable insights for strengthening supply chain resilience and export diversification.
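
For readers unfamiliar with the reported scores, the sketch below shows the standard definitions of MAE, RMSE, MAPE, and R², plus one possible way to form three expanding-window folds over 1999–2020. The fold boundaries (four test years per fold) and all function names are assumptions made for illustration; the summary does not state how the authors placed their folds.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes no zero targets)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def expanding_window_splits(n_samples, n_folds=3, test_size=4):
    """Each fold trains on everything before its test block, so the
    training window grows while the test block moves forward in time."""
    splits = []
    for k in range(n_folds):
        test_end = n_samples - (n_folds - 1 - k) * test_size
        test_start = test_end - test_size
        splits.append((list(range(test_start)), list(range(test_start, test_end))))
    return splits


years = list(range(1999, 2021))  # 22 annual observations
for train_idx, test_idx in expanding_window_splits(len(years)):
    print(
        f"train {years[train_idx[0]]}-{years[train_idx[-1]]}, "
        f"test {years[test_idx[0]]}-{years[test_idx[-1]]}"
    )
```

In such a scheme, each metric would be computed on every test block and then summarized across folds, which is one way to arrive at a mean with a spread like the ± values quoted in the abstract.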

## Abstract

This study presents a comprehensive analytical framework to examine and forecast the dynamics of India’s cashew production and cashew nut shell liquid (CNSL) exports. The analysis comprises two integrated components: a machine learning-based production forecasting system and a network topology analysis of India’s global CNSL trade relationships. For production forecasting, we develop a hybrid pipeline that integrates rolling Seasonal-Trend Decomposition using Loess (STL) with ensemble machine learning methods, specifically Random Forest and Gradient Boosting Machines, benchmarked against regularized linear models (Ridge and ElasticNet). To prevent data leakage, we implement a novel rolling STL decomposition approach that performs signal decomposition iteratively using only historical data available at each forecast origin. The methodology incorporates robust data preprocessing steps such as missing value imputation and normalization, along with temporal feature engineering involving lagged values, moving averages, rolling statistics, and year-on-year growth rates. To ensure reliable performance evaluation, we adopt an expanding window cross-validation strategy tailored for time series data across three temporal folds spanning 1999–2020. Among the models evaluated, Gradient Boosting demonstrates superior performance with an R² of 0.988 (± 0.016), MAPE of 3.6% (± 2.3%), and RMSE of 45.8 MT (± 35.2), achieving 72% lower MAE compared to Ridge regression and outperforming Random Forest by 72% in mean absolute error. In the second component, we construct India’s global CNSL trade network spanning 1999–2020 and apply five centrality measures (Degree, Closeness, Betweenness, Eigenvector, and PageRank) to characterize its structure and identify key trading nodes. To further assess concentration and dependency, we introduce a novel Source-Importer Ratio metric, revealing pronounced disparities in trade influence, with differences of over 50-fold in degree centrality and 43-fold in PageRank across countries. The network analysis identifies India as the dominant hub with maximal degree centrality (1.0) and PageRank (0.461), while all importer countries exhibit uniformly low centrality scores (0.0196), confirming a star-like network topology with 52 nodes and 51 edges. By combining high-accuracy forecasting with network-driven diagnostics, this integrated approach provides a decision-support framework tailored to the needs of policymakers, exporters, and agri-business strategists. The study concludes with policy suggestions aimed at strengthening supply chain resilience, mitigating trade risks, and promoting export diversification. All code, data, and trained models are made publicly available to support reproducibility and adaptation to other perennial crop systems. Future work will extend the framework by integrating exogenous drivers such as climatic indicators and global price trends, and by updating the network analysis with post-2020 data to capture pandemic-induced structural changes in global trade patterns.
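
The centrality figures quoted for the star-like network can be checked with a small pure-Python sketch. It rests on assumptions that the summary does not confirm: the graph is treated as undirected and unweighted, PageRank uses a damping factor of 0.85, and the hub-to-importer ratio is computed as the hub's score divided by a single importer's score. The abstract does not define the Source-Importer Ratio, so that last reading is a guess that happens to match the reported 50-fold and 43-fold gaps. Node 0 plays the role of India; all names are hypothetical.

```python
def star_edges(n_leaves):
    """Hub is node 0; leaves are nodes 1..n_leaves."""
    return [(0, leaf) for leaf in range(1, n_leaves + 1)]


def degree_centrality(n_nodes, edges):
    """Degree divided by the maximum possible degree (n_nodes - 1)."""
    degree = [0] * n_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return [d / (n_nodes - 1) for d in degree]


def pagerank(n_nodes, edges, damping=0.85, iters=200):
    """Power-iteration PageRank on an undirected graph (assumed damping 0.85)."""
    neighbors = [[] for _ in range(n_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    rank = [1.0 / n_nodes] * n_nodes
    for _ in range(iters):
        new_rank = [(1.0 - damping) / n_nodes] * n_nodes
        for u in range(n_nodes):
            # Each node splits its damped rank evenly among its neighbors.
            share = damping * rank[u] / len(neighbors[u])
            for v in neighbors[u]:
                new_rank[v] += share
        rank = new_rank
    return rank


N_IMPORTERS = 51  # 52 nodes and 51 edges, as stated in the abstract
edges = star_edges(N_IMPORTERS)
n_nodes = N_IMPORTERS + 1

dc = degree_centrality(n_nodes, edges)
pr = pagerank(n_nodes, edges)

# Assumed reading of the ratio: hub score over one importer's score.
degree_ratio = dc[0] / dc[1]
pagerank_ratio = pr[0] / pr[1]

print(f"degree centrality: hub={dc[0]:.4f}, importer={dc[1]:.4f}")
print(f"PageRank: hub={pr[0]:.3f}, importer={pr[1]:.4f}")
print(f"ratios: degree={degree_ratio:.1f}, PageRank={pagerank_ratio:.1f}")
```

Under these assumptions the hub's degree centrality is 1.0, each importer's is 1/51 (about 0.0196), and the hub's PageRank comes out close to 0.461, in line with the values reported above.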

## Full-text entities

- **Chemicals:** CNSL (-)

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12764762/full.md

---
Source: https://tomesphere.com/paper/PMC12764762