# Health state prediction with reinforcement learning for predictive maintenance

**Authors:** Anastasis Aglogallos, Alexandros Bousdekis, Stefanos Kontos, Gregoris Mentzas

PMC · DOI: 10.3389/frai.2025.1720140 · Frontiers in Artificial Intelligence · 2026-01-12

## TL;DR

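This paper applies four model-free reinforcement learning algorithms (PPO, A2C, DDPG, and SAC) to health state prediction for predictive maintenance in manufacturing. On CNC tool wear data, PPO and SAC are the most stable and efficient: SAC does best in structured environments, PPO generalizes most robustly, and DDPG underperforms.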

## Contribution

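The novelty lies in applying four model-free RL algorithms (PPO, A2C, DDPG, and SAC) to health state prediction, formulated as a Markov Decision Process, and comparing their learning dynamics, convergence behavior, and generalization across four custom-made environments: corrective and non-corrective, each with and without a delay correction mechanism.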

## Key findings

- PPO and SAC show stable and efficient performance in predictive maintenance tasks.
- SAC excels in structured environments, while PPO generalizes well.
- A2C shows consistent long-term learning.
- DDPG underperforms due to insufficient exploration.

## Abstract

Predictive maintenance has emerged as a critical strategy in modern manufacturing in the context of Industry 4.0, enabling proactive intervention before equipment failure. However, traditional machine learning approaches require extensive labeled data and lack adaptability to evolving operational conditions. In contrast, Reinforcement Learning (RL) enables agents to learn optimal policies through interaction with the environment, eliminating the need for labeled datasets and naturally capturing the sequential, uncertain dynamics of equipment degradation.

In this paper, we propose an approach that incorporates four model-free RL algorithms, namely Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG), and Soft Actor-Critic (SAC). We formulate the problem as a Markov Decision Process (MDP), which is solved with the aforementioned RL algorithms.
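To make the MDP framing concrete, the following is a minimal sketch of how health state prediction could be posed as an episodic decision problem. The abstract does not give the paper's state, action, or reward definitions, so everything here is an assumption made for illustration: the `ToyWearEnv` class, the wear thresholds, the noise level, and the +1/-1 reward are hypothetical, and a simple threshold rule stands in for a trained PPO, A2C, DDPG, or SAC agent.

```python
import random

# All names, thresholds, and reward values are hypothetical illustrations;
# none of them are taken from the paper.
WEAR_LIMITS = (0.4, 0.8)  # boundaries between healthy / worn / critical


def health_state(wear: float) -> int:
    """Map a wear level in [0, 1] to a discrete health state 0, 1, or 2."""
    return sum(wear >= limit for limit in WEAR_LIMITS)


class ToyWearEnv:
    """Toy episodic MDP in which a tool wears a little on every step."""

    def __init__(self, horizon: int = 50, noise: float = 0.02, seed: int = 0):
        self.horizon = horizon
        self.noise = noise
        self.rng = random.Random(seed)
        self.wear = 0.0
        self.t = 0

    def _observe(self) -> float:
        # The agent sees a noisy sensor reading, not the true wear.
        return self.wear + self.rng.gauss(0.0, self.noise)

    def reset(self) -> float:
        self.wear, self.t = 0.0, 0
        return self._observe()

    def step(self, action: int):
        """The action is the predicted health state for the current step."""
        reward = 1.0 if action == health_state(self.wear) else -1.0
        # Stochastic transition: wear grows by a random non-negative increment.
        increment = self.rng.uniform(0.0, 2.0 / self.horizon)
        self.wear = min(1.0, self.wear + increment)
        self.t += 1
        done = self.t >= self.horizon
        return self._observe(), reward, done


def rollout(env: ToyWearEnv, policy) -> float:
    """Run one episode and return the undiscounted sum of rewards."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


def threshold_policy(obs: float) -> int:
    """Hand-written baseline: classify the noisy reading directly."""
    return health_state(obs)


if __name__ == "__main__":
    print("threshold policy return:", rollout(ToyWearEnv(seed=0), threshold_policy))
    print("always-healthy return:  ", rollout(ToyWearEnv(seed=0), lambda obs: 0))
```

In an actual RL setup the hand-written policy would be replaced by a learned one, and the environment would be driven by recorded sensor data rather than a synthetic wear process.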

The proposed approach is validated in the context of CNC machine tool wear prediction, using sensor data from the 2010 PHM Society Data Challenge. We examine algorithmic performance across four custom-made environments (corrective and non-corrective, each with and without a delay correction mechanism) to compare learning dynamics, convergence behavior, and generalization. Our results reveal that PPO and SAC achieve the most stable and efficient performance, with SAC excelling in structured environments and PPO demonstrating robust generalization. A2C shows consistent long-term learning, while DDPG underperforms due to insufficient exploration.

The findings highlight the potential of RL for predictive maintenance applications and underscore the importance of aligning algorithm design with environment characteristics and reward structures.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12833388/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12833388/full.md

## References

54 references — full list in the complete paper: https://tomesphere.com/paper/PMC12833388/full.md

---
Source: https://tomesphere.com/paper/PMC12833388