# AVPENet: Pain estimation from audio-visual fusion of non-speech sounds

**Authors:** Sami Naouali, Oussama El Othmani

PMC · DOI: 10.1371/journal.pdig.0001301 · PLOS Digital Health · 2026-03-27

## TL;DR

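AVPENet is a deep learning system that estimates pain intensity in non-verbal patients, such as neonates and unconscious adults, by fusing non-speech audio with facial expressions. It reaches a mean absolute error of 0.89 on a 0–10 pain scale, compared with 1.47 for audio alone and 1.23 for video alone.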

## Contribution

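AVPENet introduces a cross-modal attention-based fusion network for continuous pain estimation. It pairs a ResNet-based audio encoder operating on mel-spectrograms of non-speech sounds with a facial landmark convolutional neural network, and integrates the two through a transformer-based fusion module.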

## Key findings

- AVPENet achieved a mean absolute error of 0.89 on a 0–10 pain scale, outperforming audio-only (1.47) and visual-only (1.23) approaches.
- In cross-age-group validation, the model generalized well, with mean absolute errors of 0.94 for neonates and 0.91 for adults.
- Predictions reached a Pearson correlation coefficient of 0.89 with ground truth annotations, and three-class pain categorization was 81.4% accurate.
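
The metrics in the list above can be illustrated with a minimal Python sketch. The function names and the toy score lists are assumptions made for illustration and are not data from the paper. The only reported values used are the mean absolute errors stated in the abstract (0.89 for fusion, 1.47 for audio-only, 1.23 for visual-only), which the last two lines use to reproduce the quoted relative improvements.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between annotated and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def relative_improvement(baseline_mae, model_mae):
    """Fractional reduction in error relative to a baseline."""
    return (baseline_mae - model_mae) / baseline_mae

# Toy scores on a 0-10 scale (illustrative only, NOT data from the paper).
annotated = [2.0, 5.0, 8.0, 3.0, 7.0]
predicted = [2.5, 4.0, 8.5, 3.0, 6.0]

toy_mae = mean_absolute_error(annotated, predicted)
toy_r = pearson_r(annotated, predicted)

# Reported MAEs from the abstract: fusion 0.89, audio-only 1.47, visual-only 1.23.
gain_vs_audio = relative_improvement(1.47, 0.89)    # about 0.39
gain_vs_visual = relative_improvement(1.23, 0.89)   # about 0.28
```

Rounded to whole percentages, the two gains match the 39% and 28% improvements quoted in the abstract.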

## Abstract

Pain assessment in non-verbal patients, including neonates and unconscious adults, remains a critical challenge in clinical practice. Current pain scales rely heavily on observer interpretation and may lack objectivity, introducing significant inter-rater variability. We propose a novel multimodal deep learning framework that estimates continuous pain intensity by fusing non-speech audio cues with facial expressions. Our approach addresses the critical need for objective pain assessment in vulnerable populations unable to self-report. We developed a cross-modal attention-based fusion network combining spectrogram-derived audio embeddings with facial action unit features. The model was trained and validated on 3,247 audio-visual recordings from 428 subjects, including 215 neonates and 213 adults, across three distinct pain intensity levels. We employed a ResNet-based audio encoder for mel-spectrogram processing and a facial landmark convolutional neural network for expression analysis, integrated through a transformer-based fusion module that learns complementary relationships between modalities. Our model achieved a mean absolute error of 0.89 on a 0–10 pain scale, significantly outperforming audio-only approaches (mean absolute error 1.47, 39% improvement) and visual-only baselines (mean absolute error 1.23, 28% improvement). Cross-age group validation demonstrated robust generalization with mean absolute errors of 0.94 for neonates and 0.91 for adults. The model maintained a Pearson correlation coefficient of 0.89 with ground truth annotations and achieved 81.4% accuracy for three-class pain categorization. Audio-visual fusion significantly enhances pain estimation accuracy across diverse age groups and clinical scenarios. This approach offers substantial potential for objective, automated pain monitoring in clinical settings, particularly for vulnerable populations unable to self-report pain.
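
The cross-modal attention idea described in the abstract can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single attention head with no learned projections, and the token dimensions, mean pooling, concatenation, and the names `cross_attention` and `fuse` are hypothetical choices. The paper's fusion module is transformer-based and will differ in these details.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries come from one modality (e.g. audio frames); keys and values
    come from the other (e.g. face frames). Returns one attended vector
    per query, plus the attention weights.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of value vectors, computed one dimension at a time.
        attended = [sum(wj * v[i] for wj, v in zip(w, values))
                    for i in range(len(values[0]))]
        outputs.append(attended)
        weights.append(w)
    return outputs, weights

def mean_pool(rows):
    """Average a list of vectors into a single vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fuse(audio_tokens, visual_tokens):
    """Attend in both directions, then pool and concatenate (assumed design)."""
    audio_to_visual, _ = cross_attention(audio_tokens, visual_tokens, visual_tokens)
    visual_to_audio, _ = cross_attention(visual_tokens, audio_tokens, audio_tokens)
    return mean_pool(audio_to_visual) + mean_pool(visual_to_audio)

# Toy embeddings (hypothetical): 3 audio tokens and 2 visual tokens, 4 dims each.
audio_tokens = [[0.1, 0.3, 0.0, 0.5], [0.4, 0.1, 0.2, 0.0], [0.0, 0.2, 0.6, 0.1]]
visual_tokens = [[0.2, 0.0, 0.1, 0.4], [0.5, 0.3, 0.0, 0.1]]

fused = fuse(audio_tokens, visual_tokens)
_, attn = cross_attention(audio_tokens, visual_tokens, visual_tokens)
```

In a full model, the fused vector would feed a regression head that outputs the continuous pain score. Each row of `attn` sums to 1 and shows how strongly an audio token draws on each visual token.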

Pain assessment in patients who cannot verbally communicate—including newborns, unconscious adults, and individuals with cognitive impairments—relies on subjective observer interpretation, introducing significant variability that can lead to inadequate treatment. We developed AVPENet, an artificial intelligence system that automatically estimates pain intensity by analyzing both non-speech vocalizations (cries, moans) and facial expressions simultaneously. Our system learns how these two information sources complement each other: for example, a cry accompanied by facial grimacing indicates stronger pain than ambiguous facial changes alone. We tested AVPENet on 3,247 recordings from 428 patients, including both newborns and adults, achieving accuracy approaching that of trained clinical observers while providing perfectly consistent assessments. The system maintained reliable performance even with background noise and facial coverings, suggesting practical feasibility for real hospital environments. Continuous automated monitoring could alert staff immediately when vulnerable patients experience pain, enabling faster intervention than current practice where nurses assess pain only periodically. While external validation and broader demographic testing are still needed, this work demonstrates that multimodal artificial intelligence can provide objective, consistent pain measurement for populations most dependent on our recognition of their suffering.

## Full-text entities

- **Diseases:** Pain (MESH:D010146)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13029810/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13029810/full.md

## References

66 references — full list in the complete paper: https://tomesphere.com/paper/PMC13029810/full.md

---
Source: https://tomesphere.com/paper/PMC13029810