# Vision-based multimodal energy expenditure estimation for aerobic exercise in adults

**Authors:** Lei Jin, Shengxuming Zhang, Mingyao Shi, Long Yu, Mengyao Wang, Mingli Song, Xu Wen

PMC · DOI: 10.3389/fphys.2025.1666616 · Frontiers in Physiology · 2025-10-08

## TL;DR

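This paper introduces E3SFormer, a vision-based Transformer method for estimating energy expenditure during aerobic exercise. With heart rate and physical attributes added as input, it reaches a 15.32% mean relative error, compared with 18.10% for a smartwatch.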

## Contribution

The authors present E3SFormer as the first Transformer-based model for contactless energy expenditure estimation. It works from skeleton data and can additionally take heart rate and physical attributes as input.

## Key findings

- E3SFormer achieved 28.81% mean relative error with skeleton input alone.
- Adding heart rate and physical attributes reduced the error to 15.32%, outperforming the tested smartwatch (18.10% MRE) and the other compared methods.
- The model shows promise for contactless, multi-modal physiological analysis during exercise.

## Abstract

Estimating energy expenditure (EE) accurately and conveniently has always been a concern in sports science. Inspired by the success of Transformers in computer vision (CV), this paper proposes a Transformer-based method to advance contactless, vision-based EE estimation.

We collected 16,526 video clips from 36 participants performing 6 common aerobic exercises, labeled with continuous calorie readings from a COSMED K5. We then designed the Energy Expenditure Estimation Skeleton Transformer (E3SFormer) for EE estimation, which features dual Transformer branches for simultaneous action recognition (AR) and EE regression. Comprehensive experiments compared the EE estimation performance of our method with existing skeleton-based AR models, the traditional heart rate (HR) formula, and a smartwatch.
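The abstract does not say how the two branches are trained together, so the following is only a sketch of one common way to supervise simultaneous classification and regression: a weighted sum of a cross-entropy term for the AR branch and an absolute-error term for the EE branch. The function names, the choice of loss terms, the `ee_weight` parameter, and all numeric values are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of a softmax over `logits` against one target class."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_index]


def joint_loss(action_logits, action_label, ee_prediction, ee_target, ee_weight=1.0):
    """Hypothetical multi-task loss: AR cross-entropy plus weighted EE absolute error."""
    ar_loss = softmax_cross_entropy(action_logits, action_label)
    ee_loss = abs(ee_prediction - ee_target)
    return ar_loss + ee_weight * ee_loss


# Made-up outputs for one clip: six logits (one per exercise class) from the
# AR branch, and one EE value from the regression branch.
example_logits = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
example_loss = joint_loss(example_logits, action_label=2,
                          ee_prediction=7.5, ee_target=8.0, ee_weight=0.5)
print(f"joint loss: {example_loss:.4f}")
```

With uniform logits the AR term equals ln(6), so the example isolates how the assumed `ee_weight` scales the regression term against the classification term.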

With pure skeleton input, our model yielded a 28.81% mean relative error (MRE), surpassing all comparative models. With each participant's heart rate and physical attributes added as multi-modal input, our model achieved a 15.32% MRE, substantially better than the other models. In comparison, the smartwatch showed an 18.10% MRE.
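The abstract reports MRE without defining it. A minimal sketch, assuming the usual definition (the mean of absolute errors divided by the reference values, expressed as a percentage), is shown below. The function name and the sample values are hypothetical and are not data from the study.

```python
def mean_relative_error(predicted, measured):
    """Mean of |predicted - measured| / |measured|, as a percentage.

    Assumes the common MRE definition; the paper's exact formula may differ.
    """
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if not measured:
        raise ValueError("at least one sample is required")
    if any(m == 0 for m in measured):
        raise ValueError("measured values must be non-zero")
    relative_errors = [abs(p - m) / abs(m) for p, m in zip(predicted, measured)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Made-up reference readings and model estimates (e.g. kcal per clip).
measured_kcal = [10.0, 20.0, 40.0]
predicted_kcal = [9.0, 23.0, 40.0]
mre = mean_relative_error(predicted_kcal, measured_kcal)
print(f"MRE: {mre:.2f}%")
```

Because each error is divided by its own reference value, clips with low measured EE weigh as much as clips with high measured EE, which matters when comparing exercises of different intensity.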

Extensive experimentation validates the effectiveness of E3SFormer, and we hope it inspires further research into contactless EE measurement. This study is the first attempt to estimate EE using a Transformer, and it can promote contactless, multi-modal physiological analysis of aerobic exercise.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12540514/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12540514/full.md

## References

47 references — full list in the complete paper: https://tomesphere.com/paper/PMC12540514/full.md

---
Source: https://tomesphere.com/paper/PMC12540514