What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

Sunny Panchal; Apratim Bhattacharyya; Guillaume Berger; Antoine Mercier; Cornelius Bohm; Florian Dietrichkeit; Reza Pourreza; Xuanlin Li; Pulkit Madan; Mingu Lee; Mark Todorovich; Ingo Bax; Roland Memisevic

arXiv:2407.08101 · cs.CV · April 16, 2026

TL;DR

This paper introduces a new benchmark and dataset for real-time, asynchronous human-AI interaction in fitness coaching, highlighting current model limitations and proposing a streaming baseline for timely feedback.

Contribution

The work presents the QEVD benchmark and dataset for live fitness coaching, and proposes a simple streaming baseline to improve asynchronous, situated AI interactions.

Findings

01. Existing vision-language models struggle with real-time, asynchronous feedback.
02. The QEVD benchmark enables evaluation of human-AI interaction in fitness scenarios.
03. A streaming baseline improves response timing for feedback.

Abstract

Vision-language models have shown impressive progress in recent years. However, existing models are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the user. Open-ended, asynchronous interactions, where an AI model may proactively deliver timely responses or feedback based on the unfolding situation in real time, remain an open challenge. In this work, we present the QEVD benchmark and dataset, which explores human-AI interaction in the challenging, yet controlled, real-world domain of fitness coaching, a task that intrinsically requires monitoring live user activity and providing immediate feedback. The benchmark requires vision-language models to recognize complex human actions, identify possible mistakes, and provide appropriate feedback in real time. Our experiments reveal the limitations of existing state-of-the-art vision-language…

Code & Models

Datasets

Voxel51/qualcomm-exercise-video-dataset-benchmark
dataset · 2.5k downloads

Videos

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction · SlidesLive