Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching

Arushi Rai; Adriana Kovashka

arXiv:2603.18453 · cs.CV · March 20, 2026


Open Access

TL;DR

This paper improves temporal grounding in sports coaching videos by enforcing self-consistency between the visual attention maps of related tasks, yielding accuracy gains of up to 14.1% over supervised finetuning without extra annotations.

Contribution

The paper proposes a self-consistency training approach that uses related tasks, such as generation and verification, to improve temporal grounding in Video-LLMs without additional supervision.

Findings

01

Attention misallocation is a key bottleneck in temporal grounding.

02

Self-consistency training improves accuracy by up to 14.1% over supervised finetuning.

03

The method surpasses some closed-source models on sports coaching tasks.
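
One way to make the first finding concrete is to measure how much of a model's visual attention falls on annotated keyframes. The sketch below is illustrative only: the function name `keyframe_attention_mass`, the per-frame attention list, and the keyframe indices are all hypothetical, and the paper may quantify misallocation differently.

```python
def keyframe_attention_mass(frame_attention, keyframes):
    """Fraction of total attention that lands on annotated keyframes.

    frame_attention: non-negative attention weight per frame (assumed to be
        already pooled over layers, heads, and visual tokens).
    keyframes: indices of frames annotated as relevant.
    """
    total = sum(frame_attention)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    # Use a set so a keyframe index listed twice is only counted once.
    on_keyframes = sum(frame_attention[i] for i in set(keyframes))
    return on_keyframes / total


# Hypothetical 8-frame clip in which frames 2 and 3 are the keyframes.
attention = [0.30, 0.25, 0.05, 0.05, 0.10, 0.10, 0.10, 0.05]
mass = keyframe_attention_mass(attention, keyframes=[2, 3])
print(f"attention mass on keyframes: {mass:.2f}")
```

A low value on clips where the keyframes carry the answer would indicate that attention is going to irrelevant frames.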

Abstract

Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF,…
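
The self-consistency idea in the abstract can be sketched as a penalty on the disagreement between two per-frame attention distributions, one from each related task. The abstract does not state the exact form of the objective or which attention maps are selected, so everything below is an assumption: the Jensen-Shannon divergence is swapped in as one plausible symmetric disagreement measure, and `normalize`, `kl_divergence`, and `consistency_loss` are hypothetical helper names.

```python
import math


def normalize(weights, eps=1e-8):
    """Turn non-negative per-frame weights into a probability distribution.

    eps smoothing keeps every entry strictly positive so logs are defined.
    """
    smoothed = [w + eps for w in weights]
    total = sum(smoothed)
    return [w / total for w in smoothed]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def consistency_loss(gen_attention, ver_attention):
    """Jensen-Shannon divergence between two tasks' frame attention.

    ASSUMPTION: this is an illustrative stand-in for the paper's objective,
    not the authors' formulation. It is 0 when both tasks attend to the
    same frames and approaches ln(2) when they attend to disjoint frames.
    """
    if len(gen_attention) != len(ver_attention):
        raise ValueError("both tasks must cover the same frames")
    p = normalize(gen_attention)
    q = normalize(ver_attention)
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical attention over 4 frames for generation and verification.
generation = [0.1, 0.6, 0.2, 0.1]
verification = [0.4, 0.2, 0.2, 0.2]
print(f"consistency loss: {consistency_loss(generation, verification):.4f}")
```

In a training setup, a term like this would presumably be added to the supervised finetuning loss with a weighting coefficient; that weighting, and the choice of layers and heads, are left unspecified here.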


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)