InstrAct: Towards Action-Centric Understanding in Instructional Videos

Zhuoyi Yang, Jiapeng Yu, Reuben Tan, Boyang Li, and Huijuan Xu

arXiv:2604.08762 · cs.CV · April 13, 2026

TL;DR

InstrAct is a pretraining framework for instructional videos that learns action-centric representations to improve fine-grained action understanding and temporal modeling.

Contribution

The paper combines noisy-caption filtering with action-centric hard negatives, an Action Perceiver that extracts motion-relevant tokens, and two auxiliary objectives (DTW-Align and MAM) to push video representations toward actions rather than objects.

Findings

01. Outperforms state-of-the-art Video Foundation Models (VFMs) on semantic reasoning tasks.

02. Improves accuracy on procedural logic and fine-grained retrieval.

03. Improves temporal structure modeling through the DTW-Align and MAM objectives.
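To make the temporal-structure finding concrete, here is a minimal sketch of classic Dynamic Time Warping, the algorithm that DTW-Align is named after. It is not the paper's objective: this page does not describe how DTW-Align is formulated or trained, so the function names (`dtw_cost`, `pairwise_cost`, `cosine_distance`), the cosine distance, and the toy 2-D embeddings are all assumptions for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_cost(clips, steps):
    """Cost matrix: rows are video clips, columns are text steps (hypothetical inputs)."""
    return [[cosine_distance(c, s) for s in steps] for c in clips]


def dtw_cost(cost):
    """Minimum cumulative cost of a monotonic alignment through the cost matrix."""
    n, m = len(cost), len(cost[0])
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three allowed predecessor cells.
            best_prev = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev
    return acc[n][m]


# Toy example: three clips, two steps, with made-up 2-D embeddings.
clips = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
ordered_steps = [[1.0, 0.0], [0.0, 1.0]]
shuffled_steps = [[0.0, 1.0], [1.0, 0.0]]

in_order = dtw_cost(pairwise_cost(clips, ordered_steps))
out_of_order = dtw_cost(pairwise_cost(clips, shuffled_steps))
print(in_order, out_of_order)
```

In this toy case the correctly ordered steps align at lower cost than the shuffled ones, which is the kind of signal a sequence-level alignment objective can exploit. A training objective would likely need a differentiable relaxation of the `min`; that is an assumption about practice in general, not something this page states about the paper.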

Abstract

Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive "static bias", in which models rely on objects rather than motion cues. To address this, we propose InstrAct, a pretraining framework for learning action-centric representations of instructional videos. We first introduce a data-driven strategy that filters noisy captions and generates action-centric hard negatives to disentangle actions from objects during contrastive learning. At the visual feature level, an Action Perceiver extracts motion-relevant tokens from redundant video encodings. Beyond contrastive learning, we introduce two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align) for modeling sequential temporal structure, and…
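The role of action-centric hard negatives in contrastive learning can be illustrated with a small sketch. It uses a standard InfoNCE-style loss as a stand-in, since the abstract does not give the exact loss. The function name `info_nce`, the temperature value, the example captions, and the toy unit-length embeddings are assumptions, not details from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(video, positive, negatives, temperature=0.07):
    """InfoNCE-style loss for one video against one positive and several negative captions.

    All embeddings are assumed to be unit-normalized, so dot products are cosine similarities.
    """
    logits = [dot(video, positive) / temperature]
    logits += [dot(video, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Hypothetical embeddings for a clip of someone cutting an onion.
video = [1.0, 0.0]
caption_cut_onion = [0.8, 0.6]     # positive: correct action and object
caption_peel_onion = [0.6, 0.8]    # hard negative: same object, different action
caption_unrelated = [-1.0, 0.0]    # easy negative: unrelated caption

loss_hard = info_nce(video, caption_cut_onion, [caption_peel_onion])
loss_easy = info_nce(video, caption_cut_onion, [caption_unrelated])
print(loss_hard, loss_easy)
```

With these toy numbers the hard negative produces a larger loss than the easy one. A caption that shares the object but swaps the action stays close to the video embedding, so it keeps contributing a training signal until the model separates the two actions.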
