A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking
Chengan Che, Chao Wang, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera

TL;DR
This paper introduces PL-Stitch, a self-supervised learning framework that leverages the temporal order of video frames using Plackett-Luce ranking to improve procedural activity recognition in videos.
Contribution
It proposes a novel self-supervised approach with two probabilistic objectives based on the Plackett-Luce model, aimed at improving the understanding of procedural workflows in videos.
Findings
Improves surgical phase recognition accuracy by 11.4 percentage points.
Improves cooking action segmentation accuracy by 5.7 percentage points.
Outperforms existing methods across five surgical and cooking benchmarks.
Abstract
Procedural activities, ranging from routine cooking to complex surgical operations, are highly structured sequences of actions performed in a specific temporal order. Despite the success of current self-supervised learning (SSL) methods on static images and short clips, these models often overlook the underlying sequential structure of such activities. We expose this lack of procedural awareness with a motivating experiment: models pretrained on forward and time-reversed sequences produce highly similar features, confirming that their representations are blind to the underlying procedural order. To address this shortcoming, we propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal. Our approach integrates two novel probabilistic objectives based on the Plackett-Luce (PL) model. The primary PL objective…
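As background for the Plackett-Luce (PL) model named in the abstract, the following is a minimal sketch of a generic PL ranking loss over frames, not the authors' implementation (the abstract is truncated before the objectives are described). Under the PL model, an ordering is built by repeatedly picking the next item with probability proportional to exp(score) among the items not yet picked. The function name plackett_luce_nll and the idea of one scalar score per frame are assumptions made for illustration only.

```python
import math


def plackett_luce_nll(scores_in_true_order):
    """Negative log-likelihood of an ordering under a Plackett-Luce model.

    scores_in_true_order[i] is a hypothetical scalar score assigned to the
    frame at temporal position i. The loss is small when earlier frames
    receive higher scores than later ones.
    """
    nll = 0.0
    for i in range(len(scores_in_true_order)):
        # Frames that have not been placed yet, including the current one.
        remaining = scores_in_true_order[i:]
        # Numerically stable log-sum-exp over the remaining scores.
        m = max(remaining)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in remaining))
        # -log( exp(s_i) / sum_{j >= i} exp(s_j) )
        nll += log_denominator - scores_in_true_order[i]
    return nll


# Scores consistent with the true order are penalised less than reversed ones.
forward_loss = plackett_luce_nll([3.0, 2.0, 1.0, 0.0])
reversed_loss = plackett_luce_nll([0.0, 1.0, 2.0, 3.0])
print(forward_loss < reversed_loss)
```

With all scores equal, every ordering of n frames is equally likely, so the loss reduces to log(n!). A loss of this shape is sensitive to frame order by construction, which is the property the motivating experiment finds missing in existing representations.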
Taxonomy
Topics: Surgical Simulation and Training · Human Pose and Action Recognition · Multimodal Machine Learning Applications
