Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport

Syed Ahmed Mahmood; Ali Shah Ali; Umer Ahmed; Fawad Javed Fateh; M. Zeeshan Zia; Quoc-Huy Tran

arXiv:2507.15540 · cs.CV · November 13, 2025

Open Access

TL;DR

This paper introduces a self-supervised procedure learning method that uses regularized Gromov-Wasserstein optimal transport with contrastive regularization to accurately discover key steps and their order from unlabeled videos, overcoming issues of order variation and redundant frames.

Contribution

It proposes a novel framework combining Gromov-Wasserstein optimal transport with a structural prior and contrastive regularization for improved procedure learning from videos.

Findings

01 Outperforms prior methods on egocentric and third-person benchmarks.

02 Effectively handles order variations and redundant frames.

03 Demonstrates superior accuracy in key step discovery.

Abstract

We study self-supervised procedure learning, which discovers key steps and their order from a set of unlabeled videos. Previous methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised framework, which utilizes a fused Gromov-Wasserstein optimal transport with a structural prior for frame-to-frame mapping. However, optimizing only for the above temporal alignment may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and thus every video is assigned to just one key step. To address that issue, we integrate a contrastive regularization, which maps different frames to various points, avoiding trivial solutions.…
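As a rough illustration of the alignment term described in the abstract, the sketch below evaluates a generic fused Gromov-Wasserstein objective for a given frame-to-frame coupling. It is not taken from the paper: the function names, the `alpha` trade-off weight, the squared-loss form, and the use of normalized temporal distance as the structure matrix are all assumptions made for illustration, and the paper's structural prior and contrastive regularizer are not modeled here.

```python
import numpy as np


def fgw_objective(M, C1, C2, T, alpha=0.5):
    """Generic fused Gromov-Wasserstein cost of a coupling T (illustrative).

    M  : (n, m) feature cost between frames of video 1 and video 2
    C1 : (n, n) intra-video structure matrix for video 1
    C2 : (m, m) intra-video structure matrix for video 2
    T  : (n, m) coupling (soft frame-to-frame assignment)
    alpha : assumed trade-off between feature and structure terms
    """
    # Feature (Wasserstein) term: cost of matching frame i to frame j.
    feature_term = np.sum(M * T)
    # Structure (Gromov-Wasserstein) term with squared loss:
    # sum over i, j, k, l of (C1[i, k] - C2[j, l])^2 * T[i, j] * T[k, l].
    # The dense (n, m, n, m) tensor is only practical for short clips.
    L = (C1[:, None, :, None] - C2[None, :, None, :]) ** 2
    structure_term = np.einsum("ijkl,ij,kl->", L, T, T)
    return (1.0 - alpha) * feature_term + alpha * structure_term


def temporal_structure(n):
    """Assumed structure matrix: normalized temporal distance between frames."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) / n


# Toy setup: align a video with itself (hypothetical random embeddings).
n = 5
rng = np.random.default_rng(0)
X = rng.normal(size=(n, 8))
M = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
C = temporal_structure(n)

T_identity = np.eye(n) / n               # each frame matched to itself
T_uniform = np.full((n, n), 1.0 / n**2)  # every frame spread over all frames

cost_identity = fgw_objective(M, C, C, T_identity)
cost_uniform = fgw_objective(M, C, C, T_uniform)
```

In this toy case the identity coupling has zero cost because both the feature and structure terms vanish, while the uniform coupling is penalized by both. Note that the structure term alone cannot distinguish a coupling from its temporal reversal, which is one reason a fused objective also includes a feature term.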

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging