Learning Action Hierarchies via Hybrid Geometric Diffusion
Arjun Ramesh Kaushik, Nalini K. Ratha, Venu Govindaraju

TL;DR
This paper introduces HybridTAS, a diffusion-based framework that uses hyperbolic geometry to explicitly model the hierarchical structure of human actions, improving temporal action segmentation on the GTEA, 50Salads, and Breakfast benchmarks.
Contribution
HybridTAS integrates Euclidean and hyperbolic geometries into the denoising process of diffusion models, exploiting action hierarchies to improve segmentation accuracy.
Findings
Achieves state-of-the-art results on GTEA, 50Salads, and Breakfast datasets.
Effectively models hierarchical action structures with hyperbolic geometry.
Demonstrates the benefit of coarse-to-fine denoising guided by action hierarchy.
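As a rough illustration of why hyperbolic geometry suits hierarchies, the sketch below computes the geodesic distance in the Poincaré ball, one standard model of hyperbolic space. Whether HybridTAS uses this particular model is an assumption, and the `root` and `leaf` points are hypothetical embeddings chosen only for the example.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    Standard Poincare-ball formula:
    d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

# Hypothetical embeddings: a coarse "root" action at the origin and two
# fine-grained "leaf" actions near the boundary of the ball.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

d_root_leaf = poincare_distance(root, leaf_a)
d_leaf_leaf = poincare_distance(leaf_a, leaf_b)
```

In this toy setup the two leaves are farther from each other than either is from the root, and the leaf-to-leaf distance approaches the length of the path through the root. That is the tree-like behaviour that makes points near the origin natural candidates for coarse categories.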
Abstract
Temporal action segmentation is a critical task in video understanding, where the goal is to assign an action label to each frame in a video. While recent advances leverage iterative refinement-based strategies, they fail to explicitly utilize the hierarchical nature of human actions. In this work, we propose HybridTAS, a novel framework that incorporates a hybrid of Euclidean and hyperbolic geometries into the denoising process of diffusion models to exploit the hierarchical structure of actions. Hyperbolic geometry naturally captures tree-like relationships between embeddings, enabling us to guide the action label denoising process in a coarse-to-fine manner: higher diffusion timesteps are influenced by abstract, high-level action categories (root nodes), while lower timesteps are refined using fine-grained action classes (leaf nodes). Extensive experiments on three benchmark datasets,…
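The coarse-to-fine idea in the abstract can be sketched as a timestep-dependent blend of two guidance terms. The abstract does not state the actual schedule, so the linear weighting below is purely an assumption, and `hierarchy_weights` and `blended_guidance` are hypothetical names, not part of HybridTAS.

```python
def hierarchy_weights(t: int, num_steps: int) -> tuple[float, float]:
    """Return (coarse_weight, fine_weight) for diffusion timestep t.

    Assumed linear schedule (not taken from the paper): at t = num_steps
    (most noise) only the coarse, root-level term counts; at t = 0 only
    the fine, leaf-level term counts.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0 <= t <= num_steps:
        raise ValueError("t must lie in [0, num_steps]")
    coarse = t / num_steps
    return coarse, 1.0 - coarse

def blended_guidance(coarse_term: float, fine_term: float,
                     t: int, num_steps: int) -> float:
    """Mix a root-level and a leaf-level guidance value for timestep t."""
    w_coarse, w_fine = hierarchy_weights(t, num_steps)
    return w_coarse * coarse_term + w_fine * fine_term

# Example: halfway through denoising, both levels contribute equally.
midpoint = blended_guidance(coarse_term=2.0, fine_term=4.0, t=500, num_steps=1000)
```

Any monotone schedule would express the same intent. The linear one is used here only because it is the simplest to read.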
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
