SemanticMoments: Training-Free Motion Similarity via Third Moment Features
Saar Huberman, Kfir Goldberg, Or Patashnik, Sagie Benaim, Ron Mokady

TL;DR
SemanticMoments is a training-free approach that computes third-moment statistics over features from pre-trained semantic models to improve motion-similarity retrieval in videos, outperforming existing methods on the newly introduced SimMotion benchmarks.
Contribution
The paper presents SemanticMoments, a training-free method that uses higher-order temporal moments of semantic features to measure motion similarity, addressing the limitations of both appearance-biased video representations and traditional motion inputs such as optical flow.
Findings
SemanticMoments outperforms RGB, flow, and text-supervised methods on the SimMotion benchmarks.
Existing models struggle to disentangle motion from appearance.
The new SimMotion benchmarks expose the appearance bias in current video representations.
Abstract
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks,…
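As a rough illustration of the idea described in the abstract, the sketch below computes per-dimension temporal statistics (mean, variance, and third central moment) over a sequence of per-frame feature vectors, then compares two clips by cosine similarity. The function names, the choice to concatenate all three statistics, and the use of cosine similarity are assumptions made for illustration. The abstract does not specify the exact formulation, and in practice the per-frame features would come from a pre-trained semantic model rather than hand-written lists.

```python
import math


def temporal_moments(frames):
    """Hypothetical helper: summarize a clip by temporal statistics.

    frames: list of T per-frame feature vectors, each a list of D floats.
    Returns the concatenation [mean, variance, third central moment],
    computed independently per feature dimension (length 3 * D).
    """
    T = len(frames)
    D = len(frames[0])
    mean = [sum(f[d] for f in frames) / T for d in range(D)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / T for d in range(D)]
    third = [sum((f[d] - mean[d]) ** 3 for f in frames) / T for d in range(D)]
    return mean + var + third


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy one-dimensional "clip": the feature jumps in the last frame.
moving_clip = [[0.0], [0.0], [3.0]]
print(temporal_moments(moving_clip))  # [1.0, 2.0, 2.0]

# A static clip with the same mean has zero variance and zero third moment,
# so the higher-order terms are what separate the two descriptors.
static_clip = [[1.0], [1.0], [1.0]]
print(cosine_similarity(temporal_moments(moving_clip),
                        temporal_moments(static_clip)))
```

In this toy example the two clips share the same temporal mean, so a mean-pooled descriptor alone would not distinguish them. Only the variance and third-moment entries differ.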
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Human Motion and Animation
