Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos

Xavier Thomas; Youngsun Lim; Ananya Srinivasan; Audrey Zheng; Deepti Ghadiyaram

arXiv:2512.01803·cs.CV·December 4, 2025

Open Access · 1 dataset

TL;DR

This paper introduces a novel evaluation metric for human action quality in generated videos. The metric combines appearance and skeletal features to assess motion plausibility and temporal consistency, and it outperforms existing methods.

Contribution

A new action quality metric based on a learned latent space of real human motions, integrating skeletal geometry with appearance features for improved evaluation.

Findings

01 Improves on state-of-the-art methods by more than 68% on the paper's benchmark.

02 Correlates more strongly with human perception of action quality than existing methods.

03 Performs well on established external benchmarks.
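
Agreement between an automatic metric and human perception, as in the second finding, is often summarized with a rank correlation. The sketch below computes Spearman's rho in plain Python. It is an illustration only: this page does not say which correlation statistic the paper reports, and the scores and ratings in the usage lines are made-up placeholder values, not results from the paper.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-video metric scores and mean human ratings (placeholders).
metric_scores = [0.91, 0.40, 0.75, 0.15]
human_ratings = [4.5, 2.0, 4.0, 1.0]
rho = spearman(metric_scores, human_ratings)
```

A rank correlation is a natural fit here because it only asks whether the metric orders videos the same way human raters do, regardless of the scale of either score.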

Abstract

Despite rapid advances in video generative models, robust metrics for evaluating visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by…
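
The fusion of skeletal and appearance features described in the abstract can be pictured with a small sketch. Everything here is an assumption made for illustration: the abstract does not state the fusion operator, so the code simply concatenates L2-normalized per-frame vectors, and the function names `l2_normalize`, `fuse_frame`, and `fuse_video` are hypothetical. The scoring step is truncated in the abstract and is deliberately not sketched.

```python
import math
from typing import Sequence


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length; an all-zero vector stays all zeros."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        return [0.0] * len(vec)
    return [v / norm for v in vec]


def fuse_frame(skeleton_feat: Sequence[float],
               appearance_feat: Sequence[float]) -> list[float]:
    """Assumed fusion: concatenate the two normalized feature vectors."""
    return l2_normalize(skeleton_feat) + l2_normalize(appearance_feat)


def fuse_video(skeleton_seq: Sequence[Sequence[float]],
               appearance_seq: Sequence[Sequence[float]]) -> list[list[float]]:
    """Fuse per-frame features for a whole clip, one fused vector per frame."""
    if len(skeleton_seq) != len(appearance_seq):
        raise ValueError("both streams must cover the same number of frames")
    return [fuse_frame(s, a) for s, a in zip(skeleton_seq, appearance_seq)]


# Toy two-frame clip with placeholder feature values.
fused = fuse_video(
    skeleton_seq=[[3.0, 4.0], [0.0, 2.0]],
    appearance_seq=[[0.0, 5.0, 0.0], [1.0, 0.0, 0.0]],
)
```

Normalizing each stream before concatenation keeps either feature type from dominating purely because of its magnitude. That is one common design choice, not necessarily the one the authors made.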

Code & Models

Datasets

dghadiya/TAG-Bench-Video
dataset · 1.6k downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Human Motion and Animation