SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing

Varun Biyyala; Bharat Chanderprakash Kathuria; Jialu Li; Youshan Zhang

arXiv:2501.07554 · cs.CV · January 14, 2025

TL;DR

SST-EM is a comprehensive evaluation framework for video editing that assesses semantic accuracy, spatial consistency, and temporal smoothness using advanced vision-language models, object detection, and temporal analysis.

Contribution

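It introduces a unified metric that combines four components (VLM-based semantic extraction, object tracking, LLM-based object refinement, and ViT-based temporal consistency), with weights derived from human evaluations and regression. The metric is intended to evaluate video editing quality more effectively than traditional metrics such as CLIP text and image scores.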
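As a rough illustration of how a unified metric of this kind could be assembled, the sketch below combines four per-component scores with a weighted sum and fits the weights to human ratings by ordinary least squares. The abstract says only that the weights come from "human evaluations and regression"; the component names, the no-intercept linear model, and the toy numbers are assumptions made for illustration, not the paper's procedure.

```python
from typing import List, Sequence

# Hypothetical component names; the paper's own naming may differ.
COMPONENTS = ("semantic", "object", "refined_object", "temporal")


def combine(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-component scores (one score per component)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        if abs(m[col][col]) < 1e-12:
            raise ValueError("singular system: component scores are collinear")
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_weights(features: List[List[float]], human: List[float]) -> List[float]:
    """Least-squares weights (no intercept) from the normal equations
    X^T X w = X^T y, where each row of `features` holds the component
    scores for one video and `human` holds the matching human rating."""
    k = len(features[0])
    xtx = [[sum(row[i] * row[j] for row in features) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(features, human)) for i in range(k)]
    return solve_linear(xtx, xty)
```

In use, `fit_weights` would be called once on a set of videos that have both component scores and human ratings, and `combine` would then score new videos with the fitted weights. A real pipeline would more likely use a library regressor and validate the fit on held-out ratings.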

Findings

01. Outperforms traditional metrics in assessing semantic fidelity.

02. Effectively measures temporal consistency in edited videos.

03. Source code is publicly available.

Abstract

Video editing models have advanced significantly, but evaluating their performance remains challenging. Traditional metrics, such as CLIP text and image scores, often fall short: text scores are limited by inadequate training data and hierarchical dependencies, while image scores fail to assess temporal consistency. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), Object Detection, and Temporal Consistency checks. SST-EM comprises four components: (1) semantic extraction from frames using a VLM, (2) primary object tracking with Object Detection, (3) focused object refinement via an LLM agent, and (4) temporal consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression…
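One common way to turn per-frame ViT features into a temporal consistency score is to average the cosine similarity between embeddings of consecutive frames. The sketch below assumes the embeddings have already been extracted by some ViT and are passed in as plain lists of floats. The function names and the choice of mean cosine similarity are illustrative assumptions, since the abstract does not specify how component (4) is computed.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def temporal_consistency(frame_embeddings: List[Sequence[float]]) -> float:
    """Mean cosine similarity over consecutive frame pairs.

    `frame_embeddings` is assumed to hold one embedding per frame, in
    temporal order (for example, a ViT's pooled output for each frame).
    """
    if len(frame_embeddings) < 2:
        raise ValueError("need at least two frames")
    sims = [cosine(a, b)
            for a, b in zip(frame_embeddings, frame_embeddings[1:])]
    return sum(sims) / len(sims)
```

A video whose frame embeddings barely change scores close to 1, while abrupt frame-to-frame changes pull the score down. Comparing only adjacent frames keeps the cost linear in the number of frames, but it will not catch slow drift across a long clip.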

Code & Models

Repositories

custommetrics-sst/sst_customevaluationmetrics (Official)

Taxonomy

Methods: Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Attention Is All You Need · Vision Transformer · Multi-Head Attention