Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation
Arnav Gupta, Gurekas Singh Sahney, Hardik Rathi, Abhishek Chandwani, Ishaan Gupta, Pratik Narang, Dhruv Kumar

TL;DR
This paper introduces a vision-language model framework that uses audiovisual features to evaluate and predict audience engagement in short-form edutainment videos, moving beyond traditional quality metrics.
Contribution
The paper presents a data-driven evaluation method that uses VLMs to extract audiovisual features, clusters them into interpretable factors, and trains a lightweight regressor to predict engagement, which makes the evaluation more explainable and scalable.
Findings
Strong correlation between predicted and actual engagement.
VLM-derived features give interpretable insight into how audiovisual attributes influence engagement.
Outperforms traditional metrics like SSIM and FID in engagement prediction.
Abstract
Evaluating short-form video content requires moving beyond surface-level quality metrics toward human-aligned, multimodal reasoning. While existing frameworks like VideoScore-2 assess visual and semantic fidelity, they do not capture how specific audiovisual attributes drive real audience engagement. In this work, we propose a data-driven evaluation framework that uses Vision-Language Models (VLMs) to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form edutainment videos. Our curated YouTube Shorts dataset enables systematic analysis of how VLM-derived features relate to human engagement behavior. Experiments show strong correlations between predicted and actual engagement, demonstrating that our lightweight, feature-based evaluator provides interpretable and scalable assessments…
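The pipeline the abstract describes (extract features, cluster them into factors, fit a regression evaluator, compare predictions with actual engagement) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature matrix is synthetic stand-in data rather than real VLM output, and the choice of k-means over feature columns, ridge regression, and Pearson correlation are all assumptions made for illustration. Every function name below is hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means (assumed clustering method); returns one label per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def fit_factors(X_train, k, seed=0):
    """Standardize features, then cluster the feature columns into k groups."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8
    labels = kmeans(((X_train - mu) / sigma).T, k, seed=seed)
    return mu, sigma, labels


def to_factors(X, mu, sigma, labels):
    """Score each factor as the mean of the standardized features in its cluster."""
    Z = (X - mu) / sigma
    return np.column_stack(
        [Z[:, labels == j].mean(axis=1) for j in np.unique(labels)]
    )


def fit_ridge(F, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept term."""
    A = np.column_stack([F, np.ones(len(F))])
    penalty = alpha * np.eye(A.shape[1])
    penalty[-1, -1] = 0.0
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)


def predict(F, w):
    return F @ w[:-1] + w[-1]


def run_demo(n=200, d=12, k=3, seed=0):
    """Run the sketch end to end on synthetic data; returns (correlation, labels)."""
    rng = np.random.default_rng(seed)
    # Synthetic stand-in for VLM-derived features: each of the d features is a
    # noisy copy of one of k hidden factors. Real features would come from a VLM.
    latent = rng.normal(size=(n, k))
    X = latent[:, np.arange(d) % k] + 0.3 * rng.normal(size=(n, d))
    # Synthetic engagement score driven by the hidden factors plus noise.
    y = latent @ np.array([1.0, -0.5, 0.8]) + 0.2 * rng.normal(size=n)

    split = int(0.75 * n)
    mu, sigma, labels = fit_factors(X[:split], k, seed=seed)
    w = fit_ridge(to_factors(X[:split], mu, sigma, labels), y[:split])
    y_pred = predict(to_factors(X[split:], mu, sigma, labels), w)

    # Pearson correlation between predicted and actual held-out engagement.
    r = np.corrcoef(y_pred, y[split:])[0, 1]
    return r, labels


if __name__ == "__main__":
    r, labels = run_demo()
    print("feature-to-factor assignment:", labels)
    print("held-out Pearson correlation:", round(float(r), 3))
```

Because each factor is an average of a named group of features and the regressor is linear, the fitted weights can be read directly as the direction and relative strength of each factor's association with engagement, which is the kind of interpretability the abstract attributes to a feature-based evaluator.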
Taxonomy
Topics: Visual Attention and Saliency Detection · Image and Video Quality Assessment · Video Analysis and Summarization
