Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model
Lu Xu, Sijie Zhu, Chunyuan Li, Chia-Wen Kuo, Fan Chen, Xinyao Wang, Guang Chen, Dawei Du, Ye Yuan, Longyin Wen

TL;DR
This paper introduces EditVid-QA, a new benchmark for video question answering on edited social media videos, highlighting the domain gap in current models and proposing training data and evaluation improvements.
Contribution
The paper creates a novel benchmark for edited videos, analyzes existing models' poor performance, and proposes training data and evaluation protocols to enhance generalization.
Findings
Open-source LMMs perform poorly on edited videos.
Training on both raw and edited videos improves performance.
Evaluating with a GPT-4 judge avoids the 'sorry' attack, in which an apology-style non-answer still receives a high score from the judge.
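The 'sorry' attack in the last finding can be illustrated with a simple guard: flag apology-style non-answers before they reach an LLM judge, so they cannot collect a high score. The sketch below is illustrative only and is not taken from the paper. The phrase list, the `judge` callable, and the `floor` score are all assumptions.

```python
import re
from typing import Callable, Iterable

# Hypothetical phrase list: this summary does not state which keywords
# (if any) the authors filter on.
_SORRY_RE = re.compile(
    r"\bsorry\b|\bi can(?:not|'t)\b|\bunable to\b",
    re.IGNORECASE,
)


def is_sorry_answer(answer: str) -> bool:
    """Return True if the answer looks like an apology-style refusal."""
    return bool(_SORRY_RE.search(answer))


def mean_score_with_filter(
    answers: Iterable[str],
    judge: Callable[[str], float],
    floor: float = 0.0,
) -> float:
    """Average judge scores, assigning `floor` to flagged answers.

    `judge` stands in for any scoring function (for example a wrapper
    around an LLM judge); it is an assumption, not a real API.
    """
    scores = [
        floor if is_sorry_answer(a) else judge(a)
        for a in answers
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

A keyword filter like this will also flag legitimate answers that happen to contain "sorry" (for instance, describing a character apologizing), so it is best treated as a sanity check alongside a stronger judge rather than a replacement for one.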
Abstract
The emerging video LMMs (Large Multimodal Models) have achieved significant improvements on generic video understanding in the form of VQA (Visual Question Answering), where the raw videos are captured by cameras. However, a large portion of videos in real-world applications are edited videos, e.g., users usually cut and add effects/modifications to the raw video before publishing it on social media platforms. The edited videos usually have high view counts but they are not covered in existing benchmarks of video LMMs, i.e., ActivityNet-QA, or VideoChatGPT benchmark. In this paper, we leverage the edited videos on a popular short video platform, i.e., TikTok, and build a video VQA benchmark (named EditVid-QA) covering four typical editing categories, i.e., effect, funny, meme, and game. Funny and meme videos benchmark nuanced understanding and high-level…
Taxonomy
Topics: Digital Storytelling and Education
Methods: Sparse Evolutionary Training · Cosine Annealing · Residual Connection · Discriminative Fine-Tuning · Softmax · Layer Normalization · GPT · Byte Pair Encoding
