Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model

Lu Xu; Sijie Zhu; Chunyuan Li; Chia-Wen Kuo; Fan Chen; Xinyao Wang; Guang Chen; Dawei Du; Ye Yuan; Longyin Wen

arXiv:2406.10484 · cs.CV · September 30, 2024

TL;DR

This paper introduces EditVid-QA, a new benchmark for video question answering on edited social media videos, highlighting the domain gap in current models and proposing training data and evaluation improvements.

Contribution

The paper creates a novel benchmark for edited videos, analyzes existing models' poor performance, and proposes training data and evaluation protocols to enhance generalization.

Findings

01

Open-source LMMs perform poorly on edited videos.

02

Training on both raw and edited videos improves performance.

03

Evaluating with a GPT-4 judge avoids the "sorry" attack, in which a naive apology-style answer receives a high rating from the judge model.

Abstract

The emerging video LMMs (Large Multimodal Models) have achieved significant improvements on generic video understanding in the form of VQA (Visual Question Answering), where the raw videos are captured by cameras. However, a large portion of videos in real-world applications are edited videos, e.g., users usually cut and add effects/modifications to the raw video before publishing it on social media platforms. The edited videos usually have high view counts but they are not covered in existing benchmarks of video LMMs, i.e., ActivityNet-QA, or VideoChatGPT benchmark. In this paper, we leverage the edited videos on a popular short video platform, i.e., TikTok, and build a video VQA benchmark (named EditVid-QA) covering four typical editing categories, i.e., effect, funny, meme, and game. Funny and meme videos benchmark nuanced understanding and high-level…

Code & Models

Repositories

XenonLamb/EditVid-QA (Official)

Taxonomy

Topics: Digital Storytelling and Education

Methods: Sparse Evolutionary Training · Cosine Annealing · Residual Connection · Discriminative Fine-Tuning · Softmax · Layer Normalization · GPT · Byte Pair Encoding