Semi-Parametric Video-Grounded Text Generation
Sungdong Kim, Jin-Hwa Kim, Jiyoung Lee, Minjoon Seo

TL;DR
This paper introduces SeViT, a semi-parametric model for long video-grounded text generation that efficiently retrieves relevant frames and aggregates information, outperforming existing methods on multiple datasets.
Contribution
SeViT presents a novel semi-parametric approach combining non-parametric frame retrieval with parametric generation for scalable long video-language modeling.
Findings
Outperforms previous models on four datasets in accuracy and CIDEr.
Excels in long untrimmed videos and causal understanding.
Efficiently handles large video data without quadratic computational costs.
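To make the scaling claim concrete, the following back-of-the-envelope sketch counts scored token pairs under two simplified cost models. It is an illustration only: the function names, the per-frame token count, and the assumption that the k retrieved frames are attended jointly are hypothetical choices for this example, not figures or design details taken from the paper.

```python
def full_attention_pairs(num_frames, tokens_per_frame):
    """Token pairs scored when self-attention spans every frame jointly."""
    n = num_frames * tokens_per_frame
    return n * n


def retrieve_then_attend_pairs(num_frames, tokens_per_frame, k):
    """One similarity score per frame, then joint attention over k frames only.

    Assumption of this sketch: the k selected frames are attended together.
    """
    n = k * tokens_per_frame
    return num_frames + n * n


# Illustrative sizes only (not from the paper): 10 tokens per frame, k = 5.
short_video = (full_attention_pairs(100, 10), retrieve_then_attend_pairs(100, 10, k=5))
long_video = (full_attention_pairs(1000, 10), retrieve_then_attend_pairs(1000, 10, k=5))
```

Under these assumptions, a tenfold increase in video length multiplies the full-attention count by one hundred, while the retrieve-then-attend count grows only by the extra per-frame similarity scores.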
Abstract
Efficient video-language modeling should consider the computational cost because of a large, sometimes intractable, number of video frames. Parametric approaches such as the attention mechanism may not be ideal since their computational cost increases quadratically as the video length increases. Rather, previous studies have relied on offline feature extraction or frame sampling to represent the video efficiently, focusing on cross-modal modeling in short video clips. In this paper, we propose a semi-parametric video-grounded text generation model, SeViT, a novel perspective on scalable video-language modeling toward long untrimmed videos. Treating a video as an external data store, SeViT includes a non-parametric frame retriever to select a few query-relevant frames from the data store for a given query and a parametric generator to effectively aggregate the frames with the query via…
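The retrieval step described in the abstract can be pictured as a top-k similarity search over per-frame embeddings. The sketch below is a minimal illustration under stated assumptions: the embeddings are toy vectors, cosine similarity is an assumed scoring function, and `retrieve_top_k` is a hypothetical helper name. The generator that consumes the selected frames is left out, since the abstract excerpt here is truncated before describing it.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, frame_vecs, k):
    """Return indices of the k frames most similar to the query.

    Indices are returned in temporal order; that ordering is a choice
    made for this sketch, not something stated in the source.
    """
    scored = ((cosine(query_vec, f), i) for i, f in enumerate(frame_vecs))
    top = heapq.nlargest(k, scored)
    return sorted(i for _, i in top)


# Toy data store: five "frames" with hypothetical 3-d embeddings.
frames = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.8, 0.0, 0.2],
]
query = [1.0, 0.0, 0.0]
selected = retrieve_top_k(query, frames, k=2)
```

Only the frames indexed by `selected` would be handed to a generator together with the query, so the expensive parametric step sees a small, fixed number of frames regardless of video length.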
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
