ViLCo-Bench: VIdeo Language COntinual learning Benchmark
Tianqi Tang, Shohreh Deldari, Hao Xue, Celso De Melo, Flora D. Salim

TL;DR
This paper introduces ViLCo-Bench, a comprehensive benchmark for video-language continual learning, along with a memory-efficient framework that tackles challenges like long videos, complex language, and text-video misalignment.
Contribution
The study presents the first dedicated video-language continual learning benchmark and a novel framework that improves memory efficiency and handles complex video-text tasks.
Findings
ViLCo-Bench offers a more complex and realistic evaluation environment.
The proposed framework effectively manages long videos and open-ended language queries.
Experimental results demonstrate improved performance over existing methods.
Abstract
Video-language continual learning involves continuously adapting to information from video and text inputs, enhancing a model's ability to handle new tasks while retaining prior knowledge. The field is relatively under-explored, and establishing appropriate datasets is crucial for facilitating communication and research in it. In this study, we present the first dedicated benchmark, ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks. The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets. Additionally, we introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects. This framework addresses challenges including memory complexity from long video clips, natural language complexity from…
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
