BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, and Ge Li

TL;DR
BT-Adapter enables effective video conversation capabilities in image-language models by adding a lightweight temporal modeling branch, achieving state-of-the-art results without extensive video instruction tuning or high GPU costs.
Contribution
The paper introduces BT-Adapter, a plug-and-play temporal modeling module that extends image-language models to video tasks without retraining the entire backbone.
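As a rough illustration of the branching idea, the toy sketch below pairs a fixed per-frame encoder with a separate temporal branch that holds the only tunable parameter. It is a dependency-free stand-in, not the paper's implementation. The names `frozen_frame_encoder`, `TemporalBranch`, `gate`, and `encode_video` are hypothetical, and the mean-over-frames mixing and additive fusion are assumptions chosen for brevity.

```python
from typing import List

Vector = List[float]


def frozen_frame_encoder(frame: Vector) -> Vector:
    """Stand-in for a frozen image encoder: a fixed map applied per frame."""
    return [2.0 * x for x in frame]


class TemporalBranch:
    """Hypothetical trainable branch that mixes features across frames.

    `gate` is the only tunable parameter in this toy; the encoder above
    has none, which mirrors "tune the branch, keep the backbone frozen".
    """

    def __init__(self, gate: float = 0.0) -> None:
        self.gate = gate

    def __call__(self, frame_feats: List[Vector]) -> List[Vector]:
        n = len(frame_feats)
        dim = len(frame_feats[0])
        # Per-dimension mean over time: the simplest cross-frame signal.
        mean = [sum(f[d] for f in frame_feats) / n for d in range(dim)]
        # Each frame receives a gated pull toward the temporal mean.
        return [
            [self.gate * (mean[d] - f[d]) for d in range(dim)]
            for f in frame_feats
        ]


def encode_video(frames: List[Vector], branch: TemporalBranch) -> List[Vector]:
    """Add the branch output to the frozen per-frame features (assumed fusion)."""
    spatial = [frozen_frame_encoder(f) for f in frames]
    temporal = branch(spatial)
    return [[s + t for s, t in zip(sv, tv)] for sv, tv in zip(spatial, temporal)]


frames = [[1.0, 0.0], [3.0, 2.0]]
image_only = encode_video(frames, TemporalBranch(gate=0.0))
fully_mixed = encode_video(frames, TemporalBranch(gate=1.0))
```

With `gate=0.0` the output equals the frozen per-frame features, so the image model's behavior is unchanged. With `gate=1.0` every frame collapses to the temporal mean. A real branch would learn something between these extremes with far richer layers.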
Findings
Achieves state-of-the-art zero-shot video task performance with fewer GPU resources.
Outperforms existing video chatbots without video instruction tuning.
With video instruction tuning, sets a new state of the art in video conversation, surpassing previous SOTA results.
Abstract
The recent progress in Large Language Models (LLM) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system is still under exploration. Considering the extensive scale of LLM and visual backbone, minimal GPU memory is left for facilitating effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder, which is tuned while keeping the backbone frozen. Just pretrained once, BT-Adapter can be seamlessly integrated into all image conversation models using this version of CLIP, enabling video conversations without the need…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training · Adapter
