BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, and Ge Li

arXiv:2309.15785 · cs.CV · June 28, 2024 · 2 cites

TL;DR

BT-Adapter enables effective video conversation capabilities in image-language models by adding a lightweight temporal modeling branch, achieving state-of-the-art results without extensive video instruction tuning or high GPU costs.

Contribution

The paper introduces BT-Adapter, a plug-and-play temporal modeling module that extends image-language models to video tasks without retraining the entire backbone.

Findings

01 Achieves state-of-the-art zero-shot video task performance with fewer GPU resources.

02 Outperforms existing video chatbots without video instruction tuning.

03 Sets new benchmarks in video chatting with instruction tuning, surpassing previous SOTA results.

Abstract

Recent progress in Large Language Models (LLMs) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system remains under exploration. Given the extensive scale of the LLM and the visual backbone, minimal GPU memory is left for effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose the Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder; the branch is tuned while the backbone is kept frozen. Pretrained just once, BT-Adapter can be seamlessly integrated into all image conversation models using this version of CLIP, enabling video conversations without the need…
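The idea of a trainable temporal branch running alongside a frozen visual encoder can be illustrated with a small PyTorch sketch. This is not the authors' implementation: the class names, the tiny linear stand-in for the image encoder, the single self-attention layer over frames, and the averaging fusion are all assumptions chosen to keep the example short. It only shows the training setup the abstract describes, where the backbone receives no gradient updates and the branch does.

```python
import torch
import torch.nn as nn


class FrozenImageEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained per-frame visual encoder."""

    def __init__(self, in_dim: int = 32, dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        for p in self.parameters():
            p.requires_grad = False  # the backbone is kept frozen

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, in_dim) -> per-frame features (batch, time, dim)
        return self.proj(frames)


class TemporalBranch(nn.Module):
    """Trainable branch that mixes information across frames (assumed design)."""

    def __init__(self, dim: int = 16, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Self-attention along the time axis, with a residual connection.
        mixed, _ = self.attn(feats, feats, feats)
        return self.norm(feats + mixed)


class VideoEncoder(nn.Module):
    """Frozen backbone plus trainable temporal branch (illustrative only)."""

    def __init__(self, in_dim: int = 32, dim: int = 16):
        super().__init__()
        self.backbone = FrozenImageEncoder(in_dim, dim)
        self.branch = TemporalBranch(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(frames)  # no gradients flow into the backbone
        temporal = self.branch(feats)
        # Assumed fusion: average both paths, then mean-pool over time.
        return (0.5 * (feats + temporal)).mean(dim=1)


model = VideoEncoder()
video = torch.randn(2, 8, 32)  # 2 clips, 8 frames, 32-dim frame inputs
embedding = model(video)       # shape: (2, 16)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

In this sketch only parameters under `branch.` appear in `trainable`, so an optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` would update the temporal branch alone.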

Code & Models

Repositories

farewellthree/BT-Adapter
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training · Adapter