MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

Minghao Zhu, Zhengpu Wang, Mengxian Hu, Ronghao Dang, Xiao Lin, Xun Zhou, Chengju Liu, Qijun Chen

arXiv:2410.10589·cs.CV·October 15, 2024

TL;DR

MoTE introduces a unified framework that balances generalization and specialization in video recognition by tuning multiple temporal experts and employing weight merging regularization, achieving state-of-the-art results.

Contribution

The paper proposes MoTE, a framework that balances zero-shot and close-set video recognition performance in a single model by tuning a mixture of temporal experts and applying Weight Merging Regularization.

Findings

01. Achieves state-of-the-art results on Kinetics-400 & 600 datasets.

02. Balances zero-shot and close-set video recognition effectively.

03. Introduces weight merging regularization for expert knowledge preservation.

Abstract

Transferring visual-language knowledge from large-scale foundation models for video recognition has proved to be effective. To bridge the domain gap, additional parametric modules are added to capture the temporal information. However, zero-shot generalization diminishes with the increase in the number of specialized parameters, making existing works a trade-off between zero-shot and close-set performance. In this paper, we present MoTE, a novel framework that enables generalization and specialization to be balanced in one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with various degrees of data fitting. To maximally preserve the knowledge of each expert, we propose Weight Merging Regularization, which regularizes the merging process of experts in weight space. Additionally, with temporal feature modulation to regularize the…
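
The abstract mentions merging experts "in weight space" without giving the formulation. As a rough illustration of what that phrase generally means, and not the paper's actual procedure or its regularizer, the sketch below combines the parameters of several experts into one set of weights by a normalized weighted average. The function name `merge_experts`, the parameter name `proj`, and the choice of a plain convex combination are all assumptions made for this example.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: each expert is a mapping from a parameter
# name to a flat list of floats (a stand-in for a real tensor).
Weights = Dict[str, List[float]]


def merge_experts(experts: Sequence[Weights], coeffs: Sequence[float]) -> Weights:
    """Merge experts by a convex combination of their parameters.

    Assumption: all experts share the same parameter names and shapes.
    This is generic weight averaging, not the method from the paper.
    """
    if not experts or len(experts) != len(coeffs):
        raise ValueError("need one coefficient per expert")
    total = sum(coeffs)
    if total <= 0:
        raise ValueError("coefficients must sum to a positive value")
    norm = [c / total for c in coeffs]  # normalize so the weights sum to 1

    merged: Weights = {}
    for name, reference in experts[0].items():
        merged[name] = [
            sum(w * expert[name][i] for w, expert in zip(norm, experts))
            for i in range(len(reference))
        ]
    return merged


# Two toy experts with a single two-element parameter each.
toy_experts = [{"proj": [1.0, 2.0]}, {"proj": [3.0, 4.0]}]
merged = merge_experts(toy_experts, [1.0, 1.0])
```

With equal coefficients the merged parameter is the element-wise mean of the two experts; setting one coefficient to zero recovers the other expert unchanged.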

Code & Models

Repositories

zmhh-h/mote (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Natural Language Processing Techniques