MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer
Minghao Zhu, Zhengpu Wang, Mengxian Hu, Ronghao Dang, Xiao Lin, Xun Zhou, Chengju Liu, Qijun Chen

TL;DR
MoTE introduces a unified framework that balances generalization and specialization in video recognition by tuning multiple temporal experts and employing weight merging regularization, achieving state-of-the-art results.
Contribution
The paper proposes MoTE, a novel model that balances zero-shot and close-set performance through a mixture of temporal experts and weight merging regularization.
Findings
Achieves state-of-the-art results on Kinetics-400 & 600 datasets.
Balances zero-shot and close-set video recognition effectively.
Introduces weight merging regularization for expert knowledge preservation.
Abstract
Transferring visual-language knowledge from large-scale foundation models for video recognition has proved to be effective. To bridge the domain gap, additional parametric modules are added to capture the temporal information. However, zero-shot generalization diminishes with the increase in the number of specialized parameters, making existing works a trade-off between zero-shot and close-set performance. In this paper, we present MoTE, a novel framework that enables generalization and specialization to be balanced in one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with various degrees of data fitting. To maximally preserve the knowledge of each expert, we propose Weight Merging Regularization, which regularizes the merging process of experts in weight space. Additionally with temporal feature modulation to regularize the…
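As a rough illustration of what "merging experts in weight space" can mean, the sketch below combines several experts' parameters into one set by a convex combination. This is an assumption made for illustration only: the abstract excerpt does not specify how MoTE merges its temporal experts or how the regularizer is computed, and the function name `merge_experts`, the parameter names, and the toy values are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical representation: each expert is a dict mapping a parameter
# name to a flat list of floats. Real models would use tensors instead.
Weights = Dict[str, List[float]]


def merge_experts(experts: List[Weights],
                  coeffs: Optional[List[float]] = None) -> Weights:
    """Merge expert parameters by a convex combination in weight space.

    Assumption: uniform averaging by default. The paper's actual merging
    scheme is not given in this excerpt.
    """
    if not experts:
        raise ValueError("need at least one expert")
    if coeffs is None:
        coeffs = [1.0] * len(experts)
    if len(coeffs) != len(experts):
        raise ValueError("one coefficient per expert is required")

    # Normalize so the coefficients sum to 1.
    total = sum(coeffs)
    coeffs = [c / total for c in coeffs]

    merged: Weights = {}
    for name, reference in experts[0].items():
        merged[name] = [
            sum(c * expert[name][i] for c, expert in zip(coeffs, experts))
            for i in range(len(reference))
        ]
    return merged


# Toy example with two experts sharing the same parameter names.
experts = [
    {"proj.weight": [1.0, 2.0], "proj.bias": [0.0]},
    {"proj.weight": [3.0, 4.0], "proj.bias": [2.0]},
]
merged = merge_experts(experts)
weighted = merge_experts(experts, coeffs=[3.0, 1.0])
```

With uniform coefficients the merged model sits at the midpoint of the two experts in parameter space; non-uniform coefficients move it toward one expert. A regularizer on the merging process would add a training-time term that depends on such a merged model, but its exact form is not stated in this excerpt.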
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Natural Language Processing Techniques
