Unifying Specialized Visual Encoders for Video Language Models

Jihoon Chung; Tyler Zhu; Max Gonzalez Saez-Diez; Juan Carlos Niebles; Honglu Zhou; Olga Russakovsky

arXiv:2501.01426 · cs.CV · June 17, 2025

TL;DR

MERV enhances video understanding in VideoLLMs by integrating multiple specialized frozen visual encoders, leading to improved accuracy, faster training, and richer visual representations compared to single-encoder approaches.

Contribution

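This paper introduces MERV (Multi-Encoder Representation of Videos), a multi-encoder framework that unifies features from several frozen visual encoders for VideoLLMs. It reports higher accuracy than prior methods, with up to a 3.7% gain over Video-LLaVA, alongside faster training.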

Findings

01. Up to 3.7% accuracy improvement over Video-LLaVA

02. 2.2% better zero-shot Perception Test accuracy than SeViLA

03. Faster training with minimal parameter increase

Abstract

The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV, Multi-Encoder Representation of Videos, instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite video understanding…
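The idea of spatio-temporally aligning features from several encoders before merging them can be illustrated with a minimal sketch. This is not the MERV implementation: the abstract does not say how alignment or fusion is done, so the sketch assumes average pooling onto a shared (time, token) grid and a fixed weighted sum, and it assumes every encoder already outputs vectors of the same dimension. The function names (`pool`, `align`, `fuse`, `constant_features`) and the toy shapes are hypothetical, and encoder outputs are simulated with constant values.

```python
from statistics import fmean


def pool(seq, out_len, combine):
    """Split seq into out_len contiguous chunks and combine each chunk."""
    n = len(seq)
    if n % out_len != 0:
        raise ValueError("sequence length must be divisible by out_len")
    step = n // out_len
    return [combine(seq[i:i + step]) for i in range(0, n, step)]


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    return [fmean(col) for col in zip(*vectors)]


def mean_frames(frames):
    """Token-wise mean of frames that share the same token count."""
    return [mean_vectors(tokens) for tokens in zip(*frames)]


def align(features, t_out, n_out):
    """Average-pool (T, N, D) features onto a (t_out, n_out, D) grid."""
    frames = pool(features, t_out, mean_frames)          # pool over time
    return [pool(frame, n_out, mean_vectors) for frame in frames]  # pool over tokens


def fuse(aligned, weights):
    """Weighted sum of aligned features, one weight per encoder (assumed fusion rule)."""
    fused = []
    for frames in zip(*aligned):              # same time step across encoders
        fused_frame = []
        for tokens in zip(*frames):           # same token slot across encoders
            fused_frame.append([
                sum(w * x for w, x in zip(weights, dims))
                for dims in zip(*tokens)
            ])
        fused.append(fused_frame)
    return fused


def constant_features(t, n, d, value):
    """Hypothetical stand-in for a frozen encoder's (T, N, D) output."""
    return [[[value] * d for _ in range(n)] for _ in range(t)]


# Two simulated encoders with different temporal and spatial resolutions.
enc_a = constant_features(4, 4, 2, 1.0)   # 4 frames, 4 tokens per frame
enc_b = constant_features(2, 8, 2, 3.0)   # 2 frames, 8 tokens per frame

aligned = [align(f, t_out=2, n_out=2) for f in (enc_a, enc_b)]
fused = fuse(aligned, weights=[0.5, 0.5])
```

Once both feature sets sit on the same grid, each fused token combines information from the same region and time span of the video, which is the point of aligning before merging. A learned fusion module would replace the fixed weights in practice.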

Code & Models

Repositories

princetonvisualai/merv (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Sparse Evolutionary Training