Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles

Jun Xie; Xiongjun Guan; Yingjian Zhu; Zhaoran Zhao; Xinming Wang; Hongzhu Yi; Feng Chen; Zhepeng Wang

arXiv:2505.16784 · cs.CV · June 10, 2025

TL;DR

This paper demonstrates how leveraging differentiated prompting and ensemble strategies with large multimodal models significantly improves video understanding performance, outperforming previous state-of-the-art methods.

Contribution

The paper introduces a novel approach combining prompt diversification and model ensembling to enhance large model performance on video understanding tasks.

Findings

1. Direct use of a single multimodal model surpasses previous SOTA.

2. Ensemble of periodic results further boosts performance.

3. Systematic exploration of prompt styles improves model guidance.

Abstract

In this paper, we present the runner-up solution for the Ego4D EgoSchema Challenge at CVPR 2025 (confirmed on May 20, 2025). Inspired by the success of large models, we evaluate and leverage leading accessible multimodal large models and adapt them to video understanding tasks via few-shot learning and model ensemble strategies. Specifically, we systematically explore and evaluate diversified prompt styles and process paradigms to guide the attention of large models effectively and make full use of their generalization and adaptability. Experimental results demonstrate that, with our carefully designed approach, a single multimodal model used directly already outperforms the previous state-of-the-art (SOTA) method, which includes several additional processes. In addition, a further stage is introduced that facilitates the cooperation and ensemble of…
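As a rough illustration of the ensemble idea, the sketch below combines multiple-choice answers from several runs by majority vote. It is an assumption-laden example, not the authors' procedure: the abstract does not say how results are combined, so plain majority voting is swapped in here. The names `majority_vote` and `ensemble`, the run labels, and the sample answers are all hypothetical. The only fact relied on is that EgoSchema is a multiple-choice video question-answering benchmark.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(predictions: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not predictions:
        raise ValueError("no predictions to vote on")
    counts = Counter(predictions)
    best = max(counts.values())
    for answer in predictions:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")


def ensemble(runs: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Combine runs by per-question majority vote.

    `runs` maps a run label (hypothetically, one model/prompt-style pair)
    to its answers, keyed by question id. Runs that skipped a question
    simply do not vote on it.
    """
    question_ids = sorted({qid for answers in runs.values() for qid in answers})
    return {
        qid: majority_vote(
            [answers[qid] for answers in runs.values() if qid in answers]
        )
        for qid in question_ids
    }


# Hypothetical answers from three runs (assumed labels, made-up data).
runs = {
    "model_a/direct": {"q1": "B", "q2": "D"},
    "model_a/step_by_step": {"q1": "B", "q2": "A"},
    "model_b/direct": {"q1": "C", "q2": "A"},
}
combined = ensemble(runs)
print(combined)
```

Breaking ties by first-seen order means the order of `runs` matters, so a stronger run can be listed first. A weighted vote would be a natural alternative, but nothing in the abstract indicates which scheme the authors used.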

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need