Team of One: Cracking Complex Video QA with Model Synergy

Jun Xie; Zhaoran Zhao; Xiongjun Guan; Yingjian Zhu; Hongzhu Yi; Xinming Wang; Feng Chen; Zhepeng Wang

arXiv:2507.13820 · cs.CV · July 21, 2025

TL;DR

This paper introduces a multi-model reasoning framework for open-ended video question answering. Benchmarked on the CVRR-ES dataset, it is reported to improve reasoning depth, robustness, and generalization in complex real-world scenarios without retraining any of the underlying models.

Contribution

It presents a prompting-and-response integration mechanism that coordinates multiple heterogeneous Video-Language Models via structured chains of thought, guided by an external LLM for response selection and fusion.

Findings

01 Outperforms existing baselines across all evaluation metrics.

02 Demonstrates superior generalization and robustness in complex scenarios.

03 Provides a lightweight, extensible reasoning strategy without retraining models.

Abstract

We propose a novel framework for open-ended video question answering that enhances reasoning depth and robustness in complex real-world scenarios, as benchmarked on the CVRR-ES dataset. Existing Video-Large Multimodal Models (Video-LMMs) often exhibit limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries. To address these challenges, we introduce a prompting-and-response integration mechanism that coordinates multiple heterogeneous Video-Language Models (VLMs) via structured chains of thought, each tailored to distinct reasoning pathways. An external Large Language Model (LLM) serves as an evaluator and integrator, selecting and fusing the most reliable responses. Extensive experiments demonstrate that our method significantly outperforms existing baselines across all evaluation metrics, showcasing superior generalization…
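The coordination described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the prompt templates, the `Candidate` record, the stub models, the length-based stub judge, and the concatenation used for fusion are all hypothetical placeholders chosen so the example runs without any model weights. It only illustrates the overall shape: several video-language models each answer under several chain-of-thought prompts, then an external judge scores the candidates and the best ones are fused.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Candidate:
    """One answer from one model under one reasoning pathway (hypothetical record)."""
    model_name: str
    pathway: str
    answer: str


# Hypothetical interfaces: any callable with these shapes can be plugged in.
VideoModel = Callable[[str, str], str]                # (video_path, prompt) -> answer
Judge = Callable[[str, List[Candidate]], List[float]]  # (question, candidates) -> scores

# Hypothetical chain-of-thought templates, one per reasoning pathway.
COT_TEMPLATES: Dict[str, str] = {
    "temporal": "List the events in order, step by step, then answer: {question}",
    "contextual": "Describe the scene and its context, then answer: {question}",
}


def collect_candidates(video: str, question: str,
                       models: Dict[str, VideoModel],
                       templates: Dict[str, str] = COT_TEMPLATES) -> List[Candidate]:
    """Query every model with every pathway prompt and gather the answers."""
    candidates = []
    for model_name, model in models.items():
        for pathway, template in templates.items():
            prompt = template.format(question=question)
            candidates.append(Candidate(model_name, pathway, model(video, prompt)))
    return candidates


def select_and_fuse(question: str, candidates: List[Candidate],
                    judge: Judge, top_k: int = 2) -> str:
    """Keep the top_k candidates by judge score and join their distinct answers.

    Plain concatenation is a stand-in for fusion; in practice the external LLM
    would be asked to write the merged answer.
    """
    scores = judge(question, candidates)
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    kept = [candidate for _, candidate in ranked[:top_k]]
    distinct_answers = list(dict.fromkeys(c.answer for c in kept))
    return " ".join(distinct_answers)


# Stub components (assumptions for the demo only, not real models).
def stub_vlm_a(video: str, prompt: str) -> str:
    return "A person pretends to drink from an empty cup."


def stub_vlm_b(video: str, prompt: str) -> str:
    return "A person drinks water."


def stub_judge(question: str, candidates: List[Candidate]) -> List[float]:
    # Placeholder scoring: longer answers score higher. A real judge would be an LLM call.
    return [float(len(c.answer)) for c in candidates]


question = "What is the person doing with the cup?"
candidates = collect_candidates("clip.mp4", question,
                                {"vlm_a": stub_vlm_a, "vlm_b": stub_vlm_b})
fused = select_and_fuse(question, candidates, stub_judge)
print(len(candidates), "candidates; fused answer:", fused)
```

Because models and the judge are plain callables, swapping in a real model client or adding a reasoning pathway only changes the dictionaries passed in, which matches the "no retraining" framing of the paper; the actual prompts and fusion procedure are described in the paper itself.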

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning