MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering
Jisheng Dang, Huilin Song, Junbin Xiao, Bimei Wang, Han Peng, Haoxuan Li, Xun Yang, Meng Wang, Tat-Seng Chua

TL;DR
MUPA introduces a multi-path reasoning framework for grounded video question answering that improves grounding accuracy and achieves state-of-the-art results by unifying grounding, QA, and reflection in a cooperative system.
Contribution
The paper presents MUPA, a novel multi-path agentic approach that unifies grounding, question answering, and reflection to enhance grounded video QA performance.
Findings
Outperforms all 7B-scale competitors while using only 2B parameters.
Achieves new state-of-the-art accuracy on NExT-GQA and DeVE-QA datasets.
Improves grounding fidelity without sacrificing answer accuracy.
Abstract
Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, and answer reflection and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths that invoke the grounding and QA agents in different chronological orders, along with a dedicated reflection agent that judges and aggregates the multi-path results to produce consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA…
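To make the multi-path idea concrete, here is a minimal Python sketch of how grounding and QA agents might be chained in different orders and then reconciled. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not spell out the three paths, so the orderings used here (ground-then-answer, answer-then-ground, and independent grounding and answering) are guesses. The agent signatures, the `PathResult` type, and the majority-vote plus segment-averaging rule in `reflect` are likewise hypothetical stand-ins for the learned agents described in the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of the visual evidence

# Hypothetical agent interfaces (assumptions, not the paper's API):
#   grounding agent: (question, optional answer hint) -> evidence segment
#   QA agent:        (question, optional segment hint) -> answer string
GroundFn = Callable[[str, Optional[str]], Segment]
AnswerFn = Callable[[str, Optional[Segment]], str]


@dataclass
class PathResult:
    path: str
    answer: str
    segment: Segment


def run_paths(question: str, ground: GroundFn, answer: AnswerFn) -> List[PathResult]:
    """Run three assumed reasoning paths that order the two agents differently."""
    # Path 1 (assumed): ground first, then answer conditioned on the segment.
    seg1 = ground(question, None)
    ans1 = answer(question, seg1)

    # Path 2 (assumed): answer first, then ground conditioned on the answer.
    ans2 = answer(question, None)
    seg2 = ground(question, ans2)

    # Path 3 (assumed): ground and answer independently of each other.
    seg3 = ground(question, None)
    ans3 = answer(question, None)

    return [
        PathResult("ground_then_answer", ans1, seg1),
        PathResult("answer_then_ground", ans2, seg2),
        PathResult("independent", ans3, seg3),
    ]


def reflect(results: List[PathResult]) -> PathResult:
    """Stand-in for the reflection agent: majority-vote the answer, then
    average the segments of the paths that agree with the winning answer."""
    votes = Counter(r.answer for r in results)
    best_answer, _ = votes.most_common(1)[0]
    agreeing = [r.segment for r in results if r.answer == best_answer]
    start = sum(s for s, _ in agreeing) / len(agreeing)
    end = sum(e for _, e in agreeing) / len(agreeing)
    return PathResult("aggregated", best_answer, (start, end))
```

In use, `ground` and `answer` would wrap calls to the underlying multimodal model; passing the output of `run_paths` to `reflect` yields a single answer with one evidence segment. A simple vote is only a placeholder for the reflection step, which the abstract describes as a dedicated agent that judges the multi-path results.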
