UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
Jason Nguyen, Ameet Rao, Alexander Chang, Ishaan Kumar, Erin Tan

TL;DR
UpstreamQA introduces a modular framework that combines explicit upstream reasoning modules with large multimodal models to improve interpretability and performance in Video Question Answering tasks.
Contribution
The paper presents a novel modular framework that disentangles and evaluates core video reasoning components using explicit reasoning modules before downstream answering.
Findings
Explicit upstream reasoning can improve both interpretability and performance in VideoQA.
Adding reasoning modules can degrade performance when the baseline is already strong.
UpstreamQA enhances diagnostic transparency in multimodal video understanding.
Abstract
Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task's inherent complexity often requires multi-step reasoning that current large multimodal models (LMMs) perform implicitly, leaving their internal decision process opaque. In contrast, large reasoning models (LRMs) explicitly generate intermediate logical steps that enhance interpretability and can improve multi-hop reasoning accuracy. Yet, these models are not designed for native video understanding, as they typically rely on static frame sampling. We propose UpstreamQA, a modular framework that disentangles and evaluates core video reasoning components through explicit upstream reasoning modules. Specifically, we employ multimodal LRMs to perform object identification and scene context generation before passing enriched reasoning traces to downstream LMMs…
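The upstream-then-downstream flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`UpstreamTrace`, `build_downstream_prompt`, `answer_with_upstream_reasoning`) and the prompt layout are hypothetical, and the three model calls are passed in as plain callables so that no particular model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Placeholder type: e.g. paths or IDs of sampled frames (an assumption).
Frames = Sequence[str]


@dataclass
class UpstreamTrace:
    """Intermediate reasoning text produced before answering (hypothetical structure)."""
    objects: str
    scene_context: str


def build_downstream_prompt(question: str, trace: UpstreamTrace) -> str:
    # Assumed prompt layout; the source does not specify one.
    return (
        f"Objects identified: {trace.objects}\n"
        f"Scene context: {trace.scene_context}\n"
        f"Question: {question}"
    )


def answer_with_upstream_reasoning(
    frames: Frames,
    question: str,
    identify_objects: Callable[[Frames], str],      # stands in for an upstream LRM call
    describe_scene: Callable[[Frames], str],        # stands in for an upstream LRM call
    downstream_answer: Callable[[Frames, str], str],  # stands in for the downstream LMM
) -> Tuple[str, UpstreamTrace]:
    # Upstream stage: run each explicit reasoning module on the video frames.
    trace = UpstreamTrace(
        objects=identify_objects(frames),
        scene_context=describe_scene(frames),
    )
    # Downstream stage: answer the question with the trace added to the prompt.
    prompt = build_downstream_prompt(question, trace)
    return downstream_answer(frames, prompt), trace
```

Returning the trace alongside the answer keeps the intermediate reasoning inspectable, and swapping one callable for a no-op is one way to measure how much a single upstream module contributes.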
