UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

Jason Nguyen; Ameet Rao; Alexander Chang; Ishaan Kumar; Erin Tan

arXiv:2604.23145 · cs.CV · April 28, 2026

TL;DR

UpstreamQA introduces a modular framework that combines explicit upstream reasoning modules with large multimodal models to improve interpretability and performance in Video Question Answering tasks.

Contribution

The paper presents a modular framework that disentangles and evaluates core video reasoning components by running explicit upstream reasoning modules before a downstream model answers the question.

Findings

01 Explicit reasoning boosts VideoQA interpretability and performance.

02 Adding reasoning modules can degrade performance when the baseline is already high.

03 UpstreamQA enhances diagnostic transparency in multimodal video understanding.

Abstract

Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task's inherent complexity often requires multi-step reasoning that current large multimodal models (LMMs) perform implicitly, leaving their internal decision process opaque. In contrast, large reasoning models (LRMs) explicitly generate intermediate logical steps that enhance interpretability and can improve multi-hop reasoning accuracy. Yet, these models are not designed for native video understanding, as they typically rely on static frame sampling. We propose UpstreamQA, a modular framework that disentangles and evaluates core video reasoning components through explicit upstream reasoning modules. Specifically, we employ multimodal LRMs to perform object identification and scene context generation before passing enriched reasoning traces to downstream LMMs…
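
The upstream-then-downstream flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`ReasoningTrace`, `upstream_qa`, the stub modules) is hypothetical, frames are represented as plain strings, and the models are replaced by simple callables so the example runs without any model dependency. It only shows how optional upstream modules might enrich the prompt passed to a downstream answerer, and how each module can be switched off to evaluate its contribution in isolation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in: a real system would pass image data, not strings.
Frame = str


@dataclass
class ReasoningTrace:
    """Intermediate outputs produced by the upstream modules (hypothetical type)."""
    objects: List[str] = field(default_factory=list)
    scene_context: str = ""

    def as_prompt(self) -> str:
        # Serialize only the fields that an upstream module actually filled in.
        parts = []
        if self.objects:
            parts.append("Objects: " + ", ".join(self.objects))
        if self.scene_context:
            parts.append("Scene: " + self.scene_context)
        return "\n".join(parts)


def upstream_qa(
    frames: List[Frame],
    question: str,
    answer_model: Callable[[List[Frame], str], str],
    object_module: Optional[Callable[[List[Frame]], List[str]]] = None,
    scene_module: Optional[Callable[[List[Frame]], str]] = None,
) -> Tuple[str, ReasoningTrace]:
    """Run the enabled upstream modules, then query the downstream model."""
    trace = ReasoningTrace()
    if object_module is not None:
        trace.objects = object_module(frames)
    if scene_module is not None:
        trace.scene_context = scene_module(frames)

    context = trace.as_prompt()
    prompt = f"{context}\nQuestion: {question}" if context else f"Question: {question}"
    return answer_model(frames, prompt), trace


# Stub modules standing in for the reasoning models and the downstream model.
def stub_objects(frames: List[Frame]) -> List[str]:
    return ["dog", "ball"]


def stub_scene(frames: List[Frame]) -> str:
    return "a dog chases a ball in a park"


def echo_answerer(frames: List[Frame], prompt: str) -> str:
    # Returns the prompt itself so the effect of the enrichment is visible.
    return prompt


frames = ["frame_0", "frame_1"]
question = "What is the dog chasing?"
enriched, trace = upstream_qa(frames, question, echo_answerer, stub_objects, stub_scene)
baseline, _ = upstream_qa(frames, question, echo_answerer)
```

Returning the trace alongside the answer is one way to keep the intermediate reasoning inspectable, and making each module optional allows a with-and-without comparison against the baseline.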
