CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA

Vsevolod Kovalev; Parteek Kumar

arXiv:2512.00360·cs.CL·December 2, 2025

TL;DR

This paper introduces CourseTimeQA, a benchmark for timestamped question answering over lecture videos, and proposes CrossFusion-RAG, a lightweight cross-modal retriever that improves retrieval accuracy over a strong BLIP-2 retriever while keeping median end-to-end latency at approximately 1.55 s on a single A100 GPU.

Contribution

The paper presents a new lecture-video QA benchmark (52.3 h of video, 902 queries across six courses) and a lightweight cross-modal retrieval method designed to stay within a single-GPU latency and memory budget while improving ranking accuracy.

Findings

01. CrossFusion-RAG improves nDCG@10 by 0.10 and MRR by 0.08 over a strong BLIP-2 retriever.

02. It achieves approximately 1.55 s median end-to-end latency on a single A100 GPU.

03. It is robust to ASR noise, and the paper provides detailed diagnostics.
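For readers unfamiliar with the ranking metrics behind these numbers, the following is a minimal sketch of nDCG@10 and MRR as they are commonly defined. It is not the paper's evaluation code: the function names are hypothetical, the gain is linear (some implementations use 2^rel - 1), and the ideal ranking is computed from the retrieved list only, which is a simplification.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query; `relevances` is ordered by the system's ranking.

    Assumption: the ideal ordering is taken from the same list, so relevant
    items that were never retrieved are ignored.
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_reciprocal_rank(ranked_hits_per_query):
    """MRR over queries; each entry is a ranked list of booleans (True = relevant)."""
    if not ranked_hits_per_query:
        return 0.0
    total = 0.0
    for hits in ranked_hits_per_query:
        for rank, hit in enumerate(hits, start=1):
            if hit:
                total += 1.0 / rank  # only the first relevant item counts
                break
    return total / len(ranked_hits_per_query)
```

Under these definitions, an improvement of 0.10 in nDCG@10 is an absolute difference between two per-query averages that each lie in [0, 1].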

Abstract

We study timestamped question answering over educational lecture videos under a single-GPU latency/memory budget. Given a natural-language query, the system retrieves relevant timestamped segments and synthesizes a grounded answer. We present CourseTimeQA (52.3 h, 902 queries across six courses) and a lightweight, latency-constrained cross-modal retriever (CrossFusion-RAG) that combines frozen encoders, a learned 512->768 vision projection, shallow query-agnostic cross-attention over ASR and frames with a temporal-consistency regularizer, and a small cross-attentive reranker. On CourseTimeQA, CrossFusion-RAG improves nDCG@10 by 0.10 and MRR by 0.08 over a strong BLIP-2 retriever while achieving approximately 1.55 s median end-to-end latency on a single A100. Closest comparators (zero-shot CLIP multi-frame pooling; CLIP + cross-encoder reranker + MMR; learned late-fusion gating;…
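To make the fusion step in the abstract more concrete, here is a rough NumPy sketch of a 512->768 vision projection followed by one cross-attention pass in which ASR token embeddings attend over projected frame embeddings. Everything beyond the two dimensions is an assumption for illustration: the weights are random rather than learned, the attention is single-head with a residual connection, mean pooling produces the segment vector, and the temporal-consistency regularizer and reranker are omitted. It should not be read as the authors' architecture.

```python
import numpy as np

D_VISION, D_TEXT = 512, 768  # dimensions named in the abstract


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def project_frames(frames, w_proj):
    """Map frame embeddings from 512-d into the 768-d text space (linear layer)."""
    return frames @ w_proj


def cross_attend(asr_tokens, frame_feats):
    """Single-head scaled dot-product attention: ASR tokens query the frames.

    No query from the user is involved, so segment embeddings could be
    precomputed offline (one reading of "query-agnostic"; an assumption here).
    """
    scores = asr_tokens @ frame_feats.T / np.sqrt(asr_tokens.shape[-1])
    weights = softmax(scores, axis=-1)           # (n_tokens, n_frames)
    return asr_tokens + weights @ frame_feats    # residual fusion


# Hypothetical segment: 8 sampled frames and 20 ASR tokens, random stand-ins
# for the outputs of frozen encoders.
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, D_VISION))
asr_tokens = rng.standard_normal((20, D_TEXT))
w_proj = rng.standard_normal((D_VISION, D_TEXT)) / np.sqrt(D_VISION)

fused = cross_attend(asr_tokens, project_frames(frames, w_proj))
segment_embedding = fused.mean(axis=0)  # one 768-d vector per timestamped segment
```

In a retrieval setting, a vector like `segment_embedding` would be indexed per timestamped segment and compared against an encoded query at search time.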

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning