QCaption: Video Captioning and Q&A through Fusion of Large Multimodal Models

Jiale Wang; Gee Wah Ng; Lee Onn Mak; Randall Cher; Ng Ding Hei Ryan; Davis Wang

arXiv:2601.06566·cs.CV·January 13, 2026

Open Access

TL;DR

QCaption fuses key frame extraction, a large multimodal model, and a large language model to improve video captioning by up to 44.2% and video Q&A by up to 48.9%, while remaining self-contained for on-premises deployment.

Contribution

The paper introduces QCaption, a novel multimodal fusion pipeline that enhances video analytics by integrating key frame extraction, large multimodal models, and language models, with comprehensive benchmarking.

Findings

01 Up to 44.2% improvement in video captioning
02 Up to 48.9% improvement in video Q&A
03 Demonstrates effectiveness of model fusion in video analytics

Abstract

This paper introduces QCaption, a novel video captioning and Q&A pipeline that enhances video analytics by fusing three models: key frame extraction, a Large Multimodal Model (LMM) for image-text analysis, and a Large Language Model (LLM) for text analysis. This approach enables integrated analysis of text, images, and video and improves on existing video captioning and Q&A models, all while remaining fully self-contained and suited to on-premises deployment. In experiments, QCaption achieved improvements of up to 44.2% in video captioning and up to 48.9% in video Q&A. Ablation studies were also performed to assess how the LLM in the fusion affects the results. Moreover, the paper proposes and evaluates additional video captioning approaches, benchmarking them against QCaption and existing methodologies. QCaption demonstrates the potential of…
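The three-stage fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The frame-differencing extractor, the `Frame` type, and the names `extract_key_frames`, `describe_frame`, `fuse_text`, and `fusion_pipeline` are all hypothetical, and the LMM and LLM stages are passed in as plain callables so that no particular model or library is implied.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical representation: a frame is a flat sequence of pixel intensities.
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def extract_key_frames(frames: List[Frame], threshold: float) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame.

    Assumption: simple frame differencing stands in for the key frame
    extraction stage; the method used in the paper may be different.
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


def fusion_pipeline(
    frames: List[Frame],
    describe_frame: Callable[[Frame], str],                # stand-in for the LMM stage
    fuse_text: Callable[[List[str], Optional[str]], str],  # stand-in for the LLM stage
    question: Optional[str] = None,
    threshold: float = 10.0,
) -> str:
    """Key frames -> per-frame descriptions -> one fused caption or answer."""
    indices = extract_key_frames(frames, threshold)
    descriptions = [describe_frame(frames[i]) for i in indices]
    return fuse_text(descriptions, question)


if __name__ == "__main__":
    # Toy "video": five 4-pixel frames with two large brightness changes.
    video = [[0.0] * 4, [1.0] * 4, [50.0] * 4, [52.0] * 4, [120.0] * 4]
    caption = fusion_pipeline(
        video,
        describe_frame=lambda f: f"brightness={f[0]:.0f}",
        fuse_text=lambda descs, q: " | ".join(descs),
    )
    print(caption)
```

In a real deployment the two callables would wrap locally hosted models. Passing `question` through to the text-fusion stage is one way the same pipeline could serve both captioning (no question) and Q&A.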

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques