Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform

Suyash Mishra; Qiang Li; Srikanth Patil; Satyanarayan Pati; Baddu Narendra

arXiv:2601.04891·cs.CV·January 9, 2026

TL;DR

This paper evaluates the scalability and performance of vision language models in processing long-form pharmaceutical videos under industrial constraints, revealing key trade-offs, limitations, and practical insights for deployment.

Contribution

It introduces an industrial-scale architecture for multimodal reasoning, analyzes over 40 VLMs on benchmarks and proprietary data, and identifies critical factors affecting long-form video understanding.

Findings

01. SDPA attention improves efficiency 3-8x on commodity GPUs

02. Multimodality improves performance in up to 8 of 12 task domains

03. Temporal reasoning and video splitting pose significant challenges

Abstract

Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4 and M4V), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) a large-scale industrial architecture for multimodal reasoning in pharmaceutical domains; (ii) an empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii)…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications