Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering

Md Intisar Chowdhury; Kittinun Aukkapinyo; Hiroshi Fujimura; Joo Ann Woo; Wasu Wasusatein; Fadoua Ghourabi

arXiv:2505.24371 · cs.CV · July 29, 2025

TL;DR

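Grid-LoGAT introduces a grid-based approach for extracting detailed local and global visual information from video frames as text transcripts, improving VideoQA performance. It preserves image privacy by running the Vision-Language Model on edge devices, while the Large Language Model that answers questions from the transcripts runs in the cloud.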

Contribution

The paper presents a grid-based visual prompting method for VideoQA that improves transcript quality and answer accuracy over the non-grid version of the system.

Findings

01

Outperforms state-of-the-art methods with similar baseline models on the NExT-QA and STAR-QA datasets.

02

Achieves 65.9% accuracy on NExT-QA and 50.11% on STAR-QA.

03

Surpasses the non-grid version by 24 points on localization questions.

Abstract

In this paper, we propose a Grid-based Local and Global Area Transcription (Grid-LoGAT) system for Video Question Answering (VideoQA). The system operates in two phases: first, it extracts text transcripts from video frames using a Vision-Language Model (VLM); next, it processes questions against these transcripts to generate answers with a Large Language Model (LLM). This design ensures image privacy by deploying the VLM on edge devices and the LLM in the cloud. To improve transcript quality, we propose grid-based visual prompting, which extracts intricate local details from each grid cell and integrates them with global information. Evaluation results show that Grid-LoGAT, using the open-source VLM (LLaVA-1.6-7B) and LLM (Llama-3.1-8B), outperforms state-of-the-art methods with similar baseline models on the NExT-QA and STAR-QA datasets, with accuracies of 65.9% and 50.11% respectively.…
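The two-phase flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 2x2 default grid, the transcript layout, and the `describe` and `ask` callables (stand-ins for the VLM and the LLM) are all assumptions made for illustration. The abstract also does not say how the grid is presented to the VLM, so the sketch simply passes each cell's bounding box to the stand-in.

```python
from typing import Callable, List, Optional, Tuple

# (left, top, right, bottom) in pixels; a hypothetical representation.
Box = Tuple[int, int, int, int]


def grid_cells(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height frame into rows x cols boxes, in row-major order."""
    boxes: List[Box] = []
    for r in range(rows):
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((left, top, right, bottom))
    return boxes


def transcribe_frame(
    describe: Callable[[Optional[Box]], str],
    width: int,
    height: int,
    rows: int = 2,  # assumed grid size, not taken from the paper
    cols: int = 2,
) -> str:
    """Phase 1 (edge side): one global description plus one per grid cell.

    `describe` stands in for the VLM: None means the whole frame,
    a Box means a single grid cell.
    """
    lines = ["global: " + describe(None)]
    for i, box in enumerate(grid_cells(width, height, rows, cols)):
        lines.append(f"cell {i} {box}: " + describe(box))
    return "\n".join(lines)


def answer_question(
    ask: Callable[[str], str],
    transcripts: List[str],
    question: str,
) -> str:
    """Phase 2 (cloud side): the LLM stand-in `ask` receives text only."""
    prompt = "\n\n".join(transcripts) + "\n\nQuestion: " + question
    return ask(prompt)
```

In this sketch the only data passed from `transcribe_frame` to `answer_question` is a string, which mirrors the privacy argument in the abstract: frames stay with the edge-side model and only transcripts reach the cloud-side model.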

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning