Semantic Event Graphs for Long-Form Video Question Answering

Aradhya Dixit; Tianxi Liang

arXiv:2601.06097 · cs.CV · January 13, 2026

TL;DR

This paper introduces Semantic Event Graphs, a symbolic representation for long-form video question answering that captures temporal interactions compactly. It reduces token usage by 91.4% compared to full-log baselines while reaching 65.0% accuracy on long videos.

Contribution

The authors propose Semantic Event Graphs as a novel, lightweight symbolic interface that improves long-form video reasoning efficiency for vision-language models.

Findings

01. Achieves 65.0% accuracy on long videos with only 3.47k tokens per query.

02. Reduces token usage by 91.4% compared to full-log baselines.

03. Outperforms short-context baselines by maintaining long-range reasoning.

Abstract

Long-form video question answering remains challenging for modern vision-language models, which struggle to reason over hour-scale footage without exceeding practical token and compute budgets. Existing systems typically downsample frames or feed dense visual embeddings to large-context language models, trading off temporal coverage against cost. We propose Semantic Event Graphs (SEG), a lightweight symbolic interface between video and language that replaces raw frames with compact temporal interaction logs. Our pipeline detects and tracks objects with YOLOv11, converts proximity patterns into START/END human-object events, and organizes them into a Temporal Scene Graph (TSG). At inference time, a query-aware pruning module identifies anchor entities and lexically relevant events, returning only a small subgraph which is verbalized and passed to Gemini 2.5 Flash for answer generation.…
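The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the detector and tracker are replaced by hand-written tracks, the pixel `radius` threshold is an assumed proximity rule, the pruning step is a plain keyword match standing in for the query-aware module, and every name (`Event`, `extract_events`, `prune`, `verbalize`) is hypothetical.

```python
import math
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float      # timestamp in seconds
    kind: str     # "START" or "END"
    person: str   # track id of the person (hypothetical format)
    obj: str      # class label of the object


def extract_events(frames, radius=50.0):
    """Turn per-frame tracks into START/END interaction events.

    frames: list of (t, {track_id: (label, x, y)}). In a real system the
    tracks would come from a detector/tracker; here they are given directly.
    radius is an assumed pixel-distance threshold for "near".
    """
    active = set()   # (person_id, object_id, label) pairs currently near
    events = []
    for t, tracks in frames:
        people = {k: v for k, v in tracks.items() if v[0] == "person"}
        objects = {k: v for k, v in tracks.items() if v[0] != "person"}
        near = set()
        for pid, (_, px, py) in people.items():
            for oid, (label, ox, oy) in objects.items():
                if math.hypot(px - ox, py - oy) <= radius:
                    near.add((pid, oid, label))
        # Pairs that just became near open an interaction.
        for pid, _, label in sorted(near - active):
            events.append(Event(t, "START", pid, label))
        # Pairs that were near and no longer are close an interaction.
        for pid, _, label in sorted(active - near):
            events.append(Event(t, "END", pid, label))
        active = near
    return events


def prune(events, question):
    """Keep only events whose object label appears in the question.

    A simple lexical stand-in for query-aware pruning; it handles
    single-word labels only.
    """
    words = set(re.findall(r"[a-z]+", question.lower()))
    anchors = {e.obj for e in events if e.obj in words}
    return [e for e in events if e.obj in anchors]


def verbalize(events):
    """Render events as short text lines suitable for a language model prompt."""
    return "\n".join(
        f"[{e.t:.1f}s] {e.kind} {e.person} near {e.obj}" for e in events
    )


frames = [
    (0.0, {"p1": ("person", 0, 0), "o1": ("cup", 200, 0), "o2": ("laptop", 10, 0)}),
    (1.0, {"p1": ("person", 180, 0), "o1": ("cup", 200, 0), "o2": ("laptop", 10, 0)}),
    (2.0, {"p1": ("person", 0, 0), "o1": ("cup", 200, 0), "o2": ("laptop", 10, 0)}),
]
events = extract_events(frames)
pruned = prune(events, "When was the person near the cup?")
print(verbalize(pruned))
```

In this toy run the full log holds five events, and the question about the cup keeps only the two cup events, which is the kind of reduction the pruning step is meant to provide before the text is handed to the language model.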

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks