Semantic Event Graphs for Long-Form Video Question Answering
Aradhya Dixit, Tianxi Liang

TL;DR
This paper introduces Semantic Event Graphs (SEG), a symbolic representation for long-form video question answering that replaces raw frames with compact logs of temporal interactions, cutting token usage by 91.4% relative to full-log baselines while reaching 65.0% accuracy on long videos.
Contribution
The authors propose Semantic Event Graphs as a novel, lightweight symbolic interface that improves long-form video reasoning efficiency for vision-language models.
Findings
Achieves 65.0% accuracy on long videos with only 3.47k tokens per query.
Reduces token usage by 91.4% compared to full-log baselines.
Outperforms short-context baselines by preserving long-range reasoning.
Abstract
Long-form video question answering remains challenging for modern vision-language models, which struggle to reason over hour-scale footage without exceeding practical token and compute budgets. Existing systems typically downsample frames or feed dense visual embeddings to large-context language models, trading off temporal coverage against cost. We propose Semantic Event Graphs (SEG), a lightweight symbolic interface between video and language that replaces raw frames with compact temporal interaction logs. Our pipeline detects and tracks objects with YOLOv11, converts proximity patterns into START/END human-object events, and organizes them into a Temporal Scene Graph (TSG). At inference time, a query-aware pruning module identifies anchor entities and lexically relevant events, returning only a small subgraph which is verbalized and passed to Gemini 2.5 Flash for answer generation.…
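The event-extraction and pruning steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the track-ID format (`person_1`, `cup_1`), the centre-distance test, the `threshold` value, and the function names `proximity_events`, `prune`, and `verbalize` are all assumptions made for illustration, and the detector/tracker output is replaced by hand-written positions.

```python
import re
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Event:
    kind: str    # "START" or "END"
    frame: int   # frame index at which the transition is observed
    person: str  # hypothetical track id, e.g. "person_1"
    obj: str     # hypothetical track id, e.g. "cup_1"


def proximity_events(tracks, threshold=50.0):
    """Turn per-frame track centres into START/END human-object events.

    `tracks` is a list with one dict per frame, mapping track id -> (x, y).
    Assumption: ids starting with "person" are humans, all others are objects.
    """
    events, active = [], set()
    for frame, positions in enumerate(tracks):
        people = [t for t in positions if t.startswith("person")]
        objects = [t for t in positions if not t.startswith("person")]
        near = set()
        for p in people:
            for o in objects:
                (px, py), (ox, oy) = positions[p], positions[o]
                if hypot(px - ox, py - oy) <= threshold:
                    near.add((p, o))
        # Pairs that became close start an interaction; pairs that separated end one.
        events += [Event("START", frame, p, o) for p, o in sorted(near - active)]
        events += [Event("END", frame, p, o) for p, o in sorted(active - near)]
        active = near
    # Close any interaction still open when the video ends.
    events += [Event("END", len(tracks), p, o) for p, o in sorted(active)]
    return events


def prune(events, question):
    """Keep only events whose object class name appears in the question.

    A simple lexical stand-in for query-aware pruning (assumption).
    """
    words = set(re.findall(r"[a-z0-9]+", question.lower()))
    return [e for e in events if e.obj.split("_")[0] in words]


def verbalize(events):
    """Render events as short text lines suitable for a language-model prompt."""
    return "\n".join(
        f"[frame {e.frame}] {e.kind} {e.person} <-> {e.obj}" for e in events
    )
```

In this sketch the pruned, verbalized log is what would be sent to the language model in place of the full event log or raw frames; a real system would also need to handle tracker ID switches and brief detection dropouts, which are ignored here.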
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
