From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge

Agnese Taluzzi; Davide Gesualdi; Riccardo Santambrogio; Chiara Plizzari; Francesca Palermo; Simone Mentasti; Matteo Matteucci

arXiv:2506.08553 · cs.CV · June 11, 2025

Open Access

TL;DR

This paper introduces SceneNet and KnowledgeNet, two approaches that use scene graphs and knowledge graphs for egocentric visual question answering. Combined, they achieve 44.21% overall accuracy on the HD-EPIC VQA Challenge.

Contribution

The paper presents two methods for VQA on egocentric video: SceneNet, which builds scene graphs with a multi-modal large language model (MLLM), and KnowledgeNet, which adds external commonsense knowledge from ConceptNet.

Findings

01. SceneNet captures fine-grained object interactions, spatial relationships, and temporally grounded events.

02. KnowledgeNet uses ConceptNet's external commonsense knowledge to reason beyond directly observable visual evidence.

03. The combined approach achieves 44.21% overall accuracy on the HD-EPIC VQA Challenge.

Abstract

This report presents SceneNet and KnowledgeNet, our approaches developed for the HD-EPIC VQA Challenge 2025. SceneNet leverages scene graphs generated with a multi-modal large language model (MLLM) to capture fine-grained object interactions, spatial relationships, and temporally grounded events. In parallel, KnowledgeNet incorporates ConceptNet's external commonsense knowledge to introduce high-level semantic connections between entities, enabling reasoning beyond directly observable visual evidence. Each method demonstrates distinct strengths across the seven categories of the HD-EPIC benchmark, and their combination within our framework results in an overall accuracy of 44.21% on the challenge, highlighting its effectiveness for complex egocentric VQA tasks.
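To make the two graph types in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a scene graph can be stored as time-stamped (subject, relation, object) triples and that commonsense knowledge can be looked up per entity. The `Triple` class, the example clip contents, and the `COMMONSENSE` dictionary are all hypothetical; the dictionary is a hand-written stand-in that only borrows ConceptNet-style relation names and does not query ConceptNet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One scene-graph edge, grounded in a time interval (seconds)."""
    subject: str
    relation: str
    obj: str
    start: float
    end: float


# Hypothetical scene graph for a short kitchen clip (illustrative data only).
SCENE_GRAPH = [
    Triple("person", "picks_up", "knife", 2.0, 3.5),
    Triple("knife", "on", "cutting_board", 0.0, 2.0),
    Triple("person", "cuts", "onion", 4.0, 9.0),
]

# Hand-written stand-in for commonsense edges, keyed by (entity, relation).
# The relation names mimic ConceptNet's style; nothing is fetched from it.
COMMONSENSE = {
    ("knife", "UsedFor"): ["cutting"],
    ("onion", "AtLocation"): ["kitchen"],
}


def events_between(graph, t0, t1):
    """Return the triples whose time interval overlaps [t0, t1]."""
    return [t for t in graph if t.start <= t1 and t.end >= t0]


def enrich(entity):
    """Return the commonsense relations known for one entity."""
    return {
        rel: targets
        for (ent, rel), targets in COMMONSENSE.items()
        if ent == entity
    }


def to_prompt(graph, t0, t1):
    """Serialize the overlapping triples as text lines for a language model."""
    return "\n".join(
        f"[{t.start:.1f}-{t.end:.1f}s] {t.subject} {t.relation} {t.obj}"
        for t in events_between(graph, t0, t1)
    )
```

Calling `to_prompt(SCENE_GRAPH, 3.0, 5.0)` yields one text line per event overlapping that window, and `enrich("knife")` returns the stand-in commonsense facts for an entity mentioned in those events. How the actual system serializes graphs or selects knowledge edges is not specified in the abstract.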

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Graph Neural Networks