HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model
Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le

TL;DR
HENASY introduces a hierarchical, entity-based approach for egocentric video-language modeling, enhancing interpretability and fine-grained understanding by explicitly assembling scene entities and modeling their relationships over time.
Contribution
The paper proposes HENASY, a novel compositional framework that explicitly assembles scene entities in egocentric videos, improving interpretability and fine-grained multimodal understanding.
Findings
Strong interpretability demonstrated through visual grounding.
Competitive performance on five downstream tasks.
Effective entity-centric understanding with multi-grained contrastive losses.
Abstract
Current video-language models (VLMs) rely extensively on instance-level alignment between video and language modalities, which presents two major limitations: (1) visual reasoning disobeys the natural perception that humans do in first-person perspective, leading to a lack of reasoning interpretation; and (2) learning is limited in capturing inherent fine-grained relationships between two modalities. In this paper, we take an inspiration from human perception and explore a compositional approach for egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism to explicitly assemble dynamically evolving scene entities through time and model their relationship for video representation. By leveraging compositional structure understanding, HENASY possesses strong interpretability via visual grounding with…
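For context, the instance-level alignment that the abstract contrasts HENASY against is commonly implemented as a symmetric contrastive (InfoNCE) loss over whole-clip and whole-caption embeddings. The sketch below illustrates that generic baseline only; it is not HENASY's multi-grained objective, which this page does not specify. The function names, the toy embeddings, and the temperature of 0.07 are assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings.

    Pair i is the positive; every other pairing in the batch is a negative.
    The temperature default is an illustrative assumption.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # sim[i][j] = cosine similarity of video i and text j, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_cross_entropy_on_diagonal(rows):
        # Cross-entropy where the correct class of row i is column i,
        # using the max-subtraction trick for a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    video_to_text = mean_cross_entropy_on_diagonal(sim)
    text_to_video = mean_cross_entropy_on_diagonal([list(c) for c in zip(*sim)])
    return 0.5 * (video_to_text + text_to_video)


# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings with the captions swapped.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because this loss sees only one vector per clip and one per caption, it has no notion of which object or region a word refers to, which is the limitation the abstract attributes to instance-level alignment.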
Taxonomy
Topics: Narrative Theory and Analysis
