HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model

Khoa Vo; Thinh Phan; Kashu Yamazaki; Minh Tran; Ngan Le

arXiv:2406.00307 · cs.CV · November 4, 2024

TL;DR

HENASY introduces a hierarchical, entity-based approach for egocentric video-language modeling, enhancing interpretability and fine-grained understanding by explicitly assembling scene entities and modeling their relationships over time.

Contribution

The paper proposes HENASY, a novel compositional framework that explicitly assembles scene entities in egocentric videos, improving interpretability and fine-grained multimodal understanding.

Findings

1. Strong interpretability demonstrated through visual grounding.

2. Competitive performance on five downstream tasks.

3. Effective entity-centric understanding with multi-grained contrastive losses.
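To make the contrastive losses in the last finding more concrete, here is a minimal sketch of a symmetric InfoNCE-style loss between paired video and text embeddings. It shows one level of alignment only. This summary does not say which granularities the paper's losses cover or how they are weighted, so the function name `symmetric_info_nce`, the `temperature` value, and the toy embeddings are all assumptions made for illustration, not the authors' formulation.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video cross-entropy.

    Row i of video_embs is assumed to be paired with row i of text_embs;
    every other row in the batch acts as a negative.
    (Hypothetical helper; the temperature is an illustrative default.)
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text: the matching text sits on the diagonal.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2v = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each video embedding is closest to its own text and large when the pairs are mismatched. A multi-grained variant could, in principle, apply the same form at several levels (for example whole clip versus individual entity) and sum the terms, though that combination is a guess rather than something stated here.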

Abstract

Current video-language models (VLMs) rely extensively on instance-level alignment between the video and language modalities, which presents two major limitations: (1) their visual reasoning departs from the way humans naturally perceive a scene from a first-person perspective, which leaves the reasoning hard to interpret; and (2) learning captures little of the inherent fine-grained relationships between the two modalities. In this paper, we take inspiration from human perception and explore a compositional approach to egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism that explicitly assembles dynamically evolving scene entities through time and models their relationships for video representation. By leveraging compositional structure understanding, HENASY possesses strong interpretability via visual grounding with…
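The token grouping idea in the abstract can be illustrated with a small sketch: each visual token is softly assigned to one of a few group queries, and each group's tokens are averaged into a single entity vector. This is a generic soft-assignment scheme under assumed names (`group_tokens`, `group_queries`), with toy 2-D tokens and no temporal dimension. The abstract does not specify the paper's actual mechanism, so treat this as an analogy rather than the HENASY architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def group_tokens(tokens, group_queries):
    """Softly assign tokens to groups and pool them into entity vectors.

    Returns (assignment, entities), where assignment[i][g] is the weight of
    token i in group g (each row sums to 1) and entities[g] is the
    assignment-weighted mean of all tokens for group g.
    (Hypothetical helper for illustration only.)
    """
    assignment = [softmax([dot(tok, q) for q in group_queries]) for tok in tokens]
    dim = len(tokens[0])
    entities = []
    for g in range(len(group_queries)):
        weights = [row[g] for row in assignment]
        total = sum(weights)  # strictly positive because softmax weights are
        entities.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) / total
             for d in range(dim)]
        )
    return assignment, entities


# Toy input: two tokens point along the first axis, two along the second.
tokens = [[5.0, 0.0], [4.0, 0.0], [0.0, 5.0], [0.0, 4.0]]
group_queries = [[1.0, 0.0], [0.0, 1.0]]
assignment, entities = group_tokens(tokens, group_queries)
# Index of the dominant group for each token.
print([row.index(max(row)) for row in assignment])
```

In a scheme like this, the assignment matrix records which tokens contribute to which entity. A grouping-based model can expose that kind of signal as visual grounding, which is one way to read the interpretability claim.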

Taxonomy

Topics: Narrative Theory and Analysis