AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding

Alessandro Suglia; Claudio Greco; Katie Baker; Jose L. Part; Ioannis Papaioannou; Arash Eshghi; Ioannis Konstas; Oliver Lemon

arXiv:2406.13807·cs.CV·June 24, 2024·1 cites

TL;DR

AlanaVLM is a 7-billion-parameter multimodal model trained on egocentric videos, achieving state-of-the-art results in embodied video question answering and advancing AI's ability to understand human-centric environments.

Contribution

The paper introduces the EVUD dataset, proposes the AlanaVLM model, and evaluates it on OpenEQA, an embodied video question answering benchmark, where it outperforms open-source models.

Findings

01

Outperforms open-source models on OpenEQA, including strong Socratic models that use GPT-4 as a planner, by 3.6%.

02

Outperforms Claude 3 and Gemini Pro Vision 1.0 in embodied video question answering.

03

Shows competitive results with GPT-4V, especially in spatial reasoning.

Abstract

AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we…
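The abstract says AlanaVLM was trained "using parameter-efficient methods" but, as quoted here, does not name the method. As a rough illustration of why such methods matter at the 7B scale, the sketch below assumes a LoRA-style low-rank adapter and compares the trainable weights of one linear layer under full fine-tuning and under a rank-r update. The layer size, the rank, and the function names are hypothetical and are not taken from the paper.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def low_rank_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a rank-r update W + B @ A,
    where A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration;
# the paper's actual configuration is not stated in the abstract.
D_IN, D_OUT, RANK = 4096, 4096, 16

full = full_finetune_params(D_IN, D_OUT)
adapter = low_rank_adapter_params(D_IN, D_OUT, RANK)
fraction = adapter / full

print(f"full: {full:,}  adapter: {adapter:,}  fraction: {fraction:.4%}")
```

Under these assumed sizes the adapter trains well under 1% of the layer's weights, which is the general motivation for parameter-efficient fine-tuning; the savings for the real model depend on choices the abstract does not specify.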

Code & Models

Repositories

alanaai/evud
Official

Models

🤗
AlanaAI/AlanaVLM
model

Datasets

AlanaAI/EVUD
dataset · 90 dl

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer · Absolute Position Encodings