LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos
Ying Wang, Yanlai Yang, Mengye Ren

TL;DR
LifelongMemory is a novel framework that uses large language models and video descriptions to answer questions about long egocentric videos, providing interpretable and confident responses.
Contribution
It introduces LifelongMemory, combining video description generation with LLM reasoning for improved long-form egocentric video question answering.
Findings
Achieves state-of-the-art performance on the EgoSchema benchmark.
Performs competitively on the Ego4D NLQ challenge.
Provides interpretable confidence and explanation modules.
Abstract
In this paper we introduce LifelongMemory, a new framework for accessing long-form egocentric videographic memory through natural language question answering and retrieval. LifelongMemory generates concise video activity descriptions of the camera wearer and leverages the zero-shot capabilities of pretrained large language models to perform reasoning over long-form video context. Furthermore, LifelongMemory uses a confidence and explanation module to produce confident, high-quality, and interpretable answers. Our approach achieves state-of-the-art performance on the EgoSchema benchmark for question answering and is highly competitive on the natural language query (NLQ) challenge of Ego4D. Code is available at https://github.com/agentic-learning-ai-lab/lifelong-memory.
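The flow described in the abstract (caption the video, hand the captions to a pretrained LLM, and ask for an answer together with a confidence and an explanation) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Caption`, `Answer`, `condense`, `build_prompt`, `answer_question`, and `fake_llm` are hypothetical, as are the prompt wording, the JSON reply format, and the 1-3 confidence scale. The LLM is passed in as a plain callable so the sketch runs without any external service.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Caption:
    start_s: float  # start of the captioned clip, in seconds
    end_s: float    # end of the captioned clip, in seconds
    text: str       # description of the camera wearer's activity


@dataclass
class Answer:
    choice: int       # index of the selected option
    confidence: int   # assumed scale: 1 (low) to 3 (high)
    explanation: str  # the model's stated rationale


def condense(captions: List[Caption]) -> List[Caption]:
    """Merge consecutive captions with identical text into one interval."""
    merged: List[Caption] = []
    for c in captions:
        if merged and merged[-1].text == c.text:
            merged[-1] = Caption(merged[-1].start_s, c.end_s, c.text)
        else:
            merged.append(c)
    return merged


def build_prompt(captions: List[Caption], question: str, options: List[str]) -> str:
    """Lay out timestamped captions, the question, and the answer options."""
    lines = [f"[{c.start_s:.0f}s-{c.end_s:.0f}s] {c.text}" for c in captions]
    opts = [f"{i}. {o}" for i, o in enumerate(options)]
    return (
        "Captions of an egocentric video:\n" + "\n".join(lines)
        + f"\n\nQuestion: {question}\nOptions:\n" + "\n".join(opts)
        + '\n\nReply as JSON with keys "choice", "confidence" (1-3), "explanation".'
    )


def answer_question(
    captions: List[Caption],
    question: str,
    options: List[str],
    llm: Callable[[str], str],
) -> Answer:
    """Query the LLM over condensed captions and parse its structured reply."""
    prompt = build_prompt(condense(captions), question, options)
    reply = json.loads(llm(prompt))
    return Answer(int(reply["choice"]), int(reply["confidence"]), str(reply["explanation"]))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): always returns a fixed reply."""
    return json.dumps(
        {"choice": 0, "confidence": 3, "explanation": "C handles a knife and an onion."}
    )


if __name__ == "__main__":
    caps = [
        Caption(0, 2, "C picks up a knife"),
        Caption(2, 4, "C picks up a knife"),
        Caption(4, 6, "C chops an onion"),
    ]
    print(answer_question(caps, "What is C doing?", ["cooking", "cleaning"], fake_llm))
```

Asking for the confidence and explanation in the same reply as the answer lets a caller discard or flag low-confidence answers and inspect the rationale, which is the role the abstract attributes to the confidence and explanation module.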
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
