How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering?
Giuseppe Lando, Rosario Forte, Giovanni Maria Farinella, Antonino Furnari

TL;DR
This paper demonstrates that off-the-shelf multimodal large language models can perform online episodic memory video question answering efficiently without additional training, achieving competitive accuracy with minimal memory usage.
Contribution
The study introduces a novel pipeline converting streaming egocentric videos into lightweight textual memories for question answering, showing competitive performance without extra training.
Findings
Achieves 56.0% accuracy on the QAEgo4D-Closed benchmark.
Uses only 3.6 kB of storage per minute of video, roughly 10^4 to 10^5 times less than dedicated state-of-the-art systems.
Provides detailed ablations and insights into system components.
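To put the storage figure in perspective, the short sketch below works through the arithmetic. It takes the 3.6 kB per minute figure and the 10^4 to 10^5 efficiency factor reported in the abstract; the 1 kB = 1000 bytes convention and the one-hour video length are assumptions made only for illustration, and the per-minute footprint of the comparison systems is merely what those two reported numbers imply, not a value stated by the paper.

```python
# Reported figure: textual memory footprint per minute of video.
KB_PER_MINUTE = 3.6

# Reported range for the memory-efficiency factor over dedicated systems.
EFFICIENCY_FACTORS = (1e4, 1e5)


def memory_kb(minutes: float) -> float:
    """Textual memory size in kB for a video of the given length."""
    return KB_PER_MINUTE * minutes


def implied_baseline_mb_per_minute(factor: float) -> float:
    """Per-minute footprint (MB) implied for a system `factor` times larger.

    Assumes 1 MB = 1000 kB; this is an illustration, not a reported number.
    """
    return KB_PER_MINUTE * factor / 1000.0


if __name__ == "__main__":
    # A hypothetical one-hour egocentric video.
    print(f"one hour of video: {memory_kb(60):.0f} kB of text")
    for factor in EFFICIENCY_FACTORS:
        mb = implied_baseline_mb_per_minute(factor)
        print(f"factor {factor:.0e}: about {mb:.0f} MB per minute implied")
```

Under these assumptions an hour of video costs on the order of a couple of hundred kilobytes of text, while the implied footprint of the comparison systems is tens to hundreds of megabytes per minute.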
Abstract
We investigate whether off-the-shelf Multimodal Large Language Models (MLLMs) can tackle Online Episodic-Memory Video Question Answering (OEM-VQA) without additional training. Our pipeline converts a streaming egocentric video into a lightweight textual memory, only a few kilobytes per minute, via an MLLM descriptor module, and answers multiple-choice questions by querying this memory with an LLM reasoner module. On the QAEgo4D-Closed benchmark, our best configuration attains 56.0% accuracy with 3.6 kB per minute of storage, matching the performance of dedicated state-of-the-art systems while being 10^4 to 10^5 times more memory-efficient. Extensive ablations provide insights into the role of each component and design choice, and highlight directions for improvement in future research.
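The two-module design described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `TextMemory`, `build_memory`, and `answer_question` are hypothetical, the descriptor and reasoner are passed in as plain callables standing in for an MLLM and an LLM, and the fixed 10-second clip length and the prompt wording are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

LETTERS = "ABCDEFGH"


@dataclass
class TextMemory:
    """Append-only textual memory: one timestamped caption per clip."""
    entries: List[str] = field(default_factory=list)

    def add(self, start_s: float, end_s: float, caption: str) -> None:
        self.entries.append(f"[{start_s:.0f}s-{end_s:.0f}s] {caption}")

    def as_text(self) -> str:
        return "\n".join(self.entries)

    def size_bytes(self) -> int:
        return len(self.as_text().encode("utf-8"))


def build_memory(
    clips: Sequence[object],
    describe: Callable[[object], str],
    clip_len_s: float = 10.0,  # assumed clip length, not from the paper
) -> TextMemory:
    """Process clips in arrival order; only the captions are kept."""
    memory = TextMemory()
    for i, clip in enumerate(clips):
        memory.add(i * clip_len_s, (i + 1) * clip_len_s, describe(clip))
    return memory


def answer_question(
    memory: TextMemory,
    question: str,
    options: Sequence[str],
    reason: Callable[[str], str],
) -> int:
    """Query the memory with a reasoner; return the option index or -1."""
    listed = "\n".join(f"{LETTERS[i]}. {o}" for i, o in enumerate(options))
    prompt = (
        f"Memory:\n{memory.as_text()}\n\n"
        f"Question: {question}\n{listed}\n"
        "Answer with a single letter."
    )
    reply = reason(prompt).strip().upper()
    first = reply[:1]
    # Only the first character of the reply is parsed in this sketch.
    if first and first in LETTERS[: len(options)]:
        return LETTERS.index(first)
    return -1


if __name__ == "__main__":
    # Stub models so the sketch runs without any external service.
    fake_describe = lambda clip: f"The wearer handles the {clip}."
    fake_reason = lambda prompt: "B"

    mem = build_memory(["keys", "cup"], fake_describe)
    choice = answer_question(
        mem, "What did I pick up last?", ["keys", "cup"], fake_reason
    )
    print(mem.as_text())
    print("chosen option index:", choice, "| memory bytes:", mem.size_bytes())
```

Because the raw frames are discarded once each clip has been described, the memory grows only with the length of the captions, which is what keeps the footprint in the kilobyte range.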
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
