Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge
Giuseppe Lando, Rosario Forte, Antonino Furnari

TL;DR
This paper explores the use of multimodal large language models for real-time episodic memory question answering on edge devices, balancing privacy, latency, and accuracy.
Contribution
It introduces a two-thread asynchronous pipeline that integrates streaming constraints for edge-based episodic memory question answering, and reports promising experimental results.
Findings
Edge implementation achieves 51.76% accuracy on a consumer GPU.
Scaling to enterprise hardware improves accuracy to 54.40%.
Edge solutions are competitive with cloud-based approaches.
Abstract
We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants; hence, we investigate an implementation on the edge. We integrated streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves…
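The two-thread design described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `TextualMemory`, `descriptor_worker`, and `qa_worker` are hypothetical, and the `describe` and `answer` callables are placeholders for the model calls, which are not specified here. Video clips are represented as plain strings for simplicity.

```python
import queue
import threading
from typing import Callable, Iterable, List


class TextualMemory:
    """Hypothetical thread-safe, append-only store of short text descriptions."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: List[str] = []

    def append(self, entry: str) -> None:
        with self._lock:
            self._entries.append(entry)

    def snapshot(self) -> List[str]:
        # Return a copy so the reader never sees a list that is being mutated.
        with self._lock:
            return list(self._entries)


def descriptor_worker(
    clips: Iterable[str],
    describe: Callable[[str], str],
    memory: TextualMemory,
    done: threading.Event,
) -> None:
    """Descriptor thread: turn each incoming clip into a line of textual memory."""
    for index, clip in enumerate(clips):
        # `describe` stands in for a captioning model (an assumption).
        memory.append(f"[clip {index}] {describe(clip)}")
    done.set()  # signal that the stream has ended


def qa_worker(
    questions: "queue.Queue",
    answer: Callable[[str, List[str]], str],
    memory: TextualMemory,
    answers: "queue.Queue",
) -> None:
    """QA thread: answer each question from the memory available at that moment."""
    while True:
        question = questions.get()
        if question is None:  # sentinel value: stop the thread
            break
        # `answer` stands in for a language model reasoning over text only.
        answers.put(answer(question, memory.snapshot()))
```

In this sketch the QA thread never touches video frames; it only reads the text accumulated so far, which is what keeps the memory lightweight. A question submitted while the stream is still running would simply be answered from a partial snapshot.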
