Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

Giuseppe Lando; Rosario Forte; Antonino Furnari

arXiv:2602.22455·cs.CV·May 14, 2026

TL;DR

This paper explores the use of multimodal large language models for real-time episodic memory question answering on edge devices, balancing privacy, latency, and accuracy.

Contribution

It introduces a question answering pipeline built from two asynchronous threads that operates under streaming constraints on edge hardware, and reports promising experimental results.

Findings

01. Edge implementation achieves 51.76% accuracy on a consumer GPU.

02. Scaling to enterprise hardware improves accuracy to 54.40%.

03. Edge solutions are competitive with cloud-based approaches.

Abstract

We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. Cloud offloading is common, but it raises privacy and latency concerns for wearable assistants, so we investigate implementation on the edge. We integrate streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries and show promising results, even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves…
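
The two-thread structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `TextMemory`, `describe_clip`, `answer`, and `run_demo` are hypothetical names, the two model calls are replaced by trivial string stubs, and the multiple-choice question format is an assumption about what a closed benchmark looks like.

```python
import queue
import threading


class TextMemory:
    """Hypothetical thread-safe, append-only log of (timestamp, description)."""

    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()

    def append(self, timestamp, text):
        with self._lock:
            self._entries.append((timestamp, text))

    def snapshot(self):
        # Copy under the lock so the QA thread reads a consistent view.
        with self._lock:
            return list(self._entries)


def describe_clip(clip):
    # Stub standing in for an MLLM that turns video into text (assumption).
    return f"the wearer sees {clip}"


def answer(question, entries, options):
    # Stub standing in for the QA model (assumption): find the most recent
    # memory entry sharing a word with the question, then return the option
    # that literally occurs in that entry.
    keywords = {w.strip("?.,").lower() for w in question.split()}
    for _, text in reversed(entries):
        if keywords & set(text.lower().split()):
            for option in options:
                if option.lower() in text.lower():
                    return option
    return options[0]


def descriptor_worker(clips, memory):
    # Descriptor Thread: continuously convert incoming clips into text.
    timestamp = 0
    while (clip := clips.get()) is not None:
        memory.append(timestamp, describe_clip(clip))
        timestamp += 1


def qa_worker(questions, memory, answers):
    # QA Thread: answer each query from the textual memory only.
    while (item := questions.get()) is not None:
        question, options = item
        answers.append((question, answer(question, memory.snapshot(), options)))


def run_demo(clips, questions):
    memory = TextMemory()
    clip_q, question_q, answers = queue.Queue(), queue.Queue(), []
    descriptor = threading.Thread(target=descriptor_worker, args=(clip_q, memory))
    qa = threading.Thread(target=qa_worker, args=(question_q, memory, answers))
    descriptor.start()
    qa.start()
    for clip in clips:
        clip_q.put(clip)
    clip_q.put(None)       # sentinel: end of the video stream
    descriptor.join()      # demo only: wait so the answers are deterministic
    for item in questions:
        question_q.put(item)
    question_q.put(None)   # sentinel: no more questions
    qa.join()
    return answers
```

In this sketch the QA thread never touches the video itself; it only reads a snapshot of the text log, which is what keeps the memory lightweight. The demo waits for the descriptor to finish before asking questions purely to make the output reproducible; in a live setting a question would be answered from whatever the memory holds at that moment.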
