Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance

Xiangxiang Wang; Xuanyu Wang; YiJia Luo; Yongbin Yu; Manping Fan; Jingtao Zhang; Liyong Ren

arXiv:2508.18177 · cs.CV · January 21, 2026

TL;DR

This paper introduces a quantized, memory-efficient vision-language model and a multi-agent system that improve environmental perception for visually impaired individuals while cutting model memory from 38GB to 11.3GB.

Contribution

The paper presents a cross-modal differentiated quantization framework for VLMs and a scene-aware vectorized memory multi-agent system, targeting computational efficiency and integrated scene understanding in assistive technology.

Findings

01. Memory reduced from 38GB to 11.3GB with minimal performance loss.

02. Achieves 2.83 to 3.52 s latency for initial speech output.

03. Maintains high accuracy on OCR-VQA and MMBench benchmarks.
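
To put the memory figure from the findings in relative terms, the short calculation below derives the absolute saving, the percentage reduction, and the compression ratio. Only the two values 38GB and 11.3GB come from the paper; the variable names are illustrative.

```python
# Memory figures as reported in the findings (in GB).
baseline_gb = 38.0
quantized_gb = 11.3

# Absolute saving, relative reduction, and compression ratio.
saved_gb = baseline_gb - quantized_gb
reduction_pct = 100.0 * saved_gb / baseline_gb
compression_ratio = baseline_gb / quantized_gb

print(f"saved: {saved_gb:.1f} GB")                # saved: 26.7 GB
print(f"reduction: {reduction_pct:.1f}%")         # reduction: 70.3%
print(f"compression: {compression_ratio:.2f}x")   # compression: 3.36x
```

In other words, the quantized model needs roughly 30% of the original memory footprint.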

Abstract

Visually impaired individuals face significant challenges in environmental perception. Traditional assistive technologies often lack adaptive intelligence, focusing on individual components rather than integrated systems. While Vision-Language Models (VLMs) offer a promising path to richer, integrated understanding, their deployment is severely limited by substantial computational requirements, demanding dozens of gigabytes of memory. To address these gaps in computational efficiency and integrated design, this study proposes a dual technological innovation framework: a cross-modal differentiated quantization framework for VLMs and a scene-aware vectorized memory multi-agent system. The quantization framework implements differentiated strategies, reducing memory from 38GB to 11.3GB. The multi-agent system uses vectorized memory and perception-memory-reasoning workflows to provide…
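
The abstract does not spell out how the vectorized memory is queried. As a rough illustration only, the sketch below assumes scene descriptions are stored alongside embedding vectors and recalled by cosine similarity. The `SceneMemory` class, its methods, the three-dimensional toy embeddings, and the scene descriptions are all hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SceneMemory:
    """Hypothetical vector store: (embedding, description) pairs, recalled by similarity."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, description):
        self.entries.append((list(embedding), description))

    def recall(self, query, k=1):
        # Rank stored scenes by similarity to the query and return the top k descriptions.
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [description for _, description in ranked[:k]]


# Toy data: real embeddings would come from an encoder, which is assumed here.
memory = SceneMemory()
memory.add([1.0, 0.0, 0.0], "kitchen: kettle on the counter")
memory.add([0.0, 1.0, 0.0], "street: crosswalk ahead")
memory.add([0.0, 0.0, 1.0], "office: desk with laptop")

nearest = memory.recall([0.1, 0.9, 0.0], k=1)
print(nearest)  # ['street: crosswalk ahead']
```

In a perception-memory-reasoning workflow of the kind the abstract names, a recalled description like this could plausibly be passed to the reasoning step as context; how the paper actually wires these stages together is not stated in the text above.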
